161 Commits

Author SHA1 Message Date
Joseph Doherty a02c0ffe36 docs(code-reviews): record Admin-013 (SignalR hub clients cannot authenticate)
Records the post-review finding discovered during browser smoke-testing: the
Admin-003 hub hardening was incomplete — the server-side Blazor HubConnection
clients had no way to authenticate, so hub negotiate 401'd and four cluster
pages threw unhandled 500s. Logged as Admin-013 (High, Error handling &
resilience), Status Resolved, fixed by commits f254539 + 8d5dbb4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 12:29:36 -04:00
Joseph Doherty 8d5dbb46f2 fix(admin): authenticate SignalR hub clients with a bearer-token scheme
The Admin-003 fix gated every SignalR hub with [Authorize], but the server-side
Blazor HubConnection clients had no way to authenticate: the browser's HttpOnly
auth cookie is not reachable from the interactive circuit, so every hub negotiate
returned 401 and the Admin live-update feature was non-functional app-wide
(silently degraded on Hosts/ScriptLog, fatal on the cluster pages).

Introduce a token-based hub auth path:
- HubTokenService mints/validates short-lived tokens using ASP.NET Core Data
  Protection (the same primitive that protects the auth cookie — no signing-key
  management, no new packages). Tokens carry the user's name + roles.
- HubTokenAuthenticationHandler is a custom "HubToken" auth scheme that reads the
  token from the Authorization: Bearer header (negotiate) or the access_token
  query parameter (WebSocket upgrade).
- The "HubClients" authorization policy runs both the cookie and HubToken
  schemes; the hub endpoints use RequireAuthorization("HubClients").
- AdminHubConnectionFactory builds hub connections with an AccessTokenProvider
  that mints a fresh token for the circuit's authenticated user on every
  (re)connect. All six hub-consuming pages now resolve connections through it.

Hub negotiate now returns 200 and the WebSocket upgrades (101); live updates
work. The best-effort try/catch guards added previously are kept as defence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 12:06:29 -04:00
Joseph Doherty f2545392e0 fix(admin): stop SignalR hub-connect failure from crashing cluster pages
The Admin-003 fix gated every SignalR hub with [Authorize]/RequireAuthorization,
but the server-side HubConnection clients on ClusterDetail, AclsTab, RedundancyTab
and RoleGrants cannot forward the browser's HttpOnly auth cookie — so the hub
negotiate returns 401. Those four pages called HubConnection.StartAsync()
unguarded, so the 401 surfaced as an unhandled exception (a 500 page for the
prerendered ClusterDetail, a broken circuit for the others).

Wrap StartAsync/SendAsync in try/catch on all four, matching the established
best-effort pattern already used in Hosts.razor and ScriptLog.razor: the live
banner / live refresh degrades but the page renders. Restoring functional hub
live-updates needs a token-based hub auth scheme (cookie forwarding is not
viable across the prerender/interactive boundary) and is left as follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 11:56:06 -04:00
Joseph Doherty bbe292a4b4 docs(code-reviews): regenerate index — 126 Medium findings resolved
All Medium-severity code-review findings across the 29 reviewed modules
are now Resolved. The Pending findings table holds only Low-severity items.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 11:29:21 -04:00
Joseph Doherty 0f3b74ad87 fix(server): wire PermissionTrieCache into AuthorizationGate for generation pinning
Core-002 fixed TriePermissionEvaluator to evaluate each request against
the session's bound AuthGenerationId rather than whatever the cache
currently holds. AuthorizationGate.BuildSessionState was not updated at
the same time: it hardcoded AuthGenerationId = 0, so the evaluator's
GetTrie(cluster, 0) call returned null for any generation != 0, causing
every gated operation to silently fail with NotGranted regardless of
actual grants. The 42 gate/matrix/deferred-hardening tests all started
failing as a result.

Fix: add an optional PermissionTrieCache parameter to AuthorizationGate;
BuildSessionState now stamps AuthGenerationId from the cache's current
generation for the session's cluster. AuthorizationBootstrap.BuildGateAsync
passes the cache it creates. All 7 test MakeGate helpers updated to pass
the cache so tests produce a valid AuthGenerationId. 433/433 server tests
now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 11:25:39 -04:00
Joseph Doherty 7bf2dc49cf fix(driver-twincat): align status-mapper tests with corrected ADS codes (Driver.TwinCAT-011)
The Driver.TwinCAT-011 fix rewrote TwinCATStatusMapper with correct
numeric values from Beckhoff.TwinCAT.Ads 7.0.172 (e.g.
DeviceSymbolVersionInvalid = 1809 / 0x0711, not 1794 / 0x0702). Pre-existing
StatusMapper_covers_known_ads_error_codes InlineData cases were written
against the old wrong mappings and now fail;
StatusMapper_recognises_symbol_version_changed_code asserted the legacy
0x0702 constant. Update both test files to match the corrected mapper and
add a comment documenting the correction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 11:25:25 -04:00
Joseph Doherty e3371a4f68 docs(driver-opcuaclient): correct open-findings count to 2
Driver.OpcUaClient-006, -007, -008, -009, -010, -012, -013, -015 were
resolved in earlier commits; only -011 (Low) and -014 (Low) remain open.
Header was left at 3 after the Medium batch; correct to 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 11:25:14 -04:00
Joseph Doherty 5130563104 docs(server): update open findings count to 6 after Medium batch
Resolved Server-003, -005, -007, -010, -011, -013 in this batch;
Server-004, -006, -008, -012, -014, -015 remain open (all Low severity).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 11:03:51 -04:00
Joseph Doherty 2dd0bd4198 fix(server): resolve Medium code-review finding (Server-013)
Replace silent Enum.TryParse fallback to None with a ParseSecurityProfile
helper that emits a startup Log.Warning naming the unsupported value and
listing recognised profiles; operators now see the misconfiguration
before any client connects rather than getting an unexplained None posture.
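
Illustrative sketch only, not the committed helper: one way a parse-with-warning
fallback could look. The enum members, the `warn` callback, and the decision to
still fall back to None are assumptions made for the example.

```csharp
using System;

// Hypothetical profile set; the real enum lives in the server project.
public enum SecurityProfile { None, Basic256Sha256SignAndEncrypt }

public static class SecurityProfileParser
{
    // Returns the parsed profile, or None plus a warning naming the bad value
    // and the recognised profiles. 'warn' stands in for the startup logger.
    public static SecurityProfile Parse(string? raw, Action<string> warn)
    {
        if (Enum.TryParse<SecurityProfile>(raw, ignoreCase: true, out var profile)
            && Enum.IsDefined(profile))
        {
            return profile;
        }

        var known = string.Join(", ", Enum.GetNames<SecurityProfile>());
        warn($"Unsupported security profile '{raw}'. Recognised: {known}. Using None.");
        return SecurityProfile.None;
    }
}
```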

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 11:03:35 -04:00
Joseph Doherty a00f0338b5 fix(server): resolve Medium code-review finding (Server-011)
Advertise UserName token policy on any non-None security profile when
Ldap.Enabled; emit a startup LogWarning when Ldap.Enabled=true but
SecurityProfile=None so the misconfiguration is surfaced before clients
connect rather than silently producing no credential path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 11:01:43 -04:00
Joseph Doherty 6075254f38 fix(server): resolve Medium code-review finding (Server-010)
Default AutoAcceptUntrustedClientCertificates to false in both
OpcUaServerOptions and Program.cs config fallback, aligning with
docs/security.md; auto-accept is now explicitly opt-in for dev use only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 11:00:24 -04:00
Joseph Doherty fccb529d5f fix(server): resolve Medium code-review finding (Server-007)
Add configDbHealthy parameter to OpcUaApplicationHost; wire a
DbHealthCache (CanConnectAsync cached 10 s) in Program.cs so /healthz
reflects real config-DB reachability instead of the previous always-true
default; /healthz now returns 503 on a DB outage unless stale-config
cache is warm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 10:59:08 -04:00
Joseph Doherty 8e8199752f fix(server): resolve Medium code-review finding (Server-005)
Add _nodeManagerDisposed field; set it under Lock in Dispose before
detaching the alarm-service handler; check it in OnAlarmServiceTransition
under the same Lock so an in-flight transition cannot dispatch to a
ConditionSink whose DriverNodeManager is being concurrently disposed.
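
Illustrative sketch only: the disposed-flag pattern described above, reduced to
a stand-alone class. `TransitionRelay` and its sink delegate are hypothetical
names, and a plain object monitor is used here in place of whatever lock type
the node manager holds.

```csharp
using System;

public sealed class TransitionRelay : IDisposable
{
    private readonly object _gate = new();
    private readonly Action<string> _sink;
    private bool _disposed;

    public TransitionRelay(Action<string> sink) => _sink = sink;

    // Called by the (hypothetical) alarm service on each transition.
    public void OnTransition(string alarmId)
    {
        lock (_gate)
        {
            if (_disposed) return; // late events are dropped, never dispatched
            _sink(alarmId);
        }
    }

    public void Dispose()
    {
        // Same lock as OnTransition: once this returns, no dispatch is in flight
        // and none can start.
        lock (_gate) { _disposed = true; }
    }
}
```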

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 10:56:01 -04:00
Joseph Doherty 2003b343bf fix(server): resolve Medium code-review finding (Server-003)
Fix ReadRawAsync: correct XML doc from newest-first to oldest-first
(ascending source timestamp per OPC UA Part 11); move maxValuesPerNode
cap inside the time-window filter loop so paging limits apply to
in-window results only, not the whole buffer snapshot.
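
Illustrative sketch only: applying the cap inside the filter loop so that
out-of-window samples never consume it. `Sample`, `HistoryWindow`, the
half-open window, and treating `maxValues == 0` as "no limit" are assumptions
for the example, not the server's actual API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a buffered history sample.
public readonly record struct Sample(DateTime SourceTimestamp, double Value);

public static class HistoryWindow
{
    // Returns in-window samples oldest-first, capped at maxValues.
    // Assumes 'buffer' is already ordered by ascending source timestamp.
    public static List<Sample> ReadRaw(
        IReadOnlyList<Sample> buffer, DateTime start, DateTime end, int maxValues)
    {
        var result = new List<Sample>();
        foreach (var sample in buffer)
        {
            if (sample.SourceTimestamp < start || sample.SourceTimestamp >= end)
                continue; // skipped samples do not count toward the cap

            if (maxValues > 0 && result.Count >= maxValues)
                break;    // cap reached on in-window results only

            result.Add(sample);
        }
        return result;
    }
}
```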

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 10:54:08 -04:00
Joseph Doherty e774b6f88d docs(driver-twincat): update findings.md status fields and open count
Mark findings 003, 009, 010, 011, 012 Status: Resolved (status fields
were missing the update in earlier commits); reduce Open findings
count from 11 to 5 (Low findings 004, 006, 014, 015, 016 remain open).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 10:52:35 -04:00
Joseph Doherty 3f6b61133e fix(driver-twincat): resolve Medium code-review finding (Driver.TwinCAT-012)
GetMemoryFootprint now returns tagsByName * 256 + nativeSubs * 512 bytes
instead of a hard-coded 0; document that the stream-and-discard symbol
browse leaves no flushable cache so FlushOptionalCachesAsync is a
deliberate no-op.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 10:50:28 -04:00
Joseph Doherty 40b28e8820 fix(driver-twincat): resolve Medium code-review finding (Driver.TwinCAT-011)
Confirm AdsErrorCode values from Beckhoff.TwinCAT.Ads 7.0.172 and rewrite
MapAdsError with 20 explicit cases. Fix critical bug: AdsSymbolVersionChanged
was 0x0702 (DeviceInvalidGroup) but DeviceSymbolVersionInvalid is 1809
(0x0711); correct constant and all comments. Add BadOutOfService for
DeviceNotReady and BadInvalidState for DeviceInvalidState/PLC-in-Config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 10:49:38 -04:00
Joseph Doherty f7d6bd12b9 fix(driver-twincat): resolve Medium code-review finding (Driver.TwinCAT-010)
Replace yield break with cancellationToken.ThrowIfCancellationRequested()
in BrowseSymbolsAsync so a cancelled browse propagates as
OperationCanceledException instead of silently completing with a partial
symbol set.
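
Illustrative sketch only: why the swap matters in an async iterator. With
`yield break` the stream ends normally and the caller cannot tell a cancelled
browse from a complete one. `SymbolBrowse` and the in-memory symbol list are
hypothetical; the real driver streams from ADS.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class SymbolBrowse
{
    public static async IAsyncEnumerable<string> BrowseAsync(
        IEnumerable<string> symbols,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        foreach (var symbol in symbols)
        {
            // Previously: if (cancellationToken.IsCancellationRequested) yield break;
            // Now the consumer's await foreach observes OperationCanceledException.
            cancellationToken.ThrowIfCancellationRequested();

            await Task.Yield(); // placeholder for the real asynchronous read
            yield return symbol;
        }
    }
}
```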

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 10:44:16 -04:00
Joseph Doherty 98d8df4adf fix(driver-twincat): resolve Medium code-review finding (Driver.TwinCAT-009)
Swap _devices and _tagsByName to ConcurrentDictionary so ShutdownAsync
Clear() no longer races concurrent TryGetValue calls; store ProbeTask
on DeviceState and await it in ShutdownAsync before disposing the client
and gate, eliminating the probe-disposal race.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 10:43:35 -04:00
Joseph Doherty 40aa27b64b fix(driver-twincat): resolve Medium code-review finding (Driver.TwinCAT-005)
Inject optional ILogger<TwinCATDriver> (NullLogger default) and log
connect success/failure, ADS read errors, symbol-browse fallback,
native-notification registration failures, and host-state transitions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 10:42:08 -04:00
Joseph Doherty de43690e0f fix(driver-twincat): resolve Medium code-review finding (Driver.TwinCAT-003)
Reject Structure-typed pre-declared tags in BuildTag at config-parse time
with a clear InvalidOperationException; replaces the previous silent
garbage read (MapToClrType fell through to typeof(int)) and late
NotSupportedException on writes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 10:39:00 -04:00
Joseph Doherty a48b5396dc fix(driver-opcuaclient): resolve Medium code-review finding (Driver.OpcUaClient-015)
Add OpcUaClientMediumFindingsRegressionTests covering write-timeout status code
(009), Byte->UInt16 mapping (010), AutoAccept warning (012), GetMemoryFootprint/
FlushOptionalCachesAsync contract (013), and pre-init lifecycle guards (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:35:44 -04:00
Joseph Doherty 2df614c79e fix(driver-opcuaclient): resolve Medium code-review finding (Driver.OpcUaClient-010)
Map DataTypeIds.Byte to DriverDataType.UInt16 (unsigned family) rather than Int16
(signed family). Update attribute mapping test to assert the correct unsigned mapping
and add Byte/UInt16 to the standard-types theory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:35:37 -04:00
Joseph Doherty 412c4bbd40 fix(driver-opcuaclient): resolve Medium code-review finding (Driver.OpcUaClient-006)
Route all Session mutations through _probeLock so OnReconnectComplete, ShutdownAsync,
and OnKeepAlive cannot race each other when swapping or clearing the active session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:35:11 -04:00
Joseph Doherty 8ceb10d861 Merge branch 'worktree-agent-adfb71e38279b8f48' into feat/scripted-alarm-shelve-routing 2026-05-22 10:22:56 -04:00
Joseph Doherty 607413e19f docs(code-reviews): update Driver.S7 and Driver.S7.Cli findings status
Mark Driver.S7-002, -004, -008, -012, -014 and Driver.S7.Cli-001, -002, -003
as Resolved; update Open findings counts (Driver.S7: 10→5, Driver.S7.Cli: 7→4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:17:48 -04:00
Joseph Doherty 26e7b8140a fix(driver-s7-cli): resolve Medium code-review finding (Driver.S7.Cli-003)
Wrap the InitializeAsync + ReadAsync body in a try/catch so an unreachable PLC
(refused TCP connect, wrong slot) still prints the structured Host:/CPU:/Health:/
Last error: report from driver.GetHealth() instead of crashing with a stack trace.
OperationCanceledException re-throws so Ctrl+C during connect exits cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:17:41 -04:00
Joseph Doherty 086f487786 fix(driver-s7-cli): resolve Medium code-review finding (Driver.S7.Cli-002)
Trim the --type help text on read and subscribe to the implemented set
(Bool/Byte/Int16/UInt16/Int32/UInt32/Float32) and append a one-line caveat that
Int64, UInt64, Float64, String, and DateTime are not yet implemented and will
return BadNotSupported — so the CLI does not advertise options that cannot succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:17:33 -04:00
Joseph Doherty 01a6b6d859 fix(driver-s7-cli): resolve Medium code-review finding (Driver.S7.Cli-001)
Wrap all numeric/DateTime BCL parses in ParseValue with try/catch(FormatException)
and try/catch(OverflowException) that re-throw as CommandException, matching the
existing Bool path. Update ParseValue_non_numeric_for_numeric_types_throws to assert
CommandException (not FormatException), and add an overflow-edge test (Byte value 256).
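
Illustrative sketch only: the wrap-and-rethrow shape for one numeric type. The
`CommandException` class below is a local stand-in for the CLI framework's
user-facing exception, and `ValueParser.ParseByte` is a hypothetical name.

```csharp
using System;
using System.Globalization;

// Stand-in for the CLI's user-facing error type (assumption).
public sealed class CommandException : Exception
{
    public CommandException(string message, Exception inner) : base(message, inner) { }
}

public static class ValueParser
{
    public static byte ParseByte(string raw)
    {
        try
        {
            return byte.Parse(raw, CultureInfo.InvariantCulture);
        }
        catch (FormatException ex)
        {
            throw new CommandException($"'{raw}' is not a valid Byte value.", ex);
        }
        catch (OverflowException ex)
        {
            // e.g. "256", the overflow edge covered by the new test
            throw new CommandException($"'{raw}' is out of range for Byte (0-255).", ex);
        }
    }
}
```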

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:17:25 -04:00
Joseph Doherty aeb5fc48ae test(driver-s7): resolve Medium code-review finding (Driver.S7-014)
Add S7TypeMappingTests.cs covering ReinterpretRawValue and BoxValueForWrite —
26 tests verifying every implemented type round-trip (Bool/Byte/UInt16/Int16/
UInt32/Int32/Float32), two's-complement reinterpret semantics (ushort→short,
uint→int), unsupported-type NotSupportedException, and overflow edge cases.
These methods were factored out as internal static in the S7-002/S7-008 commit.
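
Illustrative sketch only: the two's-complement reinterpret semantics the tests
pin down, written as stand-alone helpers. `Reinterpret` is a hypothetical
class, not the driver's ReinterpretRawValue.

```csharp
public static class Reinterpret
{
    // Same 16 bits, read as signed: 0xFFFF becomes -1, 0x8000 becomes -32768.
    public static short ToInt16(ushort raw) => unchecked((short)raw);

    // Same 32 bits, read as signed: 0xFFFFFFFF becomes -1.
    public static int ToInt32(uint raw) => unchecked((int)raw);
}
```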

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:17:15 -04:00
Joseph Doherty 909490622d fix(driver-s7): resolve Medium code-review findings (Driver.S7-002, S7-004, S7-008)
S7-002: add inline comment documenting the UInt32→Int32 lossiness in MapDataType,
consistent with the Int64/UInt64 note. Tracked for a follow-up that adds unsigned
DriverDataType members.

S7-004: inject ILogger<S7Driver> (optional, defaults to NullLogger); add structured
log calls for connect success/failure, probe Running/Stopped transitions, and
swallowed poll-loop exceptions, so operators have an event trail via Serilog.

S7-008: restructure WriteAsync catch ladder to mirror ReadAsync — OperationCanceledException
re-throws, NotSupportedException → BadNotSupported, PUT/GET-disabled PlcException →
BadNotSupported/Faulted, genuine PlcException → BadDeviceFailure/Degraded, all
others → BadCommunicationError/Degraded. Health is now updated on every write failure.
Also factor ReadOneAsync reinterpret into internal ReinterpretRawValue and
WriteOneAsync boxing into internal BoxValueForWrite for testability (Driver.S7-014).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:17:08 -04:00
Joseph Doherty b827b0c0a2 fix(driver-s7): resolve Medium code-review finding (Driver.S7-012)
Remove the dead ProbeAddress config surface from S7ProbeOptions and the factory
DTO. ProbeLoopAsync uses Plc.ReadStatusAsync (CPU-status PDU), not a tag-address
read — ProbeAddress was never consumed. The XML doc on Probe is corrected to
describe the ReadStatusAsync-based probe. Existing configs that set probeAddress
are silently ignored by the JSON deserializer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:16:54 -04:00
Joseph Doherty 19a2a81321 fix(driver-modbus-addressing): resolve Medium code-review finding (Driver.Modbus.Addressing-003)
Complete the incomplete Addressing-003 fix: TryParseByteOrder now produces a
diagnostic mentioning "field 2" when a known type-code token (e.g. BOOL) is
supplied in the byte-order slot, so the user is guided to the correct field.
The previous fix only wired the message in the else-branch, which was unreachable
because LooksLikeByteOrderToken(BOOL) returned true first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:16:30 -04:00
Joseph Doherty dd1742e319 fix(driver-modbus-cli): resolve Medium code-review finding (Driver.Modbus.Cli-002)
Reject --region Coils combined with any non-boolean --type with a CommandException
that names the constraint: coils carry a single bit, so only --type Bool is valid.
Without this check a write like "--region Coils --type UInt16 --value 42" would
silently coerce to a coil ON with no diagnostic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:55:16 -04:00
Joseph Doherty 63e4a6baab fix(driver-modbus-cli): resolve Medium code-review finding (Driver.Modbus.Cli-001)
Add --bit-index, --string-length, and --string-byte-order options to
SubscribeCommand, mirroring ReadCommand, and pass them into ModbusTagDefinition
so that BitInRegister and String type subscriptions use the correct bit index and
string length rather than silently defaulting to bit-0 / zero-length.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:55:10 -04:00
Joseph Doherty e7d7b6cb1d fix(driver-modbus-addressing): resolve Medium code-review finding (Driver.Modbus.Addressing-008)
Add ModbusAddressEdgeCaseTests.cs covering the overflow/boundary gaps: empty
trailing parser field (finding -002 regression), multi-dot input, UserVMemoryToPdu
and AddOctalOffset overflow, SystemVMemoryToPdu base+overflow, MelsecAddress
ParseHex overflow, and DRegisterToHolding/MRelayToCoil bank-base overflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:53:12 -04:00
Joseph Doherty ba52c179fd fix(driver-modbus-addressing): resolve Medium code-review finding (Driver.Modbus.Addressing-002)
Reject an empty 3rd field in the address parser by checking parts[2].Length > 0
before the All(char.IsDigit) guard, so a trailing-colon typo like "40001:F:"
produces a diagnostic instead of silently parsing as a scalar.
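
Illustrative sketch only: the guard order matters because Enumerable.All
returns true for an empty sequence, so an empty field passes a digits-only
check on its own. `CountField.IsValidCount` is a hypothetical helper, not the
parser's code.

```csharp
using System.Linq;

public static class CountField
{
    // Digits-only check without a length guard: true for "" (vacuous truth).
    public static bool DigitsOnly(string field) => field.All(char.IsDigit);

    // Length guard first, so "" is rejected before All() is consulted.
    public static bool IsValidCount(string field) =>
        field.Length > 0 && field.All(char.IsDigit);
}
```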

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:52:52 -04:00
Joseph Doherty ebfd5d7871 fix(driver-galaxy): fix XML doc comment cref in StatusCodeMap.ToQualityCategoryByte
StatusCode is not a .NET type reference in this assembly — replace the unresolvable
<see cref="StatusCode"/> with prose text so TreatWarningsAsErrors does not fail the
build on the CS1574 unresolved-cref warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:51:17 -04:00
Joseph Doherty 7a7defb59b fix(driver-galaxy): resolve Medium code-review finding (Driver.Galaxy-014)
Add GalaxyDriverInfrastructureTests covering the two gaps identified in this finding
that are not yet tracked by a dedicated test file: GetMemoryFootprint returns a live
registry-derived estimate (Driver.Galaxy-011) and DisposeAsync completes without
deadlock (Driver.Galaxy-007). The remaining items listed in the finding are covered
by earlier resolution commits: stream-fault → recovery → OnDataChange resumes
(EventPumpStreamFaultTests), post-reconnect Rebind (SubscriptionRegistryTests),
StatusCodeMap.FromMxStatus success/failure semantics (StatusCodeMapTests), and
DataTypeMap all seven codes (DataTypeMapTests). Update findings.md header to 4 open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:49:51 -04:00
Joseph Doherty ecc91b0e48 fix(driver-galaxy): resolve Medium code-review finding (Driver.Galaxy-011)
GetMemoryFootprint() returned a constant 0 with a stale "PR 4.4 sets this" comment
even though PR 4.4 shipped the SubscriptionRegistry. Replace with a live estimate:
64 bytes × TrackedItemHandleCount + 256 bytes × TrackedSubscriptionCount. A 50k-tag
set now registers ~3 MB with the server's cache-flush heuristic instead of being
invisible. Returns 0 when no subscriptions are active.
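
Illustrative sketch only: the arithmetic behind the estimate, using the
per-item and per-subscription weights quoted above. `FootprintEstimate` is a
hypothetical helper, and the single subscription in the example is an
assumption.

```csharp
public static class FootprintEstimate
{
    public const long BytesPerItemHandle = 64;
    public const long BytesPerSubscription = 256;

    public static long Estimate(int trackedItemHandles, int trackedSubscriptions) =>
        BytesPerItemHandle * trackedItemHandles
        + BytesPerSubscription * trackedSubscriptions;
}

// Estimate(50_000, 1) is 3,200,256 bytes, roughly 3.2 MB.
// Estimate(0, 0) is 0, matching the no-subscriptions case.
```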

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:48:37 -04:00
Joseph Doherty c5f2d91bcb fix(driver-modbus): resolve Medium code-review finding (Driver.Modbus-002)
Clear _tagsByName, _lastPublishedByRef, and _lastWrittenByRef in ShutdownAsync
(via the new shared TeardownAsync helper) so a ReinitializeAsync cycle starts
from a clean state, consistent with the existing _autoProhibited.Clear().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:48:09 -04:00
Joseph Doherty 0f3de4d510 fix(driver-galaxy): resolve Medium code-review finding (Driver.Galaxy-009)
Fix two resource-management bugs in StartDeployWatcher / BuildDefaultHierarchySource:
(a) Replace the discarded `_ = StartAsync(...)` with an explicit task variable that
    surfaces any synchronous InvalidOperationException (called-twice guard) rather than
    silently swallowing it.
(b) Change both StartDeployWatcher and BuildDefaultHierarchySource to use ??= on
    _ownedRepositoryClient so the first client created (by whichever path runs first)
    is reused by the second path, preventing a second GalaxyRepositoryClient from being
    created and the first from leaking past the driver's lifetime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:47:52 -04:00
Joseph Doherty d572a011ef fix(driver-galaxy): resolve Medium code-review finding (Driver.Galaxy-007)
Implement IAsyncDisposable on GalaxyDriver so async sub-component disposals
(EventPump, AlarmFeed, MxSession, MxClient, RepositoryClient) are awaited rather
than blocked on GetAwaiter().GetResult(). DisposeAsync is now the primary path;
Dispose() delegates to it for using-statement compatibility. Each async component's
shutdown is awaited individually with a best-effort catch so a single slow shutdown
cannot prevent the rest of the cleanup sequence from running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:47:00 -04:00
Joseph Doherty d14564839e fix(driver-galaxy): resolve Medium code-review finding (Driver.Galaxy-006)
HashSet<T>.First() enumeration order is unspecified and unstable across mutations, so
the "owner" handle attached to alarm events was non-deterministic when multiple alarm
subscriptions were active. Change _alarmSubscriptions from HashSet to List (preserving
insertion order) and pick [0] — the earliest-registered handle — as the deterministic
owner. The server routes transitions by SourceNodeId, not by handle, so the choice of
handle does not affect delivery to active subscribers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:45:55 -04:00
Joseph Doherty 910a538b19 fix(driver-galaxy): resolve Medium code-review finding (Driver.Galaxy-004)
Add StatusCodeMap.ToQualityCategoryByte(uint) so the StatusCode → quality-byte
mapping lives in one place next to its inverse (FromQualityByte). GalaxyDriver
OnPumpDataChange now delegates to the helper instead of duplicating the shift+switch
inline; a future edit to the OPC UA bit layout cannot silently desync the probe-health
decode. Unit tests in StatusCodeMapTests pin all three category buckets and the
round-trip invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:43:53 -04:00
Joseph Doherty 39a02f6794 fix(driver-galaxy): resolve Medium code-review finding (Driver.Galaxy-003)
StatusCodeMap.FromMxStatus checked `success != 0` to determine success, but the
mxaccessgw proto contract explicitly documents that `success` is not a boolean and
that clients must branch on `category` (MX_STATUS_CATEGORY_OK), not on `success`
alone. Replace the raw field check with `status.IsSuccess()` from
MxStatusProxyExtensions, which requires both `success != 0` AND `category == Ok`.
A worker reporting success=1 with a non-OK category was previously misreported as
Good. Updated StatusCodeMapTests with a regression case covering the inverted scenario.
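
Illustrative sketch only: the difference between the two checks. The `MxStatus`
record, the category enum, and the extension class are local stand-ins for the
generated proto types and MxStatusProxyExtensions; member names and values are
assumptions.

```csharp
// Hypothetical stand-ins for the proto-generated status types.
public enum MxStatusCategory { Unknown = 0, Ok = 1, CommunicationError = 2 }

public readonly record struct MxStatus(int Success, MxStatusCategory Category);

public static class MxStatusChecks
{
    // Old behaviour: trusts the raw field alone.
    public static bool RawSuccess(this MxStatus status) => status.Success != 0;

    // New behaviour: both the field and the category must agree.
    public static bool IsSuccess(this MxStatus status) =>
        status.Success != 0 && status.Category == MxStatusCategory.Ok;
}
```

With these stand-ins, a status of success=1 and a non-OK category passes
RawSuccess but fails IsSuccess, which is the inverted scenario the regression
case covers.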

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:42:47 -04:00
Joseph Doherty f9dccaa732 Merge branch 'worktree-agent-ad34cad856c59bbc1' into feat/scripted-alarm-shelve-routing 2026-05-22 09:40:46 -04:00
Joseph Doherty f920de9878 Merge branch 'worktree-agent-af51f33c034e99fd4' into feat/scripted-alarm-shelve-routing 2026-05-22 09:40:46 -04:00
Joseph Doherty b21585767b Merge branch 'worktree-agent-aaf0e64363ca270b1' into feat/scripted-alarm-shelve-routing 2026-05-22 09:40:45 -04:00
Joseph Doherty ee5d7ad51e fix(driver-ablegacy): fix CS9124 build error and update stale status-mapper test
EffectiveCipPath now references ParsedAddress/Profile properties instead
of the captured primary-constructor parameters to avoid CS9124 (param
captured into enclosing type AND used to init a member).

NonZero_libplctag_status_maps_via_AbLegacyStatusMapper updated to pass
(int)Status.ErrorNotFound rather than the stale magic integer -14 that
the old mapper happened to handle but the new enum-based mapper does not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:33:19 -04:00
Joseph Doherty 1a1b3df098 fix(driver-abcip): actually remove PlcTagHandle.cs from the git index
The file was physically deleted and unstaged in the Driver.AbCip-006
commit but the git rm was not included. Committed separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:31:56 -04:00
Joseph Doherty 1158b80c41 docs(driver-abcip): update findings.md resolutions for 005 and 014
Clarify Driver.AbCip-005 resolution: parent Structure tag stays in
_tagsByName (needed by whole-UDT planner + alarm projection); the fix
is in ReadSingleAsync returning BadNotSupported for direct reads.
Update Driver.AbCip-014 resolution text to match the actual test names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:31:18 -04:00
Joseph Doherty 75163f703d docs(driver-ablegacy): update findings.md open count to 3
8 Medium findings resolved (-002 through -012); 3 Low findings remain
open (-005, -011, -013).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:30:58 -04:00
Joseph Doherty 17432bb1a4 fix(driver-abcip): correct Driver.AbCip-005 approach and fix 014 tests
Finding 005 revised approach: keep the parent Structure tag in
`_tagsByName` so the whole-UDT grouping planner can find it (required
for Driver.AbCip-003 opt-in path + alarm projection). Instead, detect a
direct read of a Structure-with-Members in `ReadSingleAsync` and return
`BadNotSupported` rather than Good/null — explicitly documenting the
contract that callers must address member paths. Duplicate-key checks
(scalar and member fan-out) remain.

Finding 014 test corrections: `Structure_parent_tag_read_returns_BadNotSupported`
now asserts the new contract. `Read_UDInt_tag_returns_uint_value_not_negative_wrapped_int`
assertion fixed to use `ShouldBeOfType<uint>()` instead of
`ShouldNotBe(-1)` (Shouldly overflows comparing uint.MaxValue with int).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:30:54 -04:00
Joseph Doherty e3648adcea fix(driver-ablegacy): resolve Medium code-review finding (Driver.AbLegacy-012)
Consume previously-dead AbLegacyPlcFamilyProfile fields:

- DeviceState.EffectiveCipPath applies DefaultCipPath when the parsed host
  address has an empty CIP path (SLC 500 / PLC-5 misconfigured without /1,0
  now gets the profile-supplied default route). All three tag/parent/probe
  Create() callers updated.
- InitializeAsync validates each tag's DataType against SupportsLongFile /
  SupportsStringFile and throws InvalidOperationException at init time so a
  MicroLogix Long tag or similar fails early rather than at runtime with an
  opaque comms error.
- MaxTagBytes tracked as a follow-up (string/array chunking requires broader
  design work).

Tests added for CipPath fallback and Long/String type validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:30:42 -04:00
Joseph Doherty cec7ab6ec4 fix(alarm-historian): resolve Medium code-review finding (Core.AlarmHistorian-010)
The test suite lacked coverage for four critical paths: corrupt/null-
deserializing PayloadJson rows, StartDrainLoop timer behavior and backoff
honoring, concurrent EnqueueAsync+DrainOnceAsync stress, and the
outcomes.Count != events.Count cardinality-mismatch branch.

Added tests covering all four gaps (committed across companion findings):
- Drain_with_corrupt_payload_row_deadletters_it_and_keeps_good_rows_aligned
- Drain_with_corrupt_head_row_does_not_stall_queue
- StartDrainLoop_honors_backoff_and_slows_cadence_under_retry
- StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy
- StartDrainLoop_records_drain_fault_and_keeps_running
- Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy
- Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows
- Capacity_eviction_increments_evicted_count_on_status
- GetStatus_snapshot_is_consistent_under_concurrent_drain

Updated Open findings count to 2 (Core.AlarmHistorian-008 + -011, both Low).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:30:17 -04:00
Joseph Doherty 228ad42ad7 fix(driver-ablegacy): resolve Medium code-review finding (Driver.AbLegacy-010)
MapLibplctagStatus now casts the int to libplctag.Status and switches on
named enum members (mirroring AbCipStatusMapper) instead of unverified
magic integers. A strongly-typed Status overload is the canonical path;
the int overload delegates to it. MapPcccStatus is retained with a comment
marking it as the reference mapping for future PCCC-STS inspection.
Tests updated to use Status enum members rather than raw integers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:28:27 -04:00
Joseph Doherty f6d487b167 fix(driver-historian-wonderware-client): suppress xUnit1051 false-positive in ContractsWireParityTests
Add #pragma warning disable xUnit1051 at the top of ContractsWireParityTests.cs.
The xUnit1051 analyser fires on MessagePack's Serialize/Deserialize overloads that
have an optional CancellationToken parameter; these are synchronous parity tests
where the token is not meaningful — the suppression is scoped to this file only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:28:20 -04:00
Joseph Doherty 5718cb5778 fix(alarm-historian): resolve Medium code-review finding (Core.AlarmHistorian-007)
When WriteBatchAsync returned a wrong-cardinality outcome list, DrainOnceAsync
threw InvalidOperationException after potential delivery — causing duplicate
events on re-drain or permanent queue stall on a deterministic writer bug.

- The throw is replaced with log + backoff: the mismatch is recorded into
  _lastError, _drainState is set to BackingOff, the backoff is bumped, and the
  method returns without applying any outcomes, mirroring the writer-exception path.
- Regression test Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows
  asserts rows stay queued, DrainState = BackingOff, LastError populated, and
  that a fixed writer subsequently drains cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:27:55 -04:00
Joseph Doherty 5bf4be7ca9 fix(driver-focas): resolve Medium code-review finding (Driver.FOCAS-012)
Add FocasDriverMediumFindingsTests.cs with regression coverage for the
five Medium findings:

- 003: InitializeAsync throws when tag's DeviceHostAddress is absent
  from Devices (two variants: typo host, wrong port; also happy path)
- 004: DiscoverAsync emits ViewOnly for tags with Writable:true
- 005: GetHealth() is consistent after ten concurrent ReadAsync calls
- 006: Read recovers after the client is externally disposed, creating
  a fresh client rather than wedging with BadCommunicationError
- 012: Factory full-round-trip with all three opt-in config sections
  (FixedTree + AlarmProjection + HandleRecycle) with all subfields

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:27:40 -04:00
Joseph Doherty 6d520c6756 fix(alarm-historian): resolve Medium code-review finding (Core.AlarmHistorian-005)
Status fields (_lastDrainUtc, _lastSuccessUtc, _lastError, _drainState,
_evictedCount) were written by the drain timer thread and read by
GetStatus() / health-check threads with no memory barrier, risking torn
DateTime? reads and stale DrainState observations.

- Added _statusLock object; all writes to status fields now happen inside
  lock(_statusLock) blocks in DrainOnceAsync and DrainTimerCallback.
- GetStatus() snapshots all fields atomically under the same lock so the
  Admin UI / /healthz endpoint always sees a consistent view.
- Regression test GetStatus_snapshot_is_consistent_under_concurrent_drain
  drives status writes and reads from concurrent threads; asserts no throws.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:27:31 -04:00
Joseph Doherty 0003f76301 fix(scripting): correct System.Threading.Thread enforcement in analyzer
System.Threading.Thread is in the System.Threading namespace (not
System.Threading.Thread), so the existing ForbiddenNamespacePrefixes
entry "System.Threading.Thread" never matched — the namespace prefix
check compared against the type's containing namespace, which is
System.Threading. Move Thread into ForbiddenFullTypeNames (alongside
Environment / AppDomain / GC / Activator) where it is matched by exact
fully-qualified type name, which actually fires. Remove the dead
namespace-prefix entry and document why. The Rejects_Thread_new_at_compile
test now passes. (Core.Scripting-010.)
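
A minimal Python sketch of why the prefix entry was dead, assuming the analyzer
compares each deny-list prefix against the type's containing namespace as
described above. The function names and single-entry lists are illustrative
stand-ins, not the analyzer's actual API, and nested types are ignored.

```python
# Hypothetical stand-ins for the analyzer's two deny lists.
FORBIDDEN_NAMESPACE_PREFIXES = ["System.Threading.Thread"]  # the dead entry
FORBIDDEN_FULL_TYPE_NAMES = ["System.Threading.Thread"]     # the fix


def containing_namespace(full_type_name: str) -> str:
    """Drop the final segment: 'System.Threading.Thread' -> 'System.Threading'."""
    return full_type_name.rpartition(".")[0]


def rejected_by_namespace_prefix(full_type_name: str) -> bool:
    # The prefix is compared against the namespace, never the type name itself.
    namespace = containing_namespace(full_type_name)
    return any(namespace.startswith(p) for p in FORBIDDEN_NAMESPACE_PREFIXES)


def rejected_by_full_type_name(full_type_name: str) -> bool:
    # Exact match on the fully-qualified type name.
    return full_type_name in FORBIDDEN_FULL_TYPE_NAMES


thread_type = "System.Threading.Thread"
print(rejected_by_namespace_prefix(thread_type))  # False
print(rejected_by_full_type_name(thread_type))    # True
```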

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:27:15 -04:00
Joseph Doherty 807ea11187 fix(driver-focas): resolve Medium code-review finding (Driver.FOCAS-006)
EnsureConnectedAsync now disposes and nulls any existing non-connected
client before creating a fresh one via _clientFactory.Create().

Previously the method reused a cached client via ConnectAsync, but a
client disposed by a HandleRecycle race or prior teardown would hit
FocasWireClient.ThrowIfDisposed on every subsequent call, leaving the
device permanently wedged with BadCommunicationError and no recovery
path until ReinitializeAsync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:27:08 -04:00
Joseph Doherty 1c6db86631 fix(driver-historian-wonderware-client): resolve Medium code-review finding (Driver.Historian.Wonderware.Client-009)
Add six previously-missing edge-case tests to WonderwareHistorianClientTests:
(2) WriteBatchAsync transport-drop catch path returns RetryPlease for all events;
(3) InvokeAsync second-attempt-also-fails propagates the exception;
(4) stalled sidecar fires OperationCanceledException within CallTimeout;
(5) HistoryAggregateType.Total throws NotSupportedException via ReadProcessedAsync;
(6) sidecar wrong-MessageKind reply throws InvalidDataException.

Extend FakeSidecarServer with DisconnectBeforeReply, ReplyWithWrongKind, and
StallAfterRequest test knobs to support these scenarios.

Add ContractsWireParityTests.cs (11 tests) to pin the MessagePack byte layout,
round-trip correctness, MessageKind enum values, and Framing constants — catching
silent [Key] index drift between the client and sidecar mirror copies without
requiring a cross-TFM (net10 vs net48) project reference.

Test count grew from 11 to 27; all 27 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:26:56 -04:00
Joseph Doherty d412352b41 fix(driver-focas): resolve Medium code-review finding (Driver.FOCAS-005)
Guard all _health field accesses with Volatile.Read / Volatile.Write.
ReadAsync, WriteAsync, and ProbeLoopAsync run on different threads and
several updates are read-modify-write (new DriverHealth(_, _health.X, _)).
Without volatile semantics a concurrent update can be lost or a stale
LastSuccessfulRead timestamp propagated.  DriverHealth is an immutable
record so Volatile is sufficient — no lock needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:26:46 -04:00
Joseph Doherty 03c2028669 docs(driver-abcip): update findings.md open count after Medium resolutions
6 Medium findings resolved (004, 005, 006, 009, 010, 014); open count
updated from 11 to 5 (all remaining Low severity: 007, 011, 012, 013, 015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:26:44 -04:00
Joseph Doherty 54d51a1d20 fix(driver-ablegacy): resolve Medium code-review finding (Driver.AbLegacy-009)
InitializeAsync catch block now mirrors ShutdownAsync teardown: cancels
and disposes probe CancellationTokenSources, calls DisposeRuntimes, and
clears _devices/_tagsByName before rethrowing. A caller that catches and
abandons (rather than retrying via ReinitializeAsync) no longer leaves
orphaned probe tasks or libplctag handles alive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:26:42 -04:00
Joseph Doherty 9008c6e7aa fix(driver-abcip): resolve Medium code-review finding (Driver.AbCip-014)
Add regression tests for the Medium findings resolved in this series:
- AbCipDataType_maps_large_integer_types (theory: LInt→Int64, ULInt→UInt64,
  UDInt→UInt32) and Read_UDInt_tag_returns_uint_value_not_negative_wrapped_int
  cover Driver.AbCip-004.
- Structure_parent_tag_is_not_readable_after_member_fan_out,
  InitializeAsync_throws_on_duplicate_tag_name, and
  InitializeAsync_throws_when_member_name_collides_with_independent_tag
  cover Driver.AbCip-005.
- Read_failure_evicts_runtime_so_next_read_creates_fresh_handle covers
  Driver.AbCip-010.
AbCipDriverTests.AbCipDataType_maps_atomics_to_driver_types extended with
LInt/ULInt/UDInt assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:26:31 -04:00
Joseph Doherty f23cea201d fix(driver-focas): resolve Medium code-review finding (Driver.FOCAS-004)
DiscoverAsync now unconditionally emits SecurityClassification.ViewOnly
for every user-authored FOCAS tag.  Previously the SecurityClass was
tag.Writable ? Operate : ViewOnly, but WireFocasClient.WriteAsync always
returns BadNotWritable — advertising Operate misleads OPC UA clients
and the DriverNodeManager ACL layer into granting write permission on
nodes that can never be written.

Updated FocasCapabilityTests.DiscoverAsync_emits_pre_declared_tags to
assert ViewOnly for the writable-by-config tag so it matches the
corrected behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:26:24 -04:00
Joseph Doherty 60ffcfe8bd fix(driver-ablegacy): resolve Medium code-review finding (Driver.AbLegacy-008)
Mark _health volatile. The record-reference assignment is atomic, but
without an acquire/release memory barrier GetHealth() on another thread
can observe a stale snapshot indefinitely. volatile enforces the barrier
at read and write sites without a lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:26:08 -04:00
Joseph Doherty 5b5e9ad83b fix(driver-focas): resolve Medium code-review finding (Driver.FOCAS-003)
Throw InvalidOperationException at InitializeAsync when a tag's
DeviceHostAddress does not match any entry in the Devices list, naming
both the tag and the unresolved host.  Previously the missing-device
check was guarded by a TryGetValue so a typo silently bypassed
capability-matrix validation and deferred the error to per-read
BadNodeIdUnknown — the opposite of the documented "fail at load" goal.

Also resolves findings 004, 005, and 006 in the same file:
- 004: DiscoverAsync now unconditionally emits ViewOnly for all user
  tags; the Writable config field no longer influences security class
  because the wire backend always returns BadNotWritable.
- 005: All _health reads use Volatile.Read and all writes use
  Volatile.Write so concurrent readers observe a consistent reference
  and read-modify-write sequences capture a stable snapshot.
- 006: EnsureConnectedAsync disposes and nulls any existing
  non-connected client before creating a fresh one, preventing
  ObjectDisposedException loops after a HandleRecycle race or teardown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:25:57 -04:00
Joseph Doherty 7661d1b5dc fix(driver-ablegacy): resolve Medium code-review finding (Driver.AbLegacy-007)
Runtimes and ParentRuntimes changed from Dictionary to ConcurrentDictionary.
EnsureTagRuntimeAsync and EnsureParentRuntimeAsync now use a per-key
GetCreationLock semaphore with a double-checked pattern: fast-path read
requires no lock; slow-path create+initialize+store is serialised per key
so a concurrent caller waits rather than creating a duplicate runtime that
would be leaked when DisposeRuntimes runs.
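
The double-checked, per-key creation pattern can be sketched roughly as below.
This is a Python illustration using threading locks rather than the driver's
async semaphores; RuntimeCache, ensure, and the factory callback are
hypothetical names, not the driver's types.

```python
import threading


class RuntimeCache:
    """Creates at most one runtime per key, even under concurrent callers."""

    def __init__(self, factory):
        self._factory = factory              # hypothetical create+initialize step
        self._runtimes = {}
        self._locks = {}
        self._locks_guard = threading.Lock()

    def _creation_lock(self, key):
        # One lock per key, so creating different tags does not serialise.
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def ensure(self, key):
        existing = self._runtimes.get(key)   # fast path: no lock taken
        if existing is not None:
            return existing
        with self._creation_lock(key):
            existing = self._runtimes.get(key)  # second check, under the lock
            if existing is None:
                existing = self._factory(key)
                self._runtimes[key] = existing
            return existing
```

A caller that loses the race blocks on the per-key lock and then finds the
stored runtime on the second check, so no duplicate is created.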

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:25:35 -04:00
Joseph Doherty 72728a5d45 fix(driver-abcip): resolve Medium code-review finding (Driver.AbCip-010)
Add `EvictRuntime` helper that removes + disposes a stale
`ConcurrentDictionary` entry. Call it from `ReadSingleAsync`,
`ReadGroupAsync`, and `WriteAsync` on non-zero libplctag status and
transport exceptions so the next call for the same tag re-creates a
fresh handle — mirroring the probe loop's recreate-on-failure pattern.
Value-conversion exceptions (NotSupportedException, FormatException,
InvalidCastException, OverflowException) are not transport faults and
do not evict the handle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:24:45 -04:00
Joseph Doherty 0d10d30b7d fix(driver-historian-wonderware): update findings.md open count after resolving -002 -003 -006 -009
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:24:36 -04:00
Joseph Doherty 1723f5d5cd fix(driver-historian-wonderware): resolve Medium code-review finding (Driver.Historian.Wonderware-009)
Apply _config.MaxValuesPerRead as a bucket cap in ReadAggregateAsync,
mirroring the existing cap in ReadRawAsync. Without this guard a processed
read over a wide time range with a small IntervalMs could accumulate an
unbounded HistorianAggregateSample list; if the serialised reply exceeded
the 16 MiB FrameWriter frame cap WriteAsync would throw and the client
correlation-id wait would hang. Truncation now logs a Warning with a hint
to widen IntervalMs or reduce the time range.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:24:25 -04:00
Joseph Doherty 47eac2d84f fix(driver-ablegacy): resolve Medium code-review finding (Driver.AbLegacy-004)
DecodeValue for Bit with no bitIndex now reads the full 16-bit word via
GetInt16(0) and tests bit 0 instead of GetInt8(0), which only covered the
low byte and silently misread any bit in positions 8..15. The comment
explains the two decode paths (suffix-present vs suffix-absent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:24:19 -04:00
Joseph Doherty 7474631992 fix(driver-historian-wonderware): resolve Medium code-review finding (Driver.Historian.Wonderware-006)
Add exponential backoff (250 ms → 500 ms → 1 s → 2 s → 4 s → 8 s cap) to
PipeServer.RunAsync after each connection-loop exception, replacing the spin
loop that previously pegged a CPU core and flooded the log on persistent errors
such as a duplicate pipe name or a failing PipeAcl.Create. After 20 consecutive
failures the method re-throws so the SCM / NSSM supervisor can restart the
sidecar cleanly. A clean connection (even a short-lived one) resets the counter.
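
The delay schedule above works out as in this small Python sketch. The
constant and function names are illustrative, and treating the 20th
consecutive failure as the re-throw point is an assumption about how the
counter is compared.

```python
BASE_DELAY_MS = 250
MAX_DELAY_MS = 8_000
MAX_CONSECUTIVE_FAILURES = 20  # assumed to be an inclusive threshold


def backoff_delay_ms(consecutive_failures: int) -> int:
    """Delay before the next attempt, doubling from 250 ms up to the 8 s cap."""
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be at least 1")
    return min(BASE_DELAY_MS * 2 ** (consecutive_failures - 1), MAX_DELAY_MS)


def should_rethrow(consecutive_failures: int) -> bool:
    """True once the loop should give up and let the supervisor restart it."""
    return consecutive_failures >= MAX_CONSECUTIVE_FAILURES


print([backoff_delay_ms(n) for n in range(1, 8)])
# [250, 500, 1000, 2000, 4000, 8000, 8000]
```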

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:23:42 -04:00
Joseph Doherty 7d30009dc8 fix(driver-ablegacy): resolve Medium code-review finding (Driver.AbLegacy-003)
TryParse now rejects three classes of malformed PCCC address:
- Sub-element + bit-index together (e.g. T4:0.ACC/2) — never valid in PCCC
- File number on I/O/S system files (e.g. I3:0, S2:1) — single-letter only
- Sub-element on non-T/C/R files (e.g. B3:0.DN, N7:0.FOO) — only Timer,
  Counter, and Control files carry structured elements

New helper predicates IsNoFileNumberLetter / IsSubElementFileLetter
keep the parser's intent clear. Regression tests added in AbLegacyAddressTests.
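
The three rejection rules can be sketched in Python as follows. The address
grammar here is deliberately simplified (one letter, optional file number,
element, optional sub-element, optional bit index) and is an assumption, not
the parser's real grammar; is_well_formed and the two letter sets are
hypothetical names, and sub-element names are not validated.

```python
import re

# Simplified shape: <letter>[file]:<element>[.<SUB>][/<bit>]
_ADDRESS = re.compile(r"([A-Z])(\d+)?:(\d+)(?:\.([A-Z]+))?(?:/(\d+))?")

NO_FILE_NUMBER_LETTERS = {"I", "O", "S"}   # single-letter system files
SUB_ELEMENT_LETTERS = {"T", "C", "R"}      # Timer, Counter, Control


def is_well_formed(address: str) -> bool:
    match = _ADDRESS.fullmatch(address)
    if match is None:
        return False
    letter, file_number, _element, sub_element, bit_index = match.groups()
    # Rule 1: a sub-element and a bit index never appear together.
    if sub_element is not None and bit_index is not None:
        return False
    # Rule 2: I/O/S files do not take a file number.
    if letter in NO_FILE_NUMBER_LETTERS and file_number is not None:
        return False
    # Rule 3: only T/C/R files carry structured sub-elements.
    if sub_element is not None and letter not in SUB_ELEMENT_LETTERS:
        return False
    return True


for address in ["T4:0.ACC/2", "I3:0", "B3:0.DN", "T4:0.ACC", "N7:0/3"]:
    print(address, is_well_formed(address))
```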

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:23:35 -04:00
Joseph Doherty a17de80cdb fix(scripting): resolve Medium code-review finding (Core.Scripting-010)
Add ScriptSandboxTests cases for all forbidden-namespace deny-list
vectors that lacked test coverage: System.Threading.Thread,
System.Threading.Tasks.Task.Run (newly denied per Core.Scripting-003),
System.Runtime.InteropServices.Marshal, and Microsoft.Win32.Registry.
The 001/002 type-granular and node-form vectors were already covered by
the -001/-002 resolution commits. All 79 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:23:29 -04:00
Joseph Doherty a6de04a297 fix(scripting): resolve Medium code-review finding (Core.Scripting-007)
In TimedScriptEvaluator.RunAsync, the catch (TimeoutException) block
now checks ct.IsCancellationRequested before throwing
ScriptTimeoutException, so a caller cancellation that races a timeout
deterministically surfaces as OperationCanceledException regardless of
which WaitAsync observes first. Regression test
Caller_cancellation_wins_even_when_timeout_fires_first added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:23:20 -04:00
Joseph Doherty 0cc3b23101 fix(alarm-historian): resolve Medium code-review finding (Core.AlarmHistorian-003)
EnqueueAsync used synchronous SQLite I/O (conn.Open / ExecuteNonQuery /
COUNT(*)) on the caller's thread, blocking the alarm-emitting thread under
write contention with the drain worker. The cancellationToken parameter was
silently ignored.

- EnqueueAsync converted to genuine async: OpenAsync / ExecuteNonQueryAsync /
  ExecuteScalarAsync used throughout; ct threaded to every await.
- ApplyPragmasAsync added alongside the existing ApplyPragmas helper so
  the WAL + busy_timeout PRAGMAs are applied on the async open path too.
- EnforceCapacityAsync added to handle capacity eviction on the async path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:23:14 -04:00
Joseph Doherty 2c571001ca fix(scripting): resolve Medium code-review finding (Core.Scripting-004)
DependencyExtractor.VisitInvocationExpression now additionally checks
that the member-access receiver is the identifier "ctx" before treating
a GetTag / SetVirtualTag call as a ScriptContext dependency. This
prevents spurious dependencies when a script defines a local helper type
with a matching method name and calls it as other.GetTag("X"). Test
Ignores_member_access_GetTag_on_non_ctx_receiver added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:23:12 -04:00
Joseph Doherty e390e1c067 fix(driver-abcip): resolve Medium code-review finding (Driver.AbCip-009)
The ConcurrentDictionary + TryAdd/dispose-loser pattern for Runtimes
and ParentRuntimes was already applied as part of the Driver.AbCip-008
fix. Recording resolution with evidence rather than applying a
duplicate change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:23:09 -04:00
Joseph Doherty 60366b72c6 fix(scripting): resolve Medium code-review finding (Core.Scripting-003)
Add System.Threading.Tasks to ForbiddenNamespacePrefixes so scripts
cannot use Task.Run / Parallel to spawn background work that outlives
the per-evaluation timeout. Document the unbounded-memory accepted
trade-off and the Task denial rationale in docs/VirtualTags.md (new
"Known resource limits" subsection) and cross-reference from
docs/ScriptedAlarms.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:23:03 -04:00
Joseph Doherty 02daacbfd0 fix(driver-historian-wonderware): resolve Medium code-review finding (Driver.Historian.Wonderware-003)
Extract the string-vs-numeric value selection from raw and at-time read
loops into a SelectValue helper method. aahClientManaged's HistoryQueryResult
has no data-type field in the bound SDK version, so the heuristic (prefer
StringValue when non-empty and Value==0) is unavoidable; the helper now
documents the limitation explicitly in its XML doc so the known edge case
(numeric tag at exactly zero with a formatted StringValue) is self-evident.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:23:00 -04:00
Joseph Doherty 37945deb0a fix(driver-abcip): resolve Medium code-review finding (Driver.AbCip-006)
`PlcTagHandle` and `DeviceState.TagHandles` were dead scaffolding: the
`ReleaseHandle` no-op never called `plc_tag_destroy` and the dict was
never populated. Removed the file, the dead dict, and its
`DisposeHandles` loop. Updated the `AbCipDriver` class doc to document
that native lifetime is owned by libplctag.NET `Tag.Dispose()` (invoked
from `DisposeHandles`) with the library's own finalizer covering any
GC-collected instances. Two test methods that only exercised the dead
`PlcTagHandle` class removed from `AbCipDriverTests`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:22:42 -04:00
Joseph Doherty c8a237e5e6 fix(driver-ablegacy): resolve Medium code-review finding (Driver.AbLegacy-002)
`current & widthMask` was already applied in `WriteBitInWordAsync` by
the -001 High finding fix, which fully neutralises the 16-bit
sign-extension hazard. No further code change required; mark Resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:22:12 -04:00
Joseph Doherty 205b07f6aa fix(driver-historian-wonderware): resolve Medium code-review finding (Driver.Historian.Wonderware-002)
Normalise req.Events to Array.Empty<AlarmHistorianEventDto>() immediately
after MessagePack deserialization in HandleWriteAlarmEventsAsync. MessagePack
deserializes an absent or explicit-nil array field as null, not Array.Empty,
so a peer that sends a null Events array would trigger a NullReferenceException
on either .Length dereference (no-writer branch or catch block), leaving the
client correlation-id wait hanging with no reply frame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:21:46 -04:00
Joseph Doherty 1679344ace fix(driver-historian-wonderware-client): resolve Medium code-review finding (Driver.Historian.Wonderware.Client-002)
Document explicitly that WriteBatchAsync never returns PermanentFail because
the WriteAlarmEventsReply wire contract carries only a bool-per-event (no
unrecoverable/transient distinction). Add a <remarks> XML block explaining
the structural limitation, why poison events retry rather than dead-letter,
and that a coordinated per-event status enum extension to the .NET 4.8
sidecar is a tracked follow-up. Add inline NOTE comments in both the
success and catch paths for discoverability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:21:11 -04:00
Joseph Doherty 5bcbda1685 fix(driver-historian-wonderware-client): resolve Medium code-review finding (Driver.Historian.Wonderware.Client-007)
Introduce DeserializeSampleValue() helper that enforces a 64 KiB per-sample
ValueBytes size cap before calling MessagePackSerializer.Deserialize<object>,
and documents that the default StandardResolver (primitive-only, no typeless
or dynamic-type resolution) is in use. Both ToSnapshots and AlignAtTimeSnapshots
route through the new helper. Add inline XML comments to the two NuGetAuditSuppress
entries in the csproj recording the advisory title, why each does not apply to
this module's primitive-only deserialization, and when to revisit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:20:23 -04:00
Joseph Doherty d5b8c802ce fix(driver-abcip): resolve Medium code-review finding (Driver.AbCip-005)
Structure tags with declared Members no longer register the bare parent
name in `_tagsByName` — reading it would return Good/null, which is
misleading. Clients read individual member paths. Both the member
fan-out and the scalar-tag paths now perform a duplicate-key check that
throws `InvalidOperationException` naming both colliding entries
(fail-fast, consistent with the AbCipHostAddress validation pattern).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:20:22 -04:00
Joseph Doherty 1722c0328b fix(driver-abcip): resolve Medium code-review finding (Driver.AbCip-004)
`ToDriverDataType` mapped LInt/ULInt to Int32 (truncation) and UDInt
to Int32 (negative wrap for values > Int32.MaxValue). DriverDataType
already carries Int64/UInt64/UInt32, so map each Logix 64-bit and
unsigned-32-bit type to the correct member. `DecodeValueAt` in
`LibplctagTagRuntime` updated to return uint/ulong for UDInt/ULInt
so the runtime value type agrees with the declared OPC UA type.
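
The negative wrap is easy to reproduce outside the driver. This Python sketch
reinterprets the same four bytes as signed and unsigned 32-bit values; the
sample value is arbitrary and no driver code is involved.

```python
import struct

# An unsigned 32-bit value above Int32.MaxValue (2147483647), little-endian.
raw = struct.pack("<I", 3_000_000_000)

as_int32 = struct.unpack("<i", raw)[0]    # what an Int32 mapping reports
as_uint32 = struct.unpack("<I", raw)[0]   # what a UInt32 mapping reports

print(as_int32)   # -1294967296
print(as_uint32)  # 3000000000
```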

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:19:40 -04:00
Joseph Doherty 75580fb432 fix(driver-historian-wonderware-client): resolve Medium code-review finding (Driver.Historian.Wonderware.Client-005)
Replace the synchronous non-cancellable _stream.ReadByte() for the kind byte
in FrameReader.ReadFrameAsync with an async ReadExactAsync(new byte[1], ct)
call so the full frame read honours the EffectiveCallTimeout-linked token
and cannot wedge the call gate when the sidecar stalls mid-frame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:19:14 -04:00
Joseph Doherty 6bb971c040 fix(driver-ablegacy-cli): resolve Medium code-review finding (Driver.AbLegacy.Cli-001)
WriteCommand.ParseValue wraps FormatException/OverflowException as
CliFx CommandException so a bad --value yields a clean one-line CLI error
naming the value and target type instead of a raw .NET stack trace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:15:19 -04:00
Joseph Doherty 29e656912e fix(driver-abcip-cli): resolve Medium code-review findings (Driver.AbCip.Cli-001, -002)
Driver.AbCip.Cli-001: WriteCommand.ParseValue wraps FormatException/
OverflowException as CommandException so bad --value input yields a clean
CLI error instead of a raw stack trace.
Driver.AbCip.Cli-002: probe/read/subscribe commands reject Structure types
up front (RejectStructure helper), matching the write guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 09:14:41 -04:00
Joseph Doherty e8edf123ff fix(driver-cli-common): resolve Medium code-review finding (Driver.Cli.Common-005)
Added missing test coverage identified in the -005 finding:

- FormatTable_with_empty_input_returns_header_only: verifies the -004 fix
  (empty batch read returns header+separator rather than throwing).
- FormatStatus_with_sub_code_bits_resolves_to_named_class: Theory exercising
  the -002 high-word mask path (e.g. 0x80050001 → "BadCommunicationError").
- FormatStatus_unknown_sub_code_falls_back_to_severity_class: Theory for the
  -002 severity-class fallback (unknown sub-codes still emit Good/Uncertain/Bad).
- New DriverCommandBaseTests class: four tests covering verbose/non-verbose
  Serilog level selection, ConfigureLogging idempotency, and FlushLogging.

Also corrected the stale FormatStatus_unknown_codes_fall_back_to_hex_only
expectation (0xDEADBEEF now resolves to "Bad" via the severity-class fallback
introduced by -002, not bare hex) and fixed the FormatTable empty-input crash
(guard rows.Length == 0 before calling Enumerable.Max).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:38:44 -04:00
Joseph Doherty 7ff356bddc fix(driver-cli-common): resolve Medium code-review finding (Driver.Cli.Common-003)
ConfigureLogging is now idempotent via a _loggingConfigured guard field so
repeated calls from subclasses do not abandon and leak the previous logger.
The previous Log.Logger is disposed before overwriting to release its
console-sink resources cleanly.

A new protected static FlushLogging() helper calls Log.CloseAndFlush() so
commands can guarantee buffered output is flushed in their finally blocks
before the process exits — important for the long-running subscribe verb.

XML doc updated to reflect call-once semantics and document FlushLogging().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:38:09 -04:00
Joseph Doherty 1433a1cf30 fix(driver-cli-common): resolve Medium code-review finding (Driver.Cli.Common-002)
FormatStatus now matches named codes against code & 0xFFFF0000 (high-word
mask) rather than exact equality, so status codes carrying sub-code or flag
bits in the low 16 bits (e.g. 0x80050001) still resolve to their named class.
For codes not in the named shortlist a severity-class fallback using the top
2 bits always emits Good / Uncertain / Bad rather than bare hex.
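
A rough Python sketch of the two lookup stages, assuming a two-entry named
shortlist and plain string output (the real shortlist and output format are
not shown here). Treating both high-bit severity patterns as Bad is an
assumption, chosen to agree with 0xDEADBEEF resolving to "Bad".

```python
# Hypothetical shortlist keyed by the high 16 bits of the status code.
NAMED_CODES = {
    0x00000000: "Good",
    0x80050000: "BadCommunicationError",
}


def format_status(code: int) -> str:
    code &= 0xFFFFFFFF
    # Stage 1: match on the high word so low-word sub-code bits are ignored.
    named = NAMED_CODES.get(code & 0xFFFF0000)
    if named is not None:
        return named
    # Stage 2: fall back to the severity class in the top two bits.
    severity = code >> 30
    if severity == 0b00:
        return "Good"
    if severity == 0b01:
        return "Uncertain"
    return "Bad"  # 0b10, and 0b11 by assumption


print(format_status(0x80050001))  # BadCommunicationError
print(format_status(0xDEADBEEF))  # Bad
```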

Updated the stale FormatStatus_unknown_codes_fall_back_to_hex_only test (its
expectation became invalid once the severity-class fallback was added) and
added new Theory cases exercising both the high-word matching and the
severity-class fallback paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:37:47 -04:00
Joseph Doherty 3d8c285034 fix(virtual-tags): resolve Medium code-review findings (Core.VirtualTags-002, -003, -005, -008, -012)
Core.VirtualTags-002: cold-start guard publishes BadWaitingForInitialData
instead of silently returning a stale value.
Core.VirtualTags-003: Load detects duplicate Path values and keys the
upstream-subscription loop off the registered tag set.
Core.VirtualTags-005: VirtualTagSource fires the initial-data callback per
path before registering the change observer, fixing an ordering race.
Core.VirtualTags-008: DependencyGraph caches topological rank, lowering
per-change-event cost from O(V+E) to O(closure).
Core.VirtualTags-012: added 9 engine tests; CoerceResult null-return now
maps to BadInternalError as the code comment intended.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:31:49 -04:00
Joseph Doherty 11612900ba fix(core-abstractions): resolve Medium code-review findings (Core.Abstractions-001, -002, -003)
Core.Abstractions-001: PollGroupEngine compares array values with structural
equality so a driver returning a fresh T[] each poll no longer fires spuriously.
Core.Abstractions-002: PollOnceAsync guards reader result cardinality and
throws a descriptive InvalidOperationException on mismatch instead of a
swallowed ArgumentOutOfRangeException that stalled the subscription.
Core.Abstractions-003: the poll loop Task is tracked; Unsubscribe/DisposeAsync
await loop completion before disposing the CTS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:29:49 -04:00
Joseph Doherty 4dcfaace62 fix(scripted-alarms): update findings.md for resolved Medium findings
Mark Core.ScriptedAlarms-002, -004, -005, -007, -012 as Resolved with
one-line descriptions. Update open-findings count from 11 to 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:24:54 -04:00
Joseph Doherty 69994f9cf6 fix(scripted-alarms): resolve Medium code-review finding (Core.ScriptedAlarms-012)
Add engine-level tests covering the six gaps identified in the finding:
(1) timed-shelve auto-expiry driven via injectable clock + RunShelvingCheckForTest
    hook so timer tests are deterministic;
(2) ConfirmAsync, TimedShelveAsync/UnshelveAsync round-trip, EnableAsync engine
    methods exercised end-to-end;
(3) OnEvent subscriber-throws isolation — engine state advances and stays
    operational after a subscriber throws;
(4) IAlarmStateStore.SaveAsync failure leaves in-memory state unchanged (locks in
    the persist-before-update invariant from finding-007);
(5) second LoadAsync does not leak the old timer (regression for finding-002);
(6) AreInputsReady cold-start guard correctly blocks on Bad/missing inputs and
    allows Uncertain-quality inputs through.

Expose RunShelvingCheckForTest() internal method on ScriptedAlarmEngine to
support deterministic timer tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:24:19 -04:00
Joseph Doherty ce86deca62 fix(core): resolve Medium code-review finding (Core-007)
SubscribeAsync now wraps each driver handle in a private HostBoundHandle
that carries the resolved host name.  UnsubscribeAsync unwraps it and
routes through the recorded host's resilience pipeline, correctly
charging the subscription's originating host's circuit breaker/bulkhead
instead of always using the default host.  Falls back to the default
host for handles not created by this invoker.  Two regression tests
added; findings.md Open count updated from 10 to 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:24:17 -04:00
Joseph Doherty 6cec98caef fix(core): resolve Medium code-review finding (Core-006)
BuildAddressSpaceAsync now checks _disposed (throws ObjectDisposedException)
and tears down the previous alarm forwarder + clears the sink registry
before re-walking, so a Galaxy-redeploy rebuild does not leak the old
forwarder and double-deliver alarm transitions.  Three regression tests
added: double-build does not double-fire, sink count is correct after
rebuild, and post-dispose call throws.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:24:08 -04:00
Joseph Doherty debe163f4d fix(core): resolve Medium code-review finding (Core-005)
Change ClusterEntry from sealed record to sealed class so TryUpdate
uses reference equality for the CAS comparison.  Prune now uses a
read-compute-TryUpdate retry loop that restarts when a concurrent
Install updates the entry between the read and the write, preventing
a race that could silently drop the just-installed newest generation.
Two regression tests added to PermissionTrieCacheTests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:23:52 -04:00
Joseph Doherty 09cd579220 fix(core): resolve Medium code-review finding (Core-003)
Add FolderSegment member to NodeAclScopeKind; update WalkSystemPlatform
to report NodeAclScopeKind.FolderSegment (not Equipment) for each
visited Galaxy folder level, so MatchedGrant.Scope in
AuthorizationDecision.Provenance correctly distinguishes Galaxy folder
grants from UNS Equipment grants in the audit trail and Admin UI
diagnostics.  Three regression tests added to PermissionTrieTests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:23:45 -04:00
Joseph Doherty a0b3a4c8a7 fix(scripted-alarms): resolve Medium code-review finding (Core.ScriptedAlarms-007)
Reorder persist/update in ApplyAsync, ReevaluateAsync, and ShelvingCheckAsync:
SaveAsync is now called before the in-memory _alarms entry is advanced. A store
failure therefore leaves both the persisted and in-memory views at the prior state
rather than diverging, maintaining the invariant that startup recovery reflects
actual persisted state.
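
Illustration only (Python, not the repository's C# code): a minimal sketch of
the persist-before-update ordering. AlarmEngine, FlakyStore, StoreError and
the method names are hypothetical; the point is only that the save runs before
the in-memory entry is advanced.

```python
class StoreError(Exception):
    """Raised by the hypothetical store when a save fails."""


class FlakyStore:
    """Hypothetical store that can be told to fail, for demonstration."""

    def __init__(self):
        self.saved = {}
        self.fail = False

    def save(self, alarm_id, state):
        if self.fail:
            raise StoreError("simulated store failure")
        self.saved[alarm_id] = state


class AlarmEngine:
    def __init__(self, store):
        self._store = store
        self._alarms = {}  # in-memory view: alarm_id -> state

    def apply(self, alarm_id, new_state):
        # Persist first. If save raises, the line below never runs, so the
        # in-memory view stays at the prior state, matching the store.
        self._store.save(alarm_id, new_state)
        self._alarms[alarm_id] = new_state

    def get_state(self, alarm_id):
        return self._alarms.get(alarm_id)
```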

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:18:42 -04:00
Joseph Doherty cdaa0da45d fix(scripted-alarms): resolve Medium code-review finding (Core.ScriptedAlarms-005)
Add _disposed re-checks inside ReevaluateAsync and ShelvingCheckAsync after
acquiring _evalGate so callbacks in flight when Dispose() runs bail out cleanly
instead of mutating _alarms or writing to a disposed store. Drop the
_alarms.Clear() from Dispose() — clearing outside the gate races concurrent
reads and is unnecessary since the object is being discarded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:17:50 -04:00
Joseph Doherty 72b5f7a20c fix(scripted-alarms): resolve Medium code-review finding (Core.ScriptedAlarms-004)
Split the LoadAsync seed-read + subscribe loop: ReadTag seed fills _valueCache
first, then persisted-state restore runs, then _loaded = true, then SubscribeTag
is called. Any synchronous initial push from the upstream now arrives after
_alarms is fully initialised and _loaded = true, so ReevaluateAsync will queue
correctly behind the gate rather than racing the half-built state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:16:44 -04:00
Joseph Doherty b75542bbac fix(scripted-alarms): resolve Medium code-review finding (Core.ScriptedAlarms-002)
Dispose any existing _shelvingTimer before reassigning it inside LoadAsync so
that a second LoadAsync call does not leak the old timer and leave two timers
running concurrently against the same engine state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:15:58 -04:00
Joseph Doherty c126fc7a7d fix(configuration): resolve Medium code-review findings (Configuration-002, -003, -006, -009)
Configuration-002: sp_PublishGeneration is transaction-nesting aware
(BEGIN TRANSACTION vs SAVE TRANSACTION on @@TRANCOUNT) so a caller's outer
transaction survives a publish failure; sp_ValidateDraft wrapped in TRY/CATCH.
Configuration-003: ValidatePathLength uses the cluster's actual Enterprise/Site
lengths when available, falling back to the conservative approximation.
Configuration-006: ResilientConfigReader treats a command-timeout
TaskCanceledException as a fault (not caller cancellation) and falls back.
Configuration-009: removed the checked-in plaintext sa connection string;
CreateDbContext now requires OTOPCUA_CONFIG_CONNECTION.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:13:27 -04:00
Joseph Doherty 7e54e1e4a0 fix(client-shared): resolve Medium code-review findings (Client.Shared-001, -002, -007, -008)
Client.Shared-001: lowered the OnAlarmEventNotification early-return guard
from <6 to <1; per-index field guards already default missing fields safely.
Client.Shared-002: GetRedundancyInfoAsync replaces unguarded unboxing casts
with StatusCode.IsGood + Convert.ToInt32/ToByte, defaulting on bad reads.
Client.Shared-007: alarm fallback Task.Run guards on ReferenceEquals(session,
_session) and drops stale alarms on ObjectDisposedException after failover.
Client.Shared-008: WriteValueAsync rejects type inference from bad/null reads;
ValueConverter wraps parse failures in a descriptive FormatException.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:11:23 -04:00
Joseph Doherty aa142f6dd4 fix(client-cli): resolve Medium code-review findings (Client.CLI-001, Client.CLI-005)
Client.CLI-001: parse --start/--end with CultureInfo.InvariantCulture and
DateTimeStyles.AssumeUniversal|AdjustToUniversal so dates are culture-stable.
Client.CLI-005: SDK notification callbacks now hand off to an unbounded
channel drained on the main thread; handlers are unsubscribed before the
summary phase so no notification interleaves with console output.
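
Illustration only (Python, not the repository's C# code): a rough analogue of
the Client.CLI-001 rule, assuming ISO-8601 input. A timestamp without an
offset is taken to be UTC, and one with an offset is converted to UTC, so the
result does not depend on the machine's locale. parse_utc is a hypothetical
name.

```python
from datetime import datetime, timezone


def parse_utc(text: str) -> datetime:
    """Parse an ISO-8601 timestamp and return it as an aware UTC datetime."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # No offset given: assume the value is already UTC.
        return parsed.replace(tzinfo=timezone.utc)
    # Offset given: adjust to UTC.
    return parsed.astimezone(timezone.utc)
```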

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:08:25 -04:00
Joseph Doherty 9f5a5c9997 fix(analyzers): resolve Medium code-review findings (Analyzers-001, Analyzers-006)
Analyzers-001: IsInsideWrapperLambda now matches the wrapper method name
(ExecuteAsync/ExecuteWriteAsync) in addition to the containing type, so a
future non-callSite lambda overload cannot suppress the diagnostic.
Analyzers-006: extended StubSources and added coverage for the remaining
guarded interfaces, synchronous members, concrete-driver receivers,
ExecuteWriteAsync wrapping, and nested lambdas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 08:08:09 -04:00
Joseph Doherty a0aa4a4819 fix(admin): complete Admin-006 — inject IAntiforgery into LogoutAsync for explicit token validation
The previous Admin-006 commit added <AntiforgeryToken /> to the logout form
and updated the comment on the endpoint, but did not update LogoutAsync to
actually call IAntiforgery.ValidateRequestAsync. Blazor's UseAntiforgery()
middleware does not automatically validate minimal-API endpoints, so a
tokenless POST still succeeded. This commit injects IAntiforgery into the
handler, wraps ValidateRequestAsync in a try/catch, and returns 400 on
AntiforgeryValidationException. The endpoint keeps .DisableAntiforgery() to
prevent the middleware from also trying to read the body (which would cause
a double-read). The regression test is updated to log in first (to get an
authenticated session) before asserting 400 on a tokenless logout POST.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 07:51:11 -04:00
Joseph Doherty 1db8736515 fix(admin): update open-findings count in Admin findings.md
Admin-006 through Admin-009 (all Medium) resolved; 3 Low findings remain open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:33:14 -04:00
Joseph Doherty b585429447 fix(admin): resolve Medium code-review finding (Admin-009)
Add AdminAuthPipelineTests (WebApplicationFactory + RoleInjectingHandler) to
enforce that ConfigViewer is denied CanPublish-gated pages while FleetAdmin is
permitted, and that an authenticated FleetAdmin session can reach the homepage.
Existing PageAuthorizationTests (anon page rejection) and AuthEndpointsTests
(login cookie + hub auth) cover cases (a)-(c); this file adds case (d).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:33:03 -04:00
Joseph Doherty 328ab1e614 fix(admin): resolve Medium code-review finding (Admin-008)
Add @ReleasedBy parameter to sp_ReleaseExternalIdReservation via a new EF
migration so the operator principal (not the shared SQL account) is recorded
in ExternalIdReservation.ReleasedBy and ConfigAuditLog.Principal.
ReservationService.ReleaseAsync gains a releasedBy parameter; Reservations.razor
resolves the signed-in user from AuthenticationState and passes it through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:29:54 -04:00
Joseph Doherty 71f91aa57c fix(client-ui): update findings.md — mark Client.UI-001/002/005/007/008 Resolved
Update status and resolution text for the five Medium findings resolved
in this batch; lower the Open findings count from 11 to 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:29:10 -04:00
Joseph Doherty c7f8a00635 fix(client-ui): resolve Medium code-review finding (Client.UI-008)
Implement IDisposable on MainWindowViewModel to detach ConnectionStateChanged,
call Teardown() on the subscription/alarm VMs, and dispose _service so the OPC UA
session and SDK resources are released. Call Dispose() from MainWindow.OnClosing
alongside the existing SaveSettings() call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:28:48 -04:00
Joseph Doherty bdc1f96b5b fix(client-ui): resolve Medium code-review finding (Client.UI-007)
Remove Password from UserSettings and stop writing it to settings.json;
the operator is re-prompted on each launch. Update LoadSettings/SaveSettings
comments and adjust the affected test assertion to verify the password is
not restored from the persisted model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:28:12 -04:00
Joseph Doherty 08f000069c fix(admin): resolve Medium code-review finding (Admin-007)
NewCluster.razor and ClusterDetail.razor now resolve ClaimTypes.Name /
NameIdentifier from the cascaded AuthenticationState instead of hardcoding
"admin-ui" as the createdBy audit field. The operator principal is now
attributed correctly on every cluster-create and draft-create write path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:27:40 -04:00
Joseph Doherty a9cede8ed4 fix(client-ui): resolve Medium code-review finding (Client.UI-005)
Call Subscriptions?.Teardown() and Alarms?.Teardown() in the Disconnected
branch of OnConnectionStateChanged so server-side session drops also
quiesce the DataChanged and AlarmEvent handlers. Add Reattach() methods
that idempotently re-hook the handlers; call them from the Connected
branch so reconnects after a server-side drop restore live updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:27:03 -04:00
Joseph Doherty af454c6af6 fix(admin): resolve Medium code-review finding (Admin-006)
Emit <AntiforgeryToken /> in the MainLayout sign-out form and remove
.DisableAntiforgery() from the /auth/logout endpoint so UseAntiforgery()
validates the token. A tokenless POST now returns 400, preventing CSRF-logout.
Regression-guarded by AuthEndpointsTests.Logout_without_antiforgery_token_is_rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:26:34 -04:00
Joseph Doherty 55c2a5a209 fix(client-ui): resolve Medium code-review finding (Client.UI-002)
Guard the two nullable child VM dereferences (BrowseTree at ConnectAsync
and History at ViewHistoryForSelectedNode) with != null checks, matching
the guarding style already used for Subscriptions and Alarms nearby.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:25:47 -04:00
Joseph Doherty 2816c76c2b fix(client-ui): resolve Medium code-review finding (Client.UI-001)
Route the synchronous IsLoading = true write through _dispatcher.Post so
both IsLoading assignments use the same dispatch path as Results.Clear()
and the final IsLoading = false, eliminating the ordering hazard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:25:18 -04:00
Joseph Doherty 40f06314fb test(virtual-tags): unbreak script-timeout test after sandbox deny-list change
The Timeout_maps_to_BadInternalError_without_killing_the_engine test's
"Hang" script busy-looped on Environment.TickCount64. Commit cfb9ff1
(Core.Scripting-001) added System.Environment to the script-sandbox
deny-list, so the script now fails sandbox validation instead of
reaching the timeout path. Switch the busy-loop to DateTime.UtcNow
(an allowed type) to preserve the test's intent — a self-terminating
~5s hang that overruns the 30ms script timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:06:27 -04:00
Joseph Doherty 371fe2127c docs(code-reviews): regenerate index — 46 High findings resolved
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:01:13 -04:00
Joseph Doherty 5499b817c8 fix(driver-historian-wonderware-client): resolve High code-review finding (Driver.Historian.Wonderware.Client-001)
WonderwareHistorianClient.ReadAtTimeAsync passed the sidecar's reply.Samples
straight through ToSnapshots, which violated the IHistorianDataSource
contract: the result MUST be the same length and order as the requested
timestampsUtc, with gaps returned as Bad-quality snapshots. If the sidecar
dropped or reordered samples, OPC UA HistoryReadAtTime would silently
misalign values with timestamps.

Add an AlignAtTimeSnapshots helper that indexes the returned samples by
timestamp ticks, builds the result array at timestampsUtc.Count in request
order, and emits a Bad-quality (0x80000000) snapshot for any requested
timestamp the sidecar did not return.
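
Illustration only (Python, not the repository's C# code): a minimal sketch of
the alignment described above. The Snapshot shape, the Good status value and
the function signature are assumptions; only the Bad-quality code 0x80000000
and the "same length, request order, gaps as Bad" rule come from this message.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

BAD_QUALITY = 0x80000000   # Bad status named in the commit message
GOOD_QUALITY = 0x00000000  # assumed Good status, for the illustration only


@dataclass(frozen=True)
class Snapshot:
    timestamp_utc: datetime
    value: Optional[float]
    status: int


def align_at_time_snapshots(requested, samples):
    """Return one snapshot per requested timestamp, in request order."""
    # Index whatever the sidecar returned by timestamp.
    by_timestamp = {s.timestamp_utc: s for s in samples}
    # Walk the request; fill any missing timestamp with a Bad snapshot.
    return [
        by_timestamp.get(ts, Snapshot(ts, None, BAD_QUALITY))
        for ts in requested
    ]
```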

Add the ReadAtTimeAsync_PartialAndReorderedReply_AlignsByTimestamp_AndFillsGapsAsBad
regression test where the fake returns a partial, reordered sample set.

Update code-reviews/Driver.Historian.Wonderware.Client/findings.md: -001
Resolved, open-finding count 10 -> 9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:59:40 -04:00
Joseph Doherty f982fa1f69 fix(driver-historian-wonderware): resolve High code-review finding (Driver.Historian.Wonderware-001)
WriteToReadOnlyFile was listed in MalformedErrors, so ClassifyOutcome/
MapOutcome routed it to PermanentFail and the store-and-forward sink
dead-lettered every alarm event in the batch. But WriteToReadOnlyFile is
a connection-configuration fault (the write session was opened without
ReadOnly = false), not an event-payload fault — treating it as permanent
silently and permanently discards alarm events on a misconfigured or
regressed connection, which is data loss.

Move WriteToReadOnlyFile from MalformedErrors into ConnectionErrors. The
batch loop now aborts the batch, resets the connection (so the reconnect
path re-opens a writable ReadOnly = false session), and defers the
events as RetryPlease for the next drain tick.

Updated the ClassifyOutcome theory data and added a dedicated regression
test pinning WriteToReadOnlyFile -> RetryPlease.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:59:40 -04:00
Joseph Doherty 1837b5a828 fix(driver-modbus-addressing): resolve High code-review finding (Driver.Modbus.Addressing-001)
The DL205 family-native branch routed every V-prefixed address through
DirectLogicAddress.UserVMemoryToPdu, a plain octal-to-decimal decode.
DL205/DL260 system V-memory (V40400 and up) is not a simple octal decode:
the CPU relocates the system bank to Modbus PDU 0x2100. Octal-decoding
V40400 produced 16640 (0x4100), the wrong register, so any tag addressing
a system register through the grammar string silently read/wrote the
wrong PLC memory.

- Add DirectLogicAddress.VMemoryToPdu, which decodes the octal V-address,
  detects the system bank (octal >= V40400 == SystemVMemoryOctalBase) and
  relocates it through SystemVMemoryToPdu to PDU 0x2100; user-bank
  addresses keep the plain octal decode.
- ModbusAddressParser's DL205 V branch now calls VMemoryToPdu instead of
  UserVMemoryToPdu. UserVMemoryToPdu is retained for user-bank-only callers.
- Correct the ModbusFamilyParserTests V40400 assertion (16640 -> 0x2100)
  and add system-bank regression cases plus direct helper coverage.
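
Illustration only (Python, not the repository's C# code): a minimal sketch of
the decode described in the list above. It assumes system-bank addresses keep
their offset from V40400 when relocated to PDU 0x2100; the message only states
the base relocation, so treat the offset arithmetic as an assumption.

```python
SYSTEM_V_MEMORY_OCTAL_BASE = 0o40400  # V40400, first system-bank address
SYSTEM_V_MEMORY_PDU_BASE = 0x2100     # relocation target named above


def _decode_octal(address: str) -> int:
    """Decode the octal digits of a V-address such as 'V2000'."""
    if len(address) < 2 or address[0] not in "Vv":
        raise ValueError(f"not a V-memory address: {address!r}")
    digits = address[1:]
    if any(ch not in "01234567" for ch in digits):
        raise ValueError(f"non-octal digit in {address!r}")
    return int(digits, 8)


def v_memory_to_pdu(address: str) -> int:
    """Map a V-address to a Modbus PDU address."""
    value = _decode_octal(address)
    if value >= SYSTEM_V_MEMORY_OCTAL_BASE:
        # System bank: relocate (offset handling is an assumption).
        return SYSTEM_V_MEMORY_PDU_BASE + (value - SYSTEM_V_MEMORY_OCTAL_BASE)
    # User bank: plain octal-to-decimal decode.
    return value
```

Under these assumptions V2000 decodes to 1024, while V40400 maps to 0x2100
rather than the plain octal value 16640 (0x4100).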

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:59:39 -04:00
Joseph Doherty 532b961cf2 fix(driver-modbus): resolve High code-review finding (Driver.Modbus-001)
_lastPublishedByRef was a plain Dictionary<string, object> mutated inside
ShouldPublish, which runs on the PollGroupEngine onChange callback. The engine
runs one background Task per subscription, so a driver with two or more
subscriptions invokes ShouldPublish concurrently on separate threads. Concurrent
TryGetValue/indexer writes on a non-thread-safe Dictionary can corrupt internal
state, drop entries, or throw, crashing the poll loop.

Switch _lastPublishedByRef to ConcurrentDictionary<string, object>; its
TryGetValue and indexer-set operations are individually thread-safe, so the
deadband cache is now correct under concurrent multi-subscription publishing,
consistent with the lock-guarded sibling cache _lastWrittenByRef.

Add an xUnit + Shouldly regression test that runs 24 deadband-configured
single-tag subscriptions concurrently and asserts the poll loop survives without
faulting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:59:39 -04:00
Joseph Doherty 7f2e144f8d fix(driver-galaxy): resolve High code-review findings (Driver.Galaxy-002, Driver.Galaxy-008)
Driver.Galaxy-002 — DataTypeMap.Map had no Int64 arm though MxValueDecoder/
MxValueEncoder both fully support Int64. Galaxy attributes with the Int64
mx_data_type code fell through to the String default, creating a String
address-space node while runtime reads decoded a boxed long. Added
`6 => DriverDataType.Int64`, extending the contiguous 0..5 scheme so the type
map agrees with the decoder/encoder on all seven Galaxy data types.

Driver.Galaxy-008 — after a stream fault the EventPump's StreamEvents consumer
loop exited and its channel completed; EventPump.Start() is a no-op on a
completed-but-non-null loop, so a replayed subscription had no consumer and
ReplayAsync never re-registered the post-reconnect item handles. ReplayAsync
now recreates the EventPump (RestartEventPumpForReplay) and rebinds the
SubscriptionRegistry per subscription with the fresh item handles returned by
the post-reconnect SubscribeBulkAsync, via new SubscriptionRegistry.SnapshotEntries
and Rebind APIs.

Regression tests: DataTypeMapTests (every code incl. Int64),
SubscriptionRegistryTests (Rebind/SnapshotEntries), EventPumpStreamFaultTests
(faulted pump dead, fresh pump resumes dispatch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:59:38 -04:00
Joseph Doherty c9e446387a fix(driver-focas): resolve High code-review findings (Driver.FOCAS-001, Driver.FOCAS-002)
Driver.FOCAS-001: FocasDriverConfigDto exposed no FixedTree / AlarmProjection /
HandleRecycle sections, so a deployment that opted into those features per
docs/drivers/FOCAS.md had the sections silently dropped by case-insensitive
JSON parsing and the features stayed at their disabled defaults. Added
FocasFixedTreeDto / FocasAlarmProjectionDto / FocasHandleRecycleDto and Build*
mappers in CreateInstance that populate the matching FocasDriverOptions
properties; a missing section or field keeps its existing default.

Driver.FOCAS-002: the fixed-tree bootstrap probe classified ProgramInfo as
"supported" whenever GetProgramInfoAsync returned non-null, but
WireFocasClient.GetProgramInfoAsync substituted defaults instead of throwing
on a FOCAS error return, so a CNC series answering EW_FUNC/EW_NOOPT for
cnc_exeprgname2 / cnc_rdopmode still got the Program/ and OperationMode/
subtrees. The method now
throws InvalidOperationException when neither the program-name nor the op-mode
read is IsOk, so SafeTryProbe correctly suppresses the capability.

Added FocasFactoryConfigTests covering the three opt-in config sections
round-tripping through CreateInstance and the fixed-tree bootstrap classifying
ProgramInfo as unsupported when the probe throws. Added an internal
FocasDriver.Options test seam.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:41:28 -04:00
Joseph Doherty ebc0511c72 fix(driver-opcuaclient): resolve High code-review findings (Driver.OpcUaClient-001..-005)
Driver.OpcUaClient-001 — ReadAsync/WriteAsync/DiscoverAsync captured the
session before acquiring _gate, so a reconnect that completed while the
operation was blocked on the gate left the wire call bound to a stale,
closed session. All three now re-read Session (and parse NodeIds) inside
the _gate critical section after WaitAsync returns.

Driver.OpcUaClient-002 — OnReconnectComplete ignored the give-up (null
session) case, permanently wedging the driver with no Faulted signal and
no reconnect loop. The give-up branch now transitions HostState to
Faulted, sets a Faulted DriverHealth with an explanatory message, and
re-arms a fresh SessionReconnectHandler (TryRearmReconnect) against the
last-known session so an always-on gateway self-heals.

Driver.OpcUaClient-003 — BrowseRecursiveAsync discarded browse
continuation points, silently truncating large remote folders.
It now loops on BrowseResult.ContinuationPoint calling BrowseNextAsync
and appending each page until the continuation point is empty.
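
Illustration only (Python, not the repository's C# code or the OPC UA SDK
API): a minimal sketch of paging through a browse with continuation points.
browse_all, FakePagedServer and the (references, continuation_point) return
shape are hypothetical.

```python
def browse_all(browse, browse_next, node_id):
    """Collect every reference under node_id, following continuation points."""
    references, continuation_point = browse(node_id)
    results = list(references)
    # Keep asking for the next page until the server reports no more.
    while continuation_point is not None:
        references, continuation_point = browse_next(continuation_point)
        results.extend(references)
    return results


class FakePagedServer:
    """Hypothetical server that returns references page_size at a time."""

    def __init__(self, references, page_size):
        self._references = list(references)
        self._page_size = page_size

    def _page(self, start):
        end = start + self._page_size
        token = end if end < len(self._references) else None
        return self._references[start:end], token

    def browse(self, node_id):
        return self._page(0)

    def browse_next(self, continuation_point):
        return self._page(continuation_point)
```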

Driver.OpcUaClient-004 — driver-specs.md §8 namespace handling was
absent. Added NamespaceMap (built from session.NamespaceUris at connect,
rebuilt on reconnect) which persists discovered NodeIds in the
server-stable nsu=<uri>;... form; reads/writes re-resolve that form
against the current session so a remote namespace-table reorder no
longer misaddresses nodes. Added the TargetNamespaceKind option +
UnsMappingTable and ValidateNamespaceKind startup enforcement.

Driver.OpcUaClient-005 — OnKeepAlive read/wrote _reconnectHandler
without a lock, racing the SDK keep-alive timer thread and leaking
handlers. The check-and-set in OnKeepAlive, the take-and-clear in
ShutdownAsync, and the dispose/re-arm in OnReconnectComplete now all
run inside the _probeLock critical section.

Adds OpcUaClientNamespaceTests (11 xUnit + Shouldly regression tests)
covering ValidateNamespaceKind and the NamespaceMap stable encoding.
Reconnect/browse wire paths remain fixture-gated per finding -015.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:41:28 -04:00
Joseph Doherty 090d2a4b44 fix(driver-s7): resolve High code-review findings (Driver.S7-001, -006, -007, -011)
Driver.S7-001: Timer (T{n}) / Counter (C{n}) addresses parsed cleanly but
the read path had no S7DataType or decode case for them, so a Timer/Counter
tag passed fail-fast init and then threw a misleading type-mismatch on every
read. InitializeAsync now runs RejectUnsupportedTagAddresses, throwing a clear
NotSupportedException ("not yet supported", echoing tag name + address) so the
config error fails fast at init.

Driver.S7-006: ShutdownAsync cancelled the probe/poll CTSs but did not await
the fire-and-forget loop tasks before DisposeAsync disposed _gate, letting a
loop iteration mid-semaphore race a disposed object. The probe task is now
tracked in _probeTask and each poll task in SubscriptionState.PollTask;
ShutdownAsync cancels every CTS, awaits Task.WhenAll of those handles with a
bounded 5 s DrainTimeout, then disposes the CTSs and gate. Task.Run is passed
CancellationToken.None so the handle is always awaitable.

Driver.S7-007: a PUT/GET-disabled fault (permanent misconfiguration) was
mapped identically to a transient PlcException — both BadDeviceFailure +
Degraded. ReadAsync/WriteAsync now split the catch via an IsAccessDenied
filter (S7.Net exposes no typed code for AccessingObjectNotAllowed, so the
inner-exception chain is inspected for the "not allowed" marker). Access-denied
now maps to BadNotSupported and Faulted with a config-alert message pointing
at the TIA Portal PUT/GET toggle; genuine device faults stay BadDeviceFailure.

Driver.S7-011: S7Driver ignored driverConfigJson on Initialize/Reinitialize,
so a config change delivered through ReinitializeAsync (the only Core-initiated
in-process recovery path) was silently discarded. Config parsing was factored
into S7DriverFactoryExtensions.ParseOptions; InitializeAsync now re-parses
driverConfigJson and rebuilds _options whenever the document has a real body.
An empty / placeholder document keeps the constructor options.

Adds S7DriverCodeReviewFixTests covering Timer/Counter rejection, config-json
application on Initialize/Reinitialize, and shutdown-drain with active
subscriptions. All 68 S7 driver tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:41:26 -04:00
Joseph Doherty d89be2a011 fix(driver-ablegacy): resolve High code-review findings (Driver.AbLegacy-001, Driver.AbLegacy-006)
Driver.AbLegacy-001 — PCCC bit-index range. AbLegacyAddress.TryParse
accepted a bit index of 0..31 for every file type, but a 16-bit
N/B/I/O/S/A word only has bits 0..15. TryParse now range-checks the
bit index against the file's word width (0..15 for 16-bit element
files, 0..31 for the 32-bit L file, no bits on float files), so
addresses like N7:0/20 are rejected at parse time instead of silently
truncating in the (short) cast. WriteBitInWordAsync reads and writes
an L-file parent word as 32-bit Long and masks the RMW arithmetic to
the native width, so a sign-extended 16-bit decode can no longer
corrupt the high bits.
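
Illustration only (Python, not the repository's C# code): a minimal sketch of
the per-file bit-index check. The file letters for 16-bit and 32-bit files are
taken from the text above; using "F" for the float file and the function names
are assumptions.

```python
SIXTEEN_BIT_FILES = frozenset("NBIOSA")  # 16-bit element files named above
LONG_FILE = "L"                          # 32-bit file
FLOAT_FILE = "F"                         # assumed letter for the float file


def max_bit_index(file_letter: str):
    """Return the highest valid bit index, or None if bits are not allowed."""
    letter = file_letter.upper()
    if letter in SIXTEEN_BIT_FILES:
        return 15
    if letter == LONG_FILE:
        return 31
    if letter == FLOAT_FILE:
        return None
    raise ValueError(f"unknown file type: {file_letter!r}")


def is_valid_bit_index(file_letter: str, bit: int) -> bool:
    """True when `bit` fits inside the file's native word width."""
    highest = max_bit_index(file_letter)
    return highest is not None and 0 <= bit <= highest
```

Under this check an address such as N7:0/20 is rejected up front, because bit
20 does not exist in a 16-bit word.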

Driver.AbLegacy-006 — shared-runtime concurrency. A per-tag libplctag
Tag handle is cached and reused by both the server read path and the
poll loop, with no synchronisation around Read/GetStatus/DecodeValue.
Added a per-runtime SemaphoreSlim (DeviceState.GetRuntimeLock, keyed
by tag name); ReadAsync and WriteAsync now hold it across the whole
Read -> GetStatus -> Decode / Encode -> Write -> GetStatus sequence so
no two threads touch the same Tag handle concurrently.

Added xUnit + Shouldly regression coverage: AbLegacyBitIndexRangeTests
(per-file bit-range validation + L-file 32-bit RMW + sign-extension
safety) and AbLegacyRuntimeConcurrencyTests (overlap-detecting fake
proving concurrent read/read and read/write are serialised).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:41:26 -04:00
Joseph Doherty 8a7668c678 fix(driver-abcip): resolve High code-review findings (Driver.AbCip-001, -002, -003, -008)
Driver.AbCip-001 — ReinitializeAsync silently discarded its config JSON.
Extracted AbCipDriverFactoryExtensions.ParseOptions; InitializeAsync now
re-parses a content-bearing driverConfigJson and replaces _options (and
recreates the alarm projection), so a reinitialize with a changed config
(new device/tag, changed timeout) actually takes effect. A blank or
empty-object JSON keeps construction-time options for the unit-test seam.

Driver.AbCip-002 — libplctag status mapping used wrong integer constants.
MapLibplctagStatus now switches on the libplctag.NET Status enum members
(Ok/Pending/ErrorTimeout/ErrorNotFound/ErrorNotAllowed/ErrorOutOfBounds/…)
instead of hand-typed natives, so timeout/not-found/not-allowed/out-of-bounds
get their specific OPC UA codes instead of all collapsing to
BadCommunicationError. The int overload casts to Status to stay correct
against the wrapper's contiguous renumbering.

Driver.AbCip-003 — whole-UDT reads decoded members at declaration-order
offsets, which Studio 5000 does not guarantee. Added the opt-in
AbCipDriverOptions.EnableDeclarationOnlyUdtGrouping flag (default false);
AbCipUdtReadPlanner.Build forms no groups when it is off, so by default
every UDT member reads per-tag rather than at possibly-wrong offsets.

Driver.AbCip-008 — probe loops were fire-and-forget and ShutdownAsync raced
them. Each probe Task is stored on DeviceState.ProbeTask; ShutdownAsync now
cancels every CTS, awaits each probe Task (10s timeout), then disposes the
CTS and handles. DeviceState.Runtimes/ParentRuntimes are ConcurrentDictionary
and the Ensure*RuntimeAsync paths use TryAdd, disposing the losing concurrent
creator instead of leaking a native tag handle.

Adds AbCipDriverCodeReviewRegressionTests and updates existing AbCip tests
to the corrected status constants + opt-in grouping flag. AbCip driver +
test project build clean; all 244 AbCip tests pass. (The full-solution
build has pre-existing, unrelated Driver.Galaxy protobuf-generation errors
in this worktree.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:41:25 -04:00
Joseph Doherty 5197b6c237 fix(driver-twincat): resolve High code-review findings (Driver.TwinCAT-001, -002, -007, -008, -013)
Driver.TwinCAT-001 — InitializeAsync/ReinitializeAsync ignored driverConfigJson.
Extracted the DTO-to-options parse into a shared TwinCATDriverFactoryExtensions.ParseOptions;
InitializeAsync now re-parses driverConfigJson into a mutable _options field, so a config
generation pushed via ReinitializeAsync (added/removed devices, tags, probe settings) is
actually applied at runtime.

Driver.TwinCAT-002 — LInt/ULInt narrowed to Int32. ToDriverDataType now maps LInt to Int64,
ULInt to UInt64, UDInt to UInt32, UInt/USInt to UInt16, Int/SInt to Int16, and the IEC
TIME/DATE/DT/TOD types to UInt32 (their raw UDINT counter). Removed the stale "Int64 gap"
comment — no truncation or sign flips at the OPC UA encode layer.

Driver.TwinCAT-007 — EnsureConnectedAsync was not thread-safe. Connect/reconnect is now
serialized per device by a SemaphoreSlim (DeviceState.ConnectGate) with a double-checked
connect, mirroring the S7 driver. Concurrent read/write/probe callers can no longer leak a
client or race a create-vs-dispose.

Driver.TwinCAT-008 — native ADS notification callbacks ran driver logic on the AMS router
thread. AdsTwinCATClient now enqueues AdsNotificationEx callbacks onto a bounded Channel
drained by a dedicated managed task; the router-thread callback only does a non-blocking
TryWrite, so a slow consumer cannot stall ADS notification delivery process-wide.
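
Illustration only (Python, not the repository's C# code): a minimal sketch of
a non-blocking hand-off from a native callback to a bounded queue. What the
real client does when the queue is full is not stated here; counting the drop
is an assumption of the sketch, and NotificationPump is a hypothetical name.

```python
import queue


class NotificationPump:
    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def on_native_callback(self, notification) -> bool:
        """Called on the (simulated) router thread; must never block."""
        try:
            self._queue.put_nowait(notification)
            return True
        except queue.Full:
            # Assumed policy: count the drop instead of waiting for space.
            self.dropped += 1
            return False

    def drain(self, handler) -> int:
        """Run on a separate worker; hands queued items to `handler`."""
        handled = 0
        while True:
            try:
                item = self._queue.get_nowait()
            except queue.Empty:
                return handled
            handler(item)
            handled += 1
```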

Driver.TwinCAT-013 — TwinCATDriver did not implement IRediscoverable. The driver now
implements IRediscoverable; AdsTwinCATClient detects ADS 0x0702 (symbol-version-changed) on
read/write paths and raises OnSymbolVersionChanged, which the driver forwards as
OnRediscoveryNeeded so Core rebuilds the address space after a PLC program re-download.

Adds TwinCATHighFindingsRegressionTests covering all five fixes; updates the data-type
mapping assertion in TwinCATDriverTests. TwinCAT driver builds clean; 119 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:37:05 -04:00
Joseph Doherty 66e8bfbab3 fix(virtual-tags): resolve High code-review finding (Core.VirtualTags-001)
OnScriptSetVirtualTag updated the value cache, notified observers, and
recorded history for the written path but never scheduled a cascade for
tags depending on that path. This contradicts docs/VirtualTags.md, which
states ctx.SetVirtualTag writes "still participate in change-trigger
cascades": a change-triggered virtual tag reading a script-written tag
went stale until an unrelated trigger fired.

OnScriptSetVirtualTag now launches a fire-and-forget CascadeAsync for the
written path, mirroring OnUpstreamChange. The cascade is scheduled rather
than invoked inline because the callback runs inside EvaluateInternalAsync
while the non-reentrant _evalGate semaphore is held.

Added regression test
SetVirtualTag_within_script_cascades_to_dependents_of_the_written_tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:27:40 -04:00
Joseph Doherty e3f8fa535a fix(scripted-alarms): resolve High code-review finding (Core.ScriptedAlarms-001)
_alarms was a plain Dictionary<string, AlarmState> mutated under the
_evalGate semaphore, but four read paths (GetState, GetAllStates, the
LoadedAlarmIds property, and RunShelvingCheck) touched it from arbitrary
threads with no synchronisation. A Dictionary read concurrent with a
writer's entry reassignment can throw InvalidOperationException or return
torn state.

Switched _alarms to ConcurrentDictionary<string, AlarmState>. The only
write shapes are indexer-set and Clear, both atomic on ConcurrentDictionary,
so all mutations stay correct without further change; reads now get safe
snapshot semantics. LoadedAlarmIds materialises the key snapshot to keep
its IReadOnlyCollection<string> return type. This matches _valueCache,
which is already a ConcurrentDictionary.

Added a regression test (Concurrent_reads_during_mutation_do_not_throw)
that hammers the engine with state mutations while four reader threads
continuously call the three unguarded read paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:27:40 -04:00
Joseph Doherty 4638366b77 fix(alarm-historian): resolve High code-review findings (Core.AlarmHistorian-002, -004, -006)
Core.AlarmHistorian-002 — drain loop now honors exponential backoff:
StartDrainLoop arms a self-rescheduling one-shot Timer. RescheduleDrain
sets the next due-time to max(tickInterval, CurrentBackoff) while the
sink is BackingOff, so a historian outage genuinely slows the cadence
down the 1s->2s->5s->15s->60s ladder instead of hammering at the fixed
tick. Class doc-comment updated.
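
A minimal Python sketch of the due-time rule above, for illustration only. The ladder values and the max(tickInterval, CurrentBackoff) rule come from this message; how a failure count maps onto a ladder rung is an assumption, and the function names are hypothetical.

```python
BACKOFF_LADDER_S = [1, 2, 5, 15, 60]  # the 1s->2s->5s->15s->60s ladder


def current_backoff(consecutive_failures: int) -> int:
    """Assumed mapping: the n-th consecutive failure selects the n-th rung."""
    if consecutive_failures <= 0:
        return 0  # healthy sink, no backoff
    rung = min(consecutive_failures, len(BACKOFF_LADDER_S)) - 1
    return BACKOFF_LADDER_S[rung]


def next_due_time(tick_interval_s: int, consecutive_failures: int) -> int:
    """Never drain faster than the tick, never faster than the backoff."""
    return max(tick_interval_s, current_backoff(consecutive_failures))
```

With a 2 s tick, a healthy sink keeps the 2 s cadence, the third failure stretches it to 5 s, and a long outage settles at 60 s.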

Core.AlarmHistorian-004 — SQLite busy handling: the connection string
is built via SqliteConnectionStringBuilder with DefaultTimeout=5, and a
new OpenConnection helper applies PRAGMA busy_timeout=5000 and
PRAGMA journal_mode=WAL on every open. A concurrent enqueue-vs-drain
file-lock collision now waits the lock out instead of failing fast with
SQLITE_BUSY. All connection open sites switched to the helper.
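
The open-helper pattern can be sketched with Python's standard sqlite3 module. This is an analogue, not the Microsoft.Data.Sqlite code in the repo; `open_connection` here is a hypothetical Python function that only mirrors the two PRAGMAs named above.

```python
import sqlite3


def open_connection(path: str) -> sqlite3.Connection:
    """Open a connection that waits out lock contention instead of failing."""
    conn = sqlite3.connect(path, timeout=5.0)   # 5 s busy wait at the driver level
    conn.execute("PRAGMA busy_timeout = 5000")  # same wait, stated explicitly
    conn.execute("PRAGMA journal_mode = WAL")   # readers no longer block the writer
    return conn
```

Routing every open through one helper keeps the settings from drifting between call sites. The busy timeout is per connection, so it has to be applied on each open; the WAL journal mode is persisted in the database file.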

Core.AlarmHistorian-006 — drain-loop faults are no longer unobserved:
the timer callback (DrainTimerCallback) awaits DrainOnceAsync inside a
try/catch that logs via _logger.Error, records the message into
_lastError, and sets _drainState=BackingOff so a stalled drain is
visible on GetStatus; a finally always re-arms the timer.

Regression tests added to SqliteStoreAndForwardSinkTests:
StartDrainLoop_honors_backoff_and_slows_cadence_under_retry,
StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy,
StartDrainLoop_records_drain_fault_and_keeps_running,
Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy.

findings.md: 002/004/006 marked Resolved; open count 10 -> 7.

Build: clean (0 warnings). Tests: 20/20 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:27:39 -04:00
Joseph Doherty 6300a9e4a8 fix(driver-cli-common): resolve High code-review finding (Driver.Cli.Common-001)
SnapshotFormatter.FormatStatus mapped four OPC UA status names to
incorrect numeric codes, mislabelling operator-facing CLI output. The
codes were corrected to their canonical OPC Foundation
Opc.Ua.StatusCodes values:

  BadTimeout                0x80060000 -> 0x800A0000
  BadNoCommunication        0x80070000 -> 0x80310000
  BadWaitingForInitialData  0x80080000 -> 0x80320000
  BadNodeIdInvalid          0x80350000 -> 0x80330000

The Cli.Common project does not reference the Opc.Ua package (only
Core.Abstractions / CliFx / Serilog), so the hex literals were
corrected in place with a sync note rather than adding a heavy new
dependency.
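
For reference, the corrected table expressed as a small Python lookup. The four codes are the ones listed above; the `format_status` name and the rendered "0x... (Name)" form are assumptions for illustration, not the actual SnapshotFormatter output.

```python
# Corrected OPC UA status codes from the table above.
STATUS_NAMES = {
    0x800A0000: "BadTimeout",
    0x80310000: "BadNoCommunication",
    0x80320000: "BadWaitingForInitialData",
    0x80330000: "BadNodeIdInvalid",
}


def format_status(code: int) -> str:
    """Render a status code as hex, with its name when this table knows it."""
    hex_text = f"0x{code:08X}"
    name = STATUS_NAMES.get(code)
    return f"{hex_text} ({name})" if name else hex_text
```

Codes outside this four-entry table fall back to the bare hex form, so a pre-fix value such as 0x80060000 is no longer labelled BadTimeout.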

SnapshotFormatterTests was updated: the [Theory] expectations now use
the correct spec codes and assert the full rendered form, plus a new
regression [Theory] confirms the pre-fix wrong names no longer apply.
All 24 tests pass.

findings.md: Driver.Cli.Common-001 set to Resolved; open count 6 -> 5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:27:39 -04:00
Joseph Doherty e221371a0c fix(client-shared): resolve High code-review findings (Client.Shared-005, Client.Shared-006)
Client.Shared-005: _activeDataSubscriptions (a plain Dictionary) and the
_activeAlarmSubscription tuple were mutated from the caller thread, the
keep-alive failover path, and DisconnectAsync with no synchronization,
risking bucket corruption / InvalidOperationException / lost entries.
Added a dedicated _subscriptionLock and wrapped every read/write of that
bookkeeping state inside it (Subscribe/Unsubscribe[Alarms]Async,
Disconnect, Dispose, and the snapshot/clear/re-record steps of
ReplaySubscriptionsAsync). Awaited adapter calls stay outside the lock so
it is never held across I/O.

Client.Shared-006: HandleKeepAliveFailureAsync had only a non-atomic
state check guarding re-entry, so two bad keep-alives could each start a
failover loop, racing to dispose/replace _session and double-replaying
subscriptions. It now claims an atomic _failoverInProgress slot via
Interlocked.CompareExchange; a re-entrant call returns immediately. The
loop body moved to RunFailoverAsync, wrapped in try/finally that resets
the flag.
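
The claim-a-slot pattern can be sketched in Python, where a non-blocking acquire on a plain `threading.Lock` plays the role of Interlocked.CompareExchange. The `FailoverGuard` class and its method names are hypothetical; this is an analogue of the technique, not the C# code.

```python
import threading


class FailoverGuard:
    """Lets at most one failover run at a time; later callers return at once."""

    def __init__(self, run_failover):
        self._in_progress = threading.Lock()  # stands in for _failoverInProgress
        self._run_failover = run_failover

    def handle_keep_alive_failure(self) -> bool:
        # Atomically claim the slot; a caller that loses the race just leaves.
        if not self._in_progress.acquire(blocking=False):
            return False
        try:
            self._run_failover()
            return True
        finally:
            self._in_progress.release()  # always reset, even if failover throws
```

The try/finally matters as much as the atomic claim: without it, one failed failover would leave the slot taken and block every later attempt.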

Tests: added KeepAliveFailure_ReentrantWhileFailoverInFlight_RunsFailoverOnce
and SubscribeAndUnsubscribe_ConcurrentCalls_DoNotCorruptState regression
tests; made the FakeSubscriptionAdapter / FakeSessionAdapter /
FakeSessionFactory test doubles thread-safe (and added a CreateGate hook)
so the concurrency tests exercise production locking rather than fake
state. All 138 Client.Shared tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:27:38 -04:00
Joseph Doherty 3de688f8d6 fix(admin): resolve High code-review findings (Admin-003, Admin-004, Admin-005)
Admin-003 — SignalR hubs were anonymously reachable: an unauthenticated
client could open /hubs/fleet, /hubs/alerts and /hubs/script-log and
stream fleet state, alert detail text and server script-log contents.
Added [Authorize] to FleetStatusHub, AlertHub and ScriptLogHub, and
chained .RequireAuthorization() onto all three MapHub() calls as a
belt-and-braces backstop.

Admin-004 — appsettings.json committed live-looking secrets (the `sa`
ConfigDb password and the LDAP ServiceAccountPassword) in plaintext.
Replaced both with empty placeholders sourced from user-secrets (dev) or
the ConnectionStrings__ConfigDb / Authentication__Ldap__ServiceAccountPassword
environment variables (prod); added a UserSecretsId to the Admin csproj
and a fail-fast guard in Program.cs when ConfigDb is empty/missing.

Admin-005 — Login.razor performed SignInAsync from an interactive Blazor
circuit, where the original HTTP response has long completed so the auth
cookie was not emitted. Rewrote it as a static-rendered plain HTML form
(data-enhance="false") posting to a new AuthEndpoints.MapAuthEndpoints()
minimal-API handler (/auth/login, /auth/logout) that does the LDAP bind,
grant resolution, cookie SignInAsync and redirect while the endpoint
still owns the response. Includes an open-redirect guard on returnUrl.

Added xUnit + Shouldly regression tests: AuthEndpointsTests (login cookie
issuance, failed-bind redirect, open-redirect rejection, logout, anonymous
hub negotiate rejection) and AppSettingsSecretHygieneTests (no committed
secrets). All 26 auth-related tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:27:38 -04:00
Joseph Doherty abbf49141c fix(core): resolve High code-review findings (Core-001, Core-002)
Core-001: swap the authorization-cache defaults so
MembershipFreshnessInterval (5 min, inner re-resolve trigger) is
strictly less than AuthCacheMaxStaleness (15 min, fail-closed
ceiling), so NeedsRefresh's warm-refresh path is reachable.

Core-002: TriePermissionEvaluator.Authorize now compares the trie's
GenerationId against the session's AuthGenerationId and re-fetches the
session's bound generation on mismatch, failing closed when that
generation has been pruned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:13:01 -04:00
Joseph Doherty ee51878c08 fix(configuration): resolve High code-review findings (Configuration-001, Configuration-008)
Configuration-001: wrap the EXEC dbo.sp_ValidateDraft call in
sp_PublishGeneration in a BEGIN TRY/CATCH ROLLBACK; THROW block so a
validation RAISERROR aborts the publish instead of being ignored.

Configuration-008: route caller-supplied strings interpolated into
ConfigAuditLog.DetailsJson through STRING_ESCAPE(@x, 'json') and emit
sp_RollbackToGeneration's @TargetGenerationId as a bare JSON number,
closing the JSON-injection / denial-of-operation vector.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:12:00 -04:00
Joseph Doherty adf794f791 fix(server): resolve High code-review findings (Server-002, Server-009)
Server-002 — AuthorizationGate lax-mode no longer overrides explicit deny.
IsAllowed now switches on the evaluator's AuthorizationVerdict: Allow -> true,
Denied (an authored deny rule matched) -> false in BOTH strict and lax mode,
and only the indeterminate NotGranted case falls through to !_strictMode.
Previously `if (decision.IsAllowed) return true; return !_strictMode;` let lax
mode (the default) nullify authored NodeAcl deny rules for fully-resolved
sessions. The tri-state AuthorizationVerdict.Denied member is now honoured.
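
The verdict handling described above amounts to the following truth table, sketched here in Python for illustration. The enum member names follow this message; the function signature is an assumption.

```python
from enum import Enum


class AuthorizationVerdict(Enum):
    ALLOW = "Allow"
    NOT_GRANTED = "NotGranted"  # indeterminate: no rule matched
    DENIED = "Denied"           # an authored deny rule matched


def is_allowed(verdict: AuthorizationVerdict, strict_mode: bool) -> bool:
    if verdict is AuthorizationVerdict.ALLOW:
        return True
    if verdict is AuthorizationVerdict.DENIED:
        return False            # explicit deny wins in strict AND lax mode
    return not strict_mode      # only NotGranted depends on the mode
```

The pre-fix logic collapsed Denied and NotGranted into one branch, which is why lax mode could turn an explicit deny into an allow.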

Server-009 — LDAP is secure-by-default. LdapOptions.AllowInsecureLdap now
defaults to false (was true) and Program.cs's config fallback reads `?? false`
(was `?? true`), so an LDAP-enabled deployment will not bind credentials over
an unencrypted socket unless an operator explicitly opts in. Program.cs also
logs a startup warning when LDAP is enabled with UseTls=false and
AllowInsecureLdap=true, flagging the clear-text server->LDAP credential hop.

Regression tests: AuthorizationGateTests covers all four verdict x mode
combinations via a fixed-verdict evaluator stub; new LdapOptionsTests asserts
the secure defaults. Both Server and Server.Tests build clean; the 15 targeted
tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:11:06 -04:00
Joseph Doherty 7bb21c2aa2 fix(scripting): resolve High code-review finding (Core.Scripting-002)
The ForbiddenTypeAnalyzer syntax walker only inspected four node kinds
(ObjectCreation, Invocation-with-member-access, MemberAccess, bare
Identifier), so a forbidden type named through typeof, a generic type
argument, a cast, an is/as type pattern, default(T), an array-creation
element type, or an explicitly-typed local declaration produced no
examined node and bypassed the sandbox check.

Analyze now runs a second pass that resolves GetTypeInfo on every
TypeSyntax node and recursively unwraps array element types and generic
type arguments, so forbidden types nested at any depth are rejected at
compile. The original member/call node-kind switch is kept deliberately
narrow (rather than resolving GetSymbolInfo on every node) to avoid
flagging harmless inherited members such as typeof(int).Name, whose Name
property is declared by System.Reflection.MemberInfo. A span+type dedupe
keeps the two passes from emitting duplicate rejections.
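
The recursive unwrapping step can be sketched in Python over a toy type model. Everything here is hypothetical: the tuple-based type representation, the `forbidden_types` name, and the single-entry forbidden set are illustration only and do not reflect the Roslyn API the analyzer uses.

```python
FORBIDDEN = {"System.IO.File"}  # hypothetical deny-list entry


def forbidden_types(type_ref):
    """Collect forbidden type names nested at any depth.

    type_ref is a str (named type), ("array", element_type),
    or ("generic", name, [type_arguments]).
    """
    if isinstance(type_ref, str):
        return [type_ref] if type_ref in FORBIDDEN else []
    kind = type_ref[0]
    if kind == "array":
        return forbidden_types(type_ref[1])   # unwrap the element type
    if kind == "generic":
        _, name, args = type_ref
        found = forbidden_types(name)
        for arg in args:                      # unwrap every type argument
            found += forbidden_types(arg)
        return found
    raise ValueError(f"unknown type shape: {kind!r}")
```

A forbidden type hidden inside a generic argument that is itself an array is still found, while an allowed generic produces no result.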

Regression tests added in ScriptSandboxTests cover typeof, generic type
arguments, casts, default(T), is/as patterns, array element types, and
typed local declarations with forbidden types, plus over-block guards
asserting allowed generics and typeof still compile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:08:08 -04:00
Joseph Doherty 8c7c605478 docs(code-reviews): regenerate index — 6 Critical findings resolved
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:54:40 -04:00
Joseph Doherty 4df8737c86 fix(driver-galaxy): wire event-stream faults to the reconnect supervisor (Driver.Galaxy-001)
The ReconnectSupervisor was constructed but its trigger
ReportTransportFailure was never called. When the gateway StreamEvents
stream faulted, EventPump just logged and exited — the supervisor was
never notified, so a transient gateway drop permanently stopped
data-change notifications while GetHealth() still reported Healthy.

EventPump gains an optional onStreamFault callback invoked from its
stream-fault catch block (not on clean shutdown). GalaxyDriver wires it
to ReconnectSupervisor.ReportTransportFailure so a transport drop drives
reopen → replay.

This is the minimal fix for -001; the pump-restart-on-reopen gap remains
tracked as Driver.Galaxy-008. Regression tests cover the callback being
invoked on fault, the end-to-end supervisor reopen/replay, and that a
clean shutdown does not fire it. Driver.Galaxy suite: 206/206 pass.

Resolves code-review finding Driver.Galaxy-001 (Critical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:54:33 -04:00
Joseph Doherty 796871c210 fix(alarm-historian): keep queue rows aligned to events on drain (Core.AlarmHistorian-001)
ReadBatch built parallel rowIds / events lists: rowIds.Add ran for every
row but events.Add was guarded by `if (evt is not null)`. A corrupt /
null-deserializing payload desynced the lists, so DrainOnceAsync applied
each outcome to the wrong RowId — an Ack could delete an un-sent event
(silent alarm-event data loss) and the corrupt row stalled the queue
head forever.

ReadBatch now returns a single list of QueueRow(long RowId,
AlarmHistorianEvent? Event) records so a rowId can never drift from its
event; deserialization is wrapped to yield null on JsonException.
DrainOnceAsync immediately dead-letters rows whose payload is
null/un-deserializable and forwards only well-formed events to the
writer, mapping outcomes by RowId.
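
The single-list shape can be sketched in Python. The QueueRow pairing and the dead-letter rule follow this message; the dict payload model and the function names are assumptions for illustration.

```python
import json
from typing import NamedTuple, Optional


class QueueRow(NamedTuple):
    row_id: int
    event: Optional[dict]  # None when the payload could not be deserialized


def read_batch(raw_rows):
    """Pair every row id with its event so the two can never drift apart."""
    rows = []
    for row_id, payload in raw_rows:
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            event = None
        # A payload that parses to JSON null (or a non-object) is also unusable.
        rows.append(QueueRow(row_id, event if isinstance(event, dict) else None))
    return rows


def partition(rows):
    """Split a batch into row ids to dead-letter and rows to forward."""
    dead_letter_ids = [r.row_id for r in rows if r.event is None]
    to_send = [r for r in rows if r.event is not None]
    return dead_letter_ids, to_send
```

Because outcomes are later applied by `row_id` rather than by list position, a corrupt row in the middle of a batch cannot shift an Ack onto its neighbour.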

Regression tests cover a corrupt row mid-batch and at the queue head.
Core.AlarmHistorian suite: 16/16 pass.

Resolves code-review finding Core.AlarmHistorian-001 (Critical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:54:20 -04:00
Joseph Doherty cfb9ff1032 fix(scripting): block dangerous System types in the script sandbox (Core.Scripting-001)
ForbiddenTypeAnalyzer used only a namespace-prefix deny-list. System.Environment,
System.AppDomain, System.GC and System.Activator live directly in the System
namespace, which must stay allowed for primitives (Math, String, ...), so they
were never caught — an operator-authored predicate could call
System.Environment.Exit(0) and terminate the in-process OPC UA server.

Add a type-granular deny-list (ForbiddenFullTypeNames) checked by
fully-qualified type name after the namespace-prefix check; legitimate System
types are unaffected.
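
A minimal Python sketch of the two-stage check. The four full type names come from this message; the namespace prefixes are hypothetical placeholders, since the real deny-list contents are not shown here.

```python
# Hypothetical prefixes, for illustration only.
FORBIDDEN_NAMESPACE_PREFIXES = ("System.IO", "System.Net")

# Types that live directly in the allowed System namespace.
FORBIDDEN_FULL_TYPE_NAMES = {
    "System.Environment",
    "System.AppDomain",
    "System.GC",
    "System.Activator",
}


def is_forbidden(full_type_name: str) -> bool:
    namespace = full_type_name.rpartition(".")[0]
    # Stage 1: namespace-prefix deny-list.
    for prefix in FORBIDDEN_NAMESPACE_PREFIXES:
        if namespace == prefix or namespace.startswith(prefix + "."):
            return True
    # Stage 2: type-granular deny-list by fully-qualified name.
    return full_type_name in FORBIDDEN_FULL_TYPE_NAMES
```

`System.Math` and `System.String` pass both stages, while `System.Environment` is caught by the second stage even though its namespace is allowed.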

Regression tests assert scripts referencing Environment/AppDomain/GC/Activator
are rejected at analysis time. Core.Scripting suite: 68/68 pass.

Resolves code-review finding Core.Scripting-001 (Critical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:54:08 -04:00
Joseph Doherty 973730d0eb fix(admin): enforce authentication on all Admin UI routes (Admin-001/002)
Admin-001: Routes.razor used a plain RouteView, so the page-level
[Authorize] attributes on 11 pages were inert — every page, including
mutating ones, was reachable fully unauthenticated.
Admin-002: several pages (e.g. NewCluster, which writes config rows)
carried no auth attribute at all.

- Routes.razor: RouteView → AuthorizeRouteView with NotAuthorized /
  Authorizing slots; add RedirectToLogin component.
- Program.cs: SetFallbackPolicy(RequireAuthenticatedUser) — secure by
  default for new pages/endpoints.
- Login.razor: [AllowAnonymous] so login stays reachable; login page,
  /auth/* endpoints and static assets remain anonymous.
- Add [Authorize] to the previously un-gated pages; NewCluster gated to
  the CanPublish (FleetAdmin) policy.

Regression tests in PageAuthorizationTests pin that anonymous requests
to protected/mutating routes are rejected and that login + static
assets stay anonymously reachable. Admin test suite: 210/210 pass.

Resolves code-review findings Admin-001 and Admin-002 (Critical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:53:58 -04:00
Joseph Doherty 571066130b fix(server): stop WriteNodeIdUnknown infinite recursion (Server-001)
WriteNodeIdUnknown called itself unconditionally as its first statement
— unbounded recursion with no base case → StackOverflowException, an
uncatchable process crash reachable by any client issuing a HistoryRead
on an unresolvable NodeId (remote DoS).

Replace the self-call with the result-slot assignment, mirroring
WriteUnsupported / WriteInternalError. The helper is now internal so the
regression test can pin the StatusCode without a server fixture.

Resolves code-review finding Server-001 (Critical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:53:44 -04:00
Joseph Doherty 8568f5cd85 docs(code-reviews): comprehensive per-module review pass at 76d35d1
Reviewed all 31 src/ production projects against the 10-category
checklist in REVIEW-PROCESS.md. Each module gets its own findings.md;
code-reviews/README.md is regenerated from them.

334 findings: 6 Critical, 46 High, 126 Medium, 156 Low.

Critical findings:
- Server-001: WriteNodeIdUnknown recurses unconditionally — a HistoryRead
  on an unresolvable node crashes the process (remote DoS).
- Admin-001/002: app-wide auth bypass (RouteView not AuthorizeRouteView)
  plus unauthenticated mutating routes.
- Core.Scripting-001: System.Environment reachable from operator scripts;
  Environment.Exit() terminates the server.
- Core.AlarmHistorian-001: rowIds/events parallel-list desync on a corrupt
  payload misapplies outcomes — silent alarm-event data loss.
- Driver.Galaxy-001: ReconnectSupervisor is built but never triggered, so
  a transient gateway drop permanently kills the event stream.

All findings are Status=Open; resolution is tracked per REVIEW-PROCESS.md
section 4. Review only — no source code changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 05:20:27 -04:00
Joseph Doherty 76d35d1b9f chore: add per-module code review process and tracking infra
Adapts the code-review procedure, folder layout, template, and tooling
from the sibling mxaccessgw repo to lmxopcua.

- REVIEW-PROCESS.md: per-module review workflow — a module is one src/
  or tests/ project (ZB.MOM.WW.OtOpcUa. prefix stripped); 10-category
  checklist; finding IDs/severities/statuses; re-review rules.
- code-reviews/_template/findings.md: per-module findings template.
- code-reviews/regen-readme.py: generates the cross-module README.md
  index from the per-module findings.md files; --check gates staleness
  and consistency.
- code-reviews/test_regen_readme.py: dependency-free generator tests.
- code-reviews/prompt.md: orchestration prompt for clearing the backlog.
- code-reviews/README.md: generated index (no modules reviewed yet).
- scripts/check-code-reviews-readme.ps1: CI / pre-commit check wrapper.

Adapted to this repo: ZB.MOM.WW.OtOpcUa module naming, OtOpcUa
conventions checklist (in-process GalaxyDriver + mxaccessgw,
contained-name vs tag-name, ACL at DriverNodeManager), single .NET
solution build/test commands, and the lmxopcua design docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 04:08:47 -04:00
Joseph Doherty 27a8d05b7c feat(driver-galaxy): consume the gateway's session-less alarm model
The mxaccessgw updated alarms to a session-less central monitor:
AcknowledgeAlarm dropped SessionId and alarm transitions now come from
the session-less StreamAlarms feed instead of the per-session worker
StreamEvents stream. The GalaxyDriver no longer compiled against the
updated client.

- GatewayGalaxyAlarmAcknowledger: session-less rewrite — no GalaxyMxSession;
  outcome read from ProtocolStatus (throw) and Hresult (warn).
- New IGalaxyAlarmFeed seam + GatewayGalaxyAlarmFeed: background consumer
  of StreamAlarms that decodes the active-alarm snapshot plus live
  transitions into GalaxyAlarmTransition and reopens the stream on
  transport faults.
- EventPump: drop the dead per-session OnAlarmTransition path; the
  per-session stream no longer carries alarms.
- GalaxyDriver: bridge the feed onto IAlarmSource.OnAlarmEvent; the feed
  starts on SubscribeAlarmsAsync, independent of data subscriptions.
- Tests: replace EventPumpAlarmTests with GatewayGalaxyAlarmFeedTests;
  move the driver alarm-source tests onto the IGalaxyAlarmFeed seam.

Browse needed no change — GatewayGalaxyHierarchySource consumes the
unchanged DiscoverHierarchy contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 03:59:36 -04:00
Joseph Doherty cd2306db66 feat(historian-sidecar): live aahClientManaged alarm-event write path (C.1)
SdkAlarmHistorianWriteBackend.WriteBatchAsync replaces the RetryPlease
placeholder with the real entry point — HistorianAccess.AddStreamedValue
(HistorianEvent, out HistorianAccessError) in aahClientManaged, pinned by
decompiling the installed SDK.

The write path opens its own ReadOnly=false connection: the query-side
HistorianDataSource opens ReadOnly sessions and AddStreamedValue fails on
those with WriteToReadOnlyFile. IHistorianConnectionFactory gains a readOnly
parameter (default true, query path unchanged); BuildConnectionArgs is
extracted as a pure helper. HistorianClusterEndpointPicker is shared for
node failover; connection-class errors abort the batch as RetryPlease and
reset the connection, malformed-input codes map to PermanentFail.

Tests: connection-unavailable batch deferral, ClassifyOutcome error-code
table, BuildConnectionArgs read-vs-write shaping (80 pass, 2 rig-skipped).
Live_* round-trip tests stay Skip-gated for the D.1 rollout smoke.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:08:32 -04:00
Joseph Doherty 419eda256b feat(server): route OPC UA Part 9 AddComment to ScriptedAlarmEngine
RouteScriptedAlarmMethodCalls now handles ConditionType.AddComment
alongside Acknowledge/Confirm, dispatching to engine.AddCommentAsync.
An empty comment is rejected by the Part 9 state machine and surfaced
as BadInvalidArgument. MapCallOperation gates AddComment at the
AlarmAcknowledge tier — there is no dedicated AddComment permission bit.

Closes phase-7-status.md Gap 1: all Part 9 alarm methods now route to
the engine. Adds 3 unit tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 09:43:03 -04:00
Joseph Doherty c5915700bd feat(server): route OPC UA Part 9 shelve methods to ScriptedAlarmEngine (#24)
OneShotShelve / TimedShelve / Unshelve now reach the ScriptedAlarmEngine.
Scripted-alarm condition nodes get a ShelvedStateMachine subtree created
before alarm.Create so the stack wires each shelve method's dispatch
handler; AlarmConditionState.OnShelve / OnTimedUnshelve route to the
engine and mirror the result onto the OPC UA node via SetShelvingState.

The three per-instance shelve method NodeIds are indexed so the Call gate
resolves them to OpcUaOperation.AlarmShelve instead of falling through to
generic Call. Engine dispatch is split into the node-free InvokeEngineShelve
so the routing decision is unit-testable.

Adds 9 unit tests; updates phase-7-status.md Gap 1 (only AddComment remains
unwired) and the #24 entry in looseends.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 09:31:30 -04:00
268 changed files with 22774 additions and 1958 deletions
+162
@@ -0,0 +1,162 @@
# Code Review Process
This document describes how to perform a comprehensive, per-module code review of
the `lmxopcua` codebase (the ZB.MOM.WW.OtOpcUa OPC UA server) and how to track
findings to resolution.
A **module** is one buildable project under `src/` (e.g.
`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy`) or one test project under `tests/`
(e.g. `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests`). Each module has its
own folder under `code-reviews/` containing a single `findings.md`.
## 1. Before you start
1. Pick the module to review. Its folder is `code-reviews/<Module>/`, where
`<Module>` is the project name with the `ZB.MOM.WW.OtOpcUa.` prefix stripped:
- `src/Server/ZB.MOM.WW.OtOpcUa.Server` is reviewed in `code-reviews/Server/`.
- `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` → `code-reviews/Driver.Galaxy/`.
- `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions` → `code-reviews/Core.Abstractions/`.
- `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests` →
`code-reviews/Driver.Galaxy.Tests/`.
The solution `ZB.MOM.WW.OtOpcUa.slnx` enumerates every project; `src/` is
grouped into `Core/`, `Server/`, `Drivers/`, `Client/`, and `Tooling/`.
2. Identify the design context for the module:
- `CLAUDE.md` — project goal, the data-flow architecture, the contained-name
vs tag-name concept, and the **Library Preferences** / build & runtime
constraints.
- `StyleGuide.md` — repository code-style conventions.
- The relevant docs under `docs/` and `docs/v2/` — e.g. `docs/OpcUaServer.md`,
`docs/AddressSpace.md`, `docs/ReadWriteOperations.md`, `docs/security.md`,
`docs/Redundancy.md`, `docs/ScriptedAlarms.md`, `docs/AlarmTracking.md`,
`docs/ServiceHosting.md`, `docs/v2/plan.md`, `docs/v2/acl-design.md`,
`docs/v2/driver-specs.md`, `docs/v2/driver-stability.md`, the
`docs/v2/Galaxy.*.md` set, and the driver notes under `docs/drivers/`.
- The auto-memory index at
`~/.claude/projects/.../memory/MEMORY.md` records non-obvious project
decisions and is worth a scan before a review.
3. Record the exact commit being reviewed: `git rev-parse --short HEAD`. Every
review is a snapshot — a finding only means something relative to a known
commit.
4. Open `code-reviews/<Module>/findings.md` (copy it from
[`code-reviews/_template/findings.md`](code-reviews/_template/findings.md) if it
does not exist yet) and fill in the header table (reviewer, date, commit SHA,
status).
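The folder-naming rule from step 1 can be sketched as a small Python helper. The prefix and the example paths come from this document; the `review_folder` name is hypothetical and is not part of the repo's tooling.
```python
PREFIX = "ZB.MOM.WW.OtOpcUa."


def review_folder(project_path: str) -> str:
    """Map a project path to its code-reviews folder by stripping the prefix."""
    project_name = project_path.rstrip("/").rsplit("/", 1)[-1]
    if project_name.startswith(PREFIX):
        module = project_name[len(PREFIX):]
    else:
        module = project_name  # assumption: unprefixed names are kept as-is
    return f"code-reviews/{module}/"
```
For example, `review_folder("src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy")` yields `code-reviews/Driver.Galaxy/`.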
## 2. Review checklist
Work through **every** category below for the module. A comprehensive review
means the checklist is completed even where it produces no findings — record
"No issues found" for a category rather than leaving it ambiguous.
1. **Correctness & logic bugs** — off-by-one, null handling, incorrect
conditionals, misuse of APIs, broken edge cases, wrong data-type mapping.
2. **OtOpcUa conventions** — the rules in `CLAUDE.md` and `StyleGuide.md`: Galaxy
access flows through the in-process `GalaxyDriver` over gRPC to the separately
installed `mxaccessgw` gateway — nothing in this repo loads MXAccess COM
directly; browse uses **contained names** and runtime read/write uses
**tag names** (`tag_name.AttributeName`); authorization decisions happen in
`DriverNodeManager` at the server layer, never in driver-level code — drivers
only report `SecurityClassification` as metadata; .NET 10 / AnyCPU; Serilog
with a rolling daily file sink; xUnit + Shouldly for unit tests; the .NET
generic host with `AddWindowsService` for the Server and Admin hosts; the OPC
Foundation UA .NET Standard stack for OPC UA; generated code is not
hand-edited.
3. **Concurrency & thread safety** — shared mutable state, race conditions,
correct use of `async`/`await`, locking, disposal races, background-loop and
reconnect-supervisor lifetimes.
4. **Error handling & resilience** — exception paths, driver/gateway reconnect
handling, transient vs permanent error classification, graceful degradation,
correct OPC UA `StatusCode`s, address-space rebuild on redeploy.
5. **Security** — OPC UA transport security profiles (`SecurityProfileResolver`),
LDAP bind authentication and the group→permission mapping
(`LdapUserAuthenticator`), ACL enforcement at the `DriverNodeManager` layer,
input validation, SQL injection in the `ConfigDb` / Galaxy Repository queries,
certificate handling, and secret handling (no logging of credentials, LDAP
service-account passwords, or API keys).
6. **Performance & resource management** — `IDisposable` disposal, gRPC channel /
stream / session lifetimes, buffering and back-pressure on event pumps,
unnecessary allocations on hot paths, N+1 queries.
7. **Design-document adherence** — does the code match `CLAUDE.md`, the relevant
`docs/` and `docs/v2/` designs? Flag both code that drifts from the design and
design docs that are now stale.
8. **Code organization & conventions** — namespace hierarchy, project layout, the
Options pattern, separation of concerns, the capability-interface seams
(`IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, etc.).
9. **Testing coverage** — are the module's behaviours covered? Unit suites are
`*.Tests` (xUnit + Shouldly); integration suites are `*.IntegrationTests` and
need their Docker fixture up; DB-backed tests in `*.Configuration.Tests`,
`*.Admin.Tests`, and `*.Server.Tests` need the central SQL Server. Note
untested critical paths and missing edge-case tests.
10. **Documentation & comments** — XML doc accuracy, misleading or stale comments,
undocumented non-obvious behaviour.
## 3. Recording findings
Add one entry per finding to the `## Findings` section of the module's
`findings.md`, using the entry format in
[`_template/findings.md`](code-reviews/_template/findings.md).
- **Finding ID** — `<Module>-NNN`, numbered sequentially within the module and
never reused (e.g. `Driver.Galaxy-001`). IDs are permanent even after
resolution.
- **Severity:**
- **Critical** — data loss, security breach, crash/deadlock, or outage.
- **High** — incorrect behaviour with significant impact; no safe workaround.
- **Medium** — incorrect or risky behaviour with limited impact or a workaround.
- **Low** — minor issues, style, maintainability, documentation.
- **Category** — one of the 10 checklist categories above.
- **Location** — `file:line` (clickable), or a list of locations.
- **Description** — what is wrong and why it matters.
- **Recommendation** — concrete suggested fix.
After recording findings, update the module header table (status, open-finding
count) and regenerate the base README (section 5).
## 4. Marking an item resolved
Findings are **never deleted** — they are an audit trail. To close one, change
its **Status** and complete the **Resolution** field:
- `Open` — newly recorded, not yet addressed.
- `In Progress` — a fix is actively being worked on.
- `Resolved` — fixed. The Resolution field must state the fixing commit SHA, the
date, and a one-line description of the fix.
- `Won't Fix` — intentionally not fixed. The Resolution field must justify why.
- `Deferred` — valid but postponed. The Resolution field must say what it is
waiting on (e.g. a tracked issue or a later milestone).
`Resolved`, `Won't Fix`, and `Deferred` findings are all considered **closed**.
`Open` and `In Progress` are **pending** and appear in the base README's Pending
Findings table.
## 5. Updating the base README
`code-reviews/README.md` holds the single cross-module view (the Module Status
table and the Pending / Closed Findings tables). It is **generated** from the
per-module `findings.md` files — do not edit it by hand.
After any review or status change, regenerate it:
```
python code-reviews/regen-readme.py
```
`regen-readme.py --check` exits non-zero if `README.md` is stale, if a module
header's `Open findings` count disagrees with its finding statuses, or if a
finding carries an unrecognised Status value. The PowerShell wrapper
`scripts/check-code-reviews-readme.ps1` runs that check and is the intended hook
for CI or a pre-commit step. `code-reviews/test_regen_readme.py` covers the
generator itself (`python code-reviews/test_regen_readme.py`).
> The repo's installed `python` is the real interpreter; the bare `python3`
> alias on this box resolves to the Windows Store stub and fails. Use `python`.
The per-module `findings.md` files are the source of truth; `README.md` is the
aggregated index and must always agree with them — which the script guarantees.
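The consistency rules that `--check` enforces can be sketched in a few lines of Python. This is an illustration, not the actual `regen-readme.py`: the `check_module` function is hypothetical, and it assumes the `Open findings` count covers every pending status (`Open` and `In Progress`).
```python
PENDING = {"Open", "In Progress"}
CLOSED = {"Resolved", "Won't Fix", "Deferred"}


def check_module(declared_open: int, statuses: list) -> list:
    """Return a list of problems; an empty list means the module is consistent."""
    problems = []
    for status in statuses:
        if status not in PENDING | CLOSED:
            problems.append(f"unrecognised Status value: {status!r}")
    actual_open = sum(1 for status in statuses if status in PENDING)
    if actual_open != declared_open:
        problems.append(
            f"header says {declared_open} open, statuses say {actual_open}"
        )
    return problems
```
A header that declares 3 open findings over statuses `["Open", "Resolved"]` would be reported as inconsistent, as would any finding whose Status is not one of the five values from section 4.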
## 6. Re-reviewing a module
Re-reviews append to the same `findings.md`. Update the header to the new commit
and date, continue the finding numbering from the last used ID, and leave prior
findings (including closed ones) in place as history.
+2
@@ -0,0 +1,2 @@
# regen-readme.py / test_regen_readme.py bytecode cache
__pycache__/
+222
@@ -0,0 +1,222 @@
# Code Review — Admin
| Field | Value |
|---|---|
| Module | `src/Server/ZB.MOM.WW.OtOpcUa.Admin` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 3 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Admin-005 |
| 2 | OtOpcUa conventions | Admin-010 |
| 3 | Concurrency & thread safety | Admin-011 |
| 4 | Error handling & resilience | Admin-008, Admin-013 |
| 5 | Security | Admin-001, Admin-002, Admin-003, Admin-004, Admin-006 |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | Admin-007, Admin-012 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Admin-009 |
| 10 | Documentation & comments | Admin-012 |
## Findings
### Admin-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Security |
| Location | `Components/Routes.razor:4-11`, `Program.cs:150` |
| Status | Resolved |
**Description:** The router uses a plain `RouteView` (not `AuthorizeRouteView`), and `MapRazorComponents<App>()` is registered without `.RequireAuthorization()`. A page-level `[Authorize]` attribute on a routable Razor component is only enforced when the router is `AuthorizeRouteView` — with `RouteView` the attribute is inert. Consequently every page in the app, including those that carry `@attribute [Authorize]` (`ClusterDetail`, `DraftEditor`, `Reservations`, `RoleGrants`, `Certificates`, `VirtualTags`, `ScriptedAlarms`, `ScriptLog`, `DiffViewer`, `ImportEquipment`, `Account`), is reachable by a fully unauthenticated user. There is no authentication gate anywhere in the pipeline. An anonymous browser can read the full fleet configuration, audit log, certificates and ACLs, and exercise mutating pages (see Admin-002).
**Recommendation:** Replace `RouteView` with `AuthorizeRouteView` in `Routes.razor` (with a `<NotAuthorized>` slot that redirects to `/login`), or call `.RequireAuthorization()` on the `MapRazorComponents` endpoint with `/login` and `/auth/*` explicitly allowed anonymous. Add a fallback policy (`AddAuthorizationBuilder().SetFallbackPolicy(...)`) so new pages are secure-by-default. Re-verify every page after the gate is in place.
**Resolution:** Resolved 2026-05-22 — `Routes.razor` switched to `AuthorizeRouteView` with a `NotAuthorized` slot routing unauthenticated callers to `/login` via a new `RedirectToLogin` component; `AddAuthorizationBuilder().SetFallbackPolicy(RequireAuthenticatedUser())` makes pages secure-by-default; `Login.razor` opts out with `[AllowAnonymous]` so the login page and static assets stay anonymous. Covered by `PageAuthorizationTests` (verified failing pre-fix, passing post-fix).
### Admin-002
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Security |
| Location | `Components/Pages/Clusters/NewCluster.razor:1-7`, `Home.razor`, `Fleet.razor`, `Hosts.razor`, `AlarmsHistorian.razor`, `Clusters/ClustersList.razor`, `Clusters/Generations.razor`, `Drivers/FocasDetail.razor` |
| Status | Resolved |
**Description:** Several routable pages carry no authorization attribute at all. Most critically `NewCluster` (`/clusters/new`) is a mutating page — its `CreateAsync` writes a new `ServerCluster` row and a draft generation. Combined with Admin-001 (the router does not enforce `[Authorize]` either), an unauthenticated user can create clusters and seed config-DB rows. `Home`, `Fleet`, `Hosts`, `AlarmsHistorian`, `ClustersList`, `Generations` and `FocasDetail` likewise expose fleet topology, host status, historian diagnostics and generation history to anonymous callers.
**Recommendation:** Add `@attribute [Authorize(...)]` to every routable page with the role/policy appropriate to its function (`NewCluster` and other write surfaces -> `CanPublish`/`CanEdit`; read pages -> an authenticated-user policy). A solution-wide fallback policy (see Admin-001) is the durable fix; per-page attributes remain the explicit declaration of intent.
**Resolution:** Resolved 2026-05-22 — `@attribute [Authorize]` added to every unprotected routable page (`Home`, `Fleet`, `Hosts`, `AlarmsHistorian`, `ClustersList`, `FocasDetail`, `ModbusAddressPreview`, `ModbusDiagnostics`); `NewCluster` gated with `[Authorize(Policy = "CanPublish")]` per the admin-ui.md FleetAdmin cluster-create flow. Re-triage note: `Clusters/Generations.razor` carries no `@page` directive — it is a child component of `ClusterDetail`, not a routable page, so it needs no attribute (it inherits the parent route's gate). The Admin-001 fallback policy is the durable secure-by-default backstop; the per-page attributes are the explicit declaration of intent. Covered by `PageAuthorizationTests`.
### Admin-003
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `Program.cs:137-139`, `Hubs/FleetStatusHub.cs:11`, `Hubs/AlertHub.cs:10`, `Hubs/ScriptLogHub.cs:30` |
| Status | Resolved |
**Description:** All three SignalR hubs (`/hubs/fleet`, `/hubs/alerts`, `/hubs/script-log`) are mapped with no `[Authorize]` attribute and no `.RequireAuthorization()` on the `MapHub` call. Any unauthenticated client can open a hub connection: `FleetStatusHub.SubscribeFleet()` streams every node generation/role/resilience state, `AlertHub` pushes all fleet alerts (including failure detail text), and `ScriptLogHub.TailLogAsync` streams the contents of the server `scripts-*.log` files. This is an unauthenticated information-disclosure channel that bypasses the (already broken — see Admin-001) page auth entirely.
**Recommendation:** Add `[Authorize]` to each `Hub` class, or chain `.RequireAuthorization()` onto each `MapHub(...)` call in `Program.cs`. The hub `SubscribeCluster`/`TailLogAsync` methods should additionally validate that the caller's claims permit the requested cluster/script scope.
**Resolution:** Resolved 2026-05-22 — `[Authorize]` added to `FleetStatusHub`, `AlertHub` and `ScriptLogHub`, and `.RequireAuthorization()` chained onto all three `MapHub(...)` calls in `Program.cs` as a belt-and-braces backstop, so an anonymous client can no longer open any hub connection. Covered by `AuthEndpointsTests.Anonymous_hub_negotiate_is_rejected`.
### Admin-004
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `appsettings.json:3,13-14` |
| Status | Resolved |
**Description:** The checked-in `appsettings.json` contains live-looking secrets in plaintext: the `ConfigDb` connection string with `User Id=sa;Password=OtOpcUaDev_2026!` and the LDAP `ServiceAccountPassword: "serviceaccount123"`. It also sets `Encrypt=False` and `AllowInsecureLdap: true`, so the SQL and LDAP credentials travel unencrypted on the wire. Committing the `sa` account password and a service-account password to source control is a credential-exposure risk; `sa` additionally grants full server control, conflicting with the `ClusterService` doc comment that production should connect with a least-privilege grant.
**Recommendation:** Move all secrets out of the committed file — use user-secrets for dev and environment variables / a secret store for production; leave only non-secret placeholders in `appsettings.json`. Use a least-privilege SQL login rather than `sa`. Enable TLS for both SQL (`Encrypt=True`) and LDAP (`UseTls=true`, `AllowInsecureLdap=false`) for any non-loopback deployment, and document the dev-only exception.
**Resolution:** Resolved 2026-05-22 — the `sa` connection string and the LDAP `ServiceAccountPassword` were replaced with empty placeholders in `appsettings.json`; a `_secrets` note documents that they are supplied via user-secrets (dev) or the `ConnectionStrings__ConfigDb` / `Authentication__Ldap__ServiceAccountPassword` environment variables (prod), and that the connection string must use `Encrypt=True` and a least-privilege SQL login. A `UserSecretsId` was added to the Admin csproj, and `Program.cs` now fails fast with a clear message when `ConfigDb` is empty/missing. Covered by `AppSettingsSecretHygieneTests`.
### Admin-005
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `Components/Pages/Login.razor:15,107-110` |
| Status | Resolved |
**Description:** `Login.razor` is an interactive component (the project default render mode is interactive server; the page declares no `@rendermode` but uses `EditForm`/`InputText` interactive binding and runs `SignInAsync` from an event handler). It calls `HttpContext.SignInAsync(...)` followed by `ctx.Response.Redirect("/")` from within a SignalR circuit callback. Writing auth cookies and HTTP redirect headers requires a live, unstarted HTTP response; in an interactive circuit the original HTTP response has long completed, so the cookie is typically not emitted and the redirect is ineffective (or throws "response has already started"). `admin-ui.md` section "Operator authentication" explicitly specifies the login as a static server-rendered HTML form POSTing to a `/auth/login` minimal-API endpoint with `data-enhance="false"` — that endpoint is not implemented and is not mapped in `Program.cs`.
**Recommendation:** Implement the login as designed: a static-rendered form (`@rendermode` none, `data-enhance="false"`) posting to a `MapPost("/auth/login", ...)` minimal-API handler that does the LDAP bind, grant resolution, `SignInAsync` and redirect while the HTTP response is still owned by the endpoint. Do not perform `SignInAsync` from an interactive circuit.
**Resolution:** Resolved 2026-05-22 — `Login.razor` rewritten as a static-rendered plain HTML `<form method="post" action="/auth/login" data-enhance="false">` (no `@rendermode`, no `EditForm`/`SignInAsync` in a circuit); the LDAP bind, grant resolution, cookie `SignInAsync` and redirect now run in a new `AuthEndpoints.MapAuthEndpoints()` minimal-API handler (`/auth/login`, `/auth/logout`) while the endpoint still owns the HTTP response. The handler is `AllowAnonymous`, carries an open-redirect guard on `returnUrl`, and surfaces bind errors back to the login page via a query-string. Covered by `AuthEndpointsTests` (valid login issues the cookie, invalid login redirects with error, open-redirect rejected, logout clears the cookie).
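The finding does not show how the open-redirect guard on `returnUrl` is implemented. As a language-neutral illustration of what such a guard typically checks, here is a minimal sketch; `is_local_return_url` and `safe_redirect_target` are hypothetical names, and the rules are a common conservative policy rather than the handler's actual logic.

```python
def is_local_return_url(url):
    """Accept only same-site, path-absolute URLs such as '/clusters/new'."""
    if not url or not url.startswith("/"):
        return False
    # '//host' and '/\host' are resolved by browsers as scheme-relative URLs,
    # so they can point at another origin despite the leading slash.
    if url.startswith("//") or url.startswith("/\\"):
        return False
    # Reject control characters, which some URL parsers silently strip.
    return not any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in url)


def safe_redirect_target(return_url, default="/"):
    """Fall back to the default page when the requested target is not local."""
    return return_url if is_local_return_url(return_url) else default
```

Falling back to a fixed default instead of returning an error keeps a tampered link from blocking an otherwise valid sign-in.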
### Admin-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `Components/Layout/MainLayout.razor:47-49`, `Program.cs:129,131-135` |
| Status | Resolved |
**Description:** `app.UseAntiforgery()` is enabled, but the Sign-out form (`<form method="post" action="/auth/logout">`) renders no antiforgery token, and the `MapPost("/auth/logout", ...)` endpoint does not call `.DisableAntiforgery()` or otherwise opt out. Depending on framework version this either makes logout fail with a 400 for legitimate users, or — if the endpoint is treated as exempt — leaves logout as an unprotected state-changing POST (CSRF logout). The same concern applies to the login form once Admin-005 is addressed.
**Recommendation:** Emit an antiforgery token in the logout form and let `UseAntiforgery()` validate it; or explicitly and deliberately mark the endpoint `.DisableAntiforgery()` if a tokenless logout is intended. Verify login/logout round-trips after the change.
**Resolution:** Resolved 2026-05-22 — `<AntiforgeryToken />` added to the sign-out form in `MainLayout.razor` and `.DisableAntiforgery()` removed from the `/auth/logout` endpoint so `UseAntiforgery()` validates the token; a tokenless POST now returns 400, preventing CSRF-logout. The login endpoint retains `.DisableAntiforgery()` (login is not a state-changing operation CSRF can abuse). `AuthEndpointsTests.Logout_without_antiforgery_token_is_rejected` regression-guards this.
### Admin-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `Components/Pages/Clusters/NewCluster.razor:91,95-96` |
| Status | Resolved |
**Description:** `NewCluster.CreateAsync` hardcodes `CreatedBy = "admin-ui"` (both on the `ServerCluster` row and the draft generation) instead of the signed-in operator principal name. `admin-ui.md` section "Audit" requires "the operator principal" be recorded on every write. The audit trail therefore cannot attribute cluster creation to a person. The same literal would apply to any anonymous creation that Admin-001/002 currently permit.
**Recommendation:** Pass the authenticated user identity (`ClaimTypes.Name` / `NameIdentifier` from the cascaded `AuthenticationState`) as `createdBy`. Apply the same pattern to every other Admin write path that records a `CreatedBy`/`PublishedBy`/`ReleasedBy` field.
**Resolution:** Resolved 2026-05-22 — `NewCluster.razor` and `ClusterDetail.razor` (the two pages that call `ClusterService.CreateAsync` / `GenerationService.CreateDraftAsync` with a hardcoded literal) now resolve `ClaimTypes.Name` / `ClaimTypes.NameIdentifier` from the cascaded `AuthenticationState` and pass the operator principal name as `createdBy`; the fallback is `"unknown"` (defensive, should never occur on an `[Authorize]`-gated page).
### Admin-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `Services/ReservationService.cs:28-37` |
| Status | Resolved |
**Description:** `ReservationService.ReleaseAsync` calls `sp_ReleaseExternalIdReservation` with only `@Kind`, `@Value`, `@ReleaseReason`. `admin-ui.md` section "Release an external-ID reservation" specifies the proc sets `ReleasedBy` to the FleetAdmin who performed the release, and the action is the only path that allows ZTag/SAPID reuse and "requires explicit FleetAdmin action with a documented reason." The service does not capture or pass the operator principal, so the compliance audit trail for a release records no actor (unless the proc derives it from the DB session login, which would be the shared service account, not the operator).
**Recommendation:** Add an operator-principal parameter to `ReleaseAsync`, pass it to the stored proc as `@ReleasedBy`, and have callers supply the signed-in user. Confirm the proc signature accepts it.
**Resolution:** Resolved 2026-05-22 — a new EF migration (`20260522000001_AddReleasedByToReleaseExternalIdReservation`) adds `@ReleasedBy nvarchar(128)` to `sp_ReleaseExternalIdReservation` and uses it for both `ExternalIdReservation.ReleasedBy` and `ConfigAuditLog.Principal` (replacing `SUSER_SNAME()`); `ReservationService.ReleaseAsync` gains a `releasedBy` parameter with a guard; `Reservations.razor` resolves `ClaimTypes.Name` / `ClaimTypes.NameIdentifier` from the cascaded `AuthenticationState` and passes the operator principal to the service.
### Admin-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Admin` (whole module) |
| Status | Resolved |
**Description:** The module's most security-critical behaviours have no enforced test coverage at the boundary that matters. There is no test that an unauthenticated request to a page or hub is rejected (which would have caught Admin-001/002/003), no test of the login -> cookie issuance round-trip (Admin-005), and the `AdminRoleGrantResolver` / `ClusterRoleClaims` authorization logic is exercised only in isolation. `InternalsVisibleTo` points at `ZB.MOM.WW.OtOpcUa.Admin.Tests`, but the auth pipeline itself is not asserted end-to-end. Per `REVIEW-PROCESS.md` category 9 these are untested critical paths.
**Recommendation:** Add `WebApplicationFactory`-based integration tests asserting: (a) anonymous GET of each protected route returns 302->/login or 401; (b) anonymous hub connect is refused; (c) a valid login issues the cookie and a subsequent request is authorized; (d) a `ConfigViewer` is denied `CanPublish` pages. Wire the check into the `*.Admin.Tests` suite.
**Resolution:** Resolved 2026-05-22 — (a) covered by existing `PageAuthorizationTests`; (b) covered by existing `AuthEndpointsTests.Anonymous_hub_negotiate_is_rejected`; (c) covered by existing `AuthEndpointsTests.Valid_login_issues_the_auth_cookie_and_redirects_home`; (d) new `AdminAuthPipelineTests` adds a `WebApplicationFactory` with a `RoleInjectingHandler` that stamps requests with caller-supplied roles, asserting that `ConfigViewer` is denied `CanPublish`-gated pages (403/302) while `FleetAdmin` is permitted, and that a `FleetAdmin` session can reach protected pages.
### Admin-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `Components/App.razor:9,16` |
| Status | Open |
**Description:** `App.razor` loads Bootstrap CSS and JS from the `cdn.jsdelivr.net` CDN. `admin-ui.md` section "Tech Stack" specifies "Bootstrap 5 vendored under `wwwroot/lib/bootstrap/`" precisely so the Admin app has no third-party runtime dependency. A CDN reference makes the UI fail in air-gapped / locked-down fleet deployments (a stated deployment target), introduces an uncontrolled third-party origin, and is not covered by a Subresource Integrity hash.
**Recommendation:** Vendor Bootstrap under `wwwroot/lib/bootstrap/` and reference the local copies, as the design doc requires. If a CDN is retained for any asset, add `integrity` + `crossorigin` SRI attributes.
**Resolution:** _(open)_
### Admin-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `Hubs/FleetStatusPoller.cs:24-26,98-103` |
| Status | Open |
**Description:** `FleetStatusPoller` keeps three plain `Dictionary<>` fields (`_last`, `_lastRole`, `_lastResilience`) mutated from `PollOnceAsync`. The poller's `ExecuteAsync` loop is single-threaded so the steady-state poll path is safe, but `ResetCache()` (exposed `internal` for tests) clears those same dictionaries with no synchronization. If a test (or any caller) invokes `ResetCache()` while a poll tick is mid-iteration, the `Dictionary` enumeration/mutation race can throw `InvalidOperationException` or corrupt state.
**Recommendation:** Either document `ResetCache()` as "only safe when the poller is stopped" and have tests stop the service first, or guard the three dictionaries with a lock / swap them atomically. Using `ConcurrentDictionary` (as the sibling `ResilientLdapGroupRoleMappingService` does) would make the intent explicit.
**Resolution:** _(open)_
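The lock-guarded option from the recommendation can be illustrated in a language-neutral way. The sketch below is not the poller's code: `PollCache`, `record` and `reset` are hypothetical names, and a single dictionary stands in for the three caches.

```python
import threading


class PollCache:
    """Last-seen state per node, safe to reset while a poll is running."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = {}

    def record(self, node_id, state):
        """Store the state and report whether it changed since the last poll."""
        with self._lock:
            changed = self._last.get(node_id) != state
            self._last[node_id] = state
            return changed

    def reset(self):
        """Swap in an empty dictionary under the same lock the poll path uses."""
        with self._lock:
            self._last = {}
```

Because every read and write goes through the one lock, a reset can only happen between two `record` calls, never in the middle of one.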
### Admin-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `Services/EquipmentCsvImporter.cs:18-19,33-37,229,232` |
| Status | Open |
**Description:** `EquipmentCsvImporter` declares `EquipmentId` as a required CSV column and parses it into a `required` field. `admin-ui.md` section "Equipment CSV import" (revised after adversarial review finding #4) is explicit: "No `EquipmentId` column — operator-supplied EquipmentId would mint duplicate equipment identity on typos ... never accepted from CSV imports." `EquipmentId` is system-derived (`EQ-` plus first 12 hex chars of `EquipmentUuid`). Accepting it from CSV either contradicts the design or silently lets an import set an identity field the doc says is un-settable. The XML doc on the class also cites the column as required per "decision #117", so either the code or the design doc is stale. `EquipmentImportBatchService.StageRowsAsync` propagates `row.EquipmentId` into the staging row, so any change must cover the finalize path.
**Recommendation:** Reconcile with the design: drop `EquipmentId` from `RequiredColumns` and the `EquipmentCsvRow` shape (deriving it from `EquipmentUuid` at finalize time), or — if accepting it is a deliberate reversal — update `admin-ui.md` and the decision log so the two agree.
**Resolution:** _(open)_
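The derivation rule quoted in the description (`EQ-` plus the first 12 hex characters of `EquipmentUuid`) is small enough to show directly. This sketch makes two assumptions the finding does not state: the hex characters are taken with hyphens removed, and they are lowercase. `derive_equipment_id` is a hypothetical name.

```python
import uuid


def derive_equipment_id(equipment_uuid):
    """Build the system-derived ID from a uuid.UUID value."""
    # UUID.hex is the 32-character lowercase form without hyphens.
    return "EQ-" + equipment_uuid.hex[:12]


example = derive_equipment_id(uuid.UUID("3f2504e0-4f89-11d3-9a0c-0305e82c3301"))
# example == "EQ-3f2504e04f89"
```

Deriving the value at finalize time means a typo in a CSV cell can never mint a second identity for the same equipment, which is the concern the design doc raises.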
### Admin-013
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | `Components/Pages/Clusters/ClusterDetail.razor:180-197`, `Components/Pages/Clusters/AclsTab.razor`, `Components/Pages/Clusters/RedundancyTab.razor`, `Components/Pages/RoleGrants.razor`, `Components/Pages/Hosts.razor`, `Components/Pages/ScriptLog.razor`, `Program.cs:157-159` |
| Status | Resolved |
**Description:** The Admin-003 fix gated all three SignalR hubs with `[Authorize]` plus `.RequireAuthorization()`, but the six pages that open a client `HubConnection` to those hubs were never updated to authenticate. A server-side Blazor `HubConnection` runs inside the interactive circuit and has no access to the browser's HttpOnly `OtOpcUa.Admin` auth cookie, so the hub `negotiate` request returns 401. Four pages (`ClusterDetail`, `AclsTab`, `RedundancyTab`, `RoleGrants`) called `HubConnection.StartAsync()` with no `try`/`catch`, so the 401 surfaced as an unhandled exception — a full HTTP 500 page for the prerendered `/clusters/{ClusterId}` route (the core cluster-config surface) and a faulted circuit for the others. `Hosts` and `ScriptLog` already wrapped the connect in `try`/`catch`, so they did not crash, but the SignalR live-update feature was non-functional Admin-wide regardless. The Admin-003 hardening was therefore incomplete: it secured the hub server side without giving the in-process clients any way to present credentials. Discovered during a post-review browser smoke test of `/clusters/cluster-dev`.
**Recommendation:** Two parts. (1) Stop the crash: guard every `HubConnection.StartAsync()` in `try`/`catch`, matching the best-effort pattern already documented in `Hosts.razor` — a hub hiccup must degrade live updates, not fault the page. (2) Restore the feature: give the hub clients a real credential. Cookie forwarding is not viable (the HttpOnly cookie is unreachable from the interactive circuit and persisting it into page state would leak it), so add a token scheme — mint a short-lived token for the circuit's authenticated user and supply it via `HttpConnectionOptions.AccessTokenProvider`, with a matching server-side authentication handler on the hub endpoints.
**Resolution:** Resolved 2026-05-22 — (1) `StartAsync`/`SendAsync` wrapped in `try`/`catch` on `ClusterDetail`, `AclsTab`, `RedundancyTab` and `RoleGrants` so a hub failure degrades gracefully. (2) Added a bearer-token auth path: `HubTokenService` mints/validates short-lived tokens using ASP.NET Core Data Protection (no signing-key management, no new packages); `HubTokenAuthenticationHandler` is a custom `HubToken` scheme reading the token from the `Authorization: Bearer` header (negotiate) or the `access_token` query parameter (WebSocket upgrade); the `HubClients` authorization policy runs both the cookie and `HubToken` schemes and is applied via `RequireAuthorization("HubClients")` on all three `MapHub` calls; `AdminHubConnectionFactory` builds connections with an `AccessTokenProvider` that re-mints a token for the circuit's authenticated user on every (re)connect, and all six hub-consuming pages resolve their connections through it. Verified end-to-end in the browser: hub `negotiate` returns 200 and the WebSocket upgrades (101) where it previously 401'd.
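The mint/validate shape of a short-lived hub token can be sketched independently of the framework. The example below deliberately swaps in a plain HMAC-SHA256 signature where the real `HubTokenService` uses ASP.NET Core Data Protection, so it illustrates the idea only; the function names, the 60-second lifetime and the token layout are all assumptions made for the example.

```python
import base64
import hashlib
import hmac
import json
import time


def mint_token(key, name, roles, ttl_seconds=60, now=None):
    """Return 'payload.signature', both parts URL-safe base64."""
    issued = time.time() if now is None else now
    payload = json.dumps(
        {"name": name, "roles": roles, "exp": issued + ttl_seconds}
    ).encode()
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def validate_token(key, token, now=None):
    """Return the claims dict, or None if the token is malformed, forged or expired."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong part count or invalid base64
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    return claims if claims["exp"] > current else None
```

Minting a fresh token on every (re)connect, as the resolution describes, is what lets the lifetime stay short: a token only has to survive the negotiate request and the WebSocket upgrade that follows it.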
+139
@@ -0,0 +1,139 @@
# Code Review — Analyzers
| Field | Value |
|---|---|
| Module | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Analyzers-001, Analyzers-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | No issues found |
| 4 | Error handling & resilience | Analyzers-003 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Analyzers-004 |
| 7 | Design-document adherence | Analyzers-005 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Analyzers-006 |
| 10 | Documentation & comments | Analyzers-007 |
## Findings
### Analyzers-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:135-139` |
| Status | Resolved |
**Description:** `IsInsideWrapperLambda` treats a guarded call as "wrapped" if it is textually inside ANY lambda that is an argument to ANY invocation whose containing type is `CapabilityInvoker` or `AlarmSurfaceInvoker`. It matches the containing type only, never the parameter the lambda is bound to. The real wrapping contract is specifically the `callSite` (`Func<CancellationToken, ValueTask>` / `Func<CancellationToken, ValueTask<T>>`) parameter of `CapabilityInvoker.ExecuteAsync` / `ExecuteWriteAsync`. Any other lambda argument to a method on those types — a future overload that takes a predicate/selector lambda, or a lambda passed in a non-`callSite` position — would suppress the diagnostic even though the guarded call is not actually executed inside the resilience pipeline. The analyzer's own XML doc (lines 21-23) describes exactly this looser-than-intended behaviour. It is a latent false-negative gap rather than an active bug because the current `CapabilityInvoker` surface has no non-`callSite` lambda parameter.
**Recommendation:** Resolve the symbol of the lambda argument's parameter (`IMethodSymbol.Parameters[i]`) and require its type to be the `Func<CancellationToken, ValueTask>` / `Func<CancellationToken, ValueTask<T>>` callsite shape, or at minimum match the wrapper method name (`ExecuteAsync` / `ExecuteWriteAsync`) rather than only the containing type. This closes the gap before a new overload silently widens the escape hatch.
**Resolution:** Resolved 2026-05-22 — Replaced `WrapperTypes` string array with `WrapperMethods` (type FQN + method name) tuples so `IsInsideWrapperLambda` matches both containing type and method name, preventing future non-`callSite` overloads from silently suppressing the diagnostic.
### Analyzers-002
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:46-50,130` |
| Status | Open |
**Description:** `AlarmSurfaceInvoker` is listed in `WrapperTypes`, but `AlarmSurfaceInvoker`'s public methods (`SubscribeAsync`, `UnsubscribeAsync`, `AcknowledgeAsync`) take no lambda arguments at all — callers pass `IReadOnlyList<...>` / `IAlarmSubscriptionHandle`, and the invoker builds the resilience lambdas internally. `IsInsideWrapperLambda` only ever returns `true` when it finds an `AnonymousFunctionExpressionSyntax` argument in the outer call's argument list. Because no `AlarmSurfaceInvoker` call site can have a lambda argument, the `AlarmSurfaceInvoker` entry in `WrapperTypes` is effectively dead — it can never satisfy the suppression condition. Guarded `IAlarmSource` calls written inside `AlarmSurfaceInvoker.cs` are in fact suppressed correctly, but only because they sit inside `CapabilityInvoker.ExecuteAsync` lambdas (the `CapabilityInvoker` entry does the work). The dead entry is misleading and suggests the analyzer recognises an `AlarmSurfaceInvoker` "lambda home" that does not exist.
**Recommendation:** Either remove `AlarmSurfaceInvoker` from `WrapperTypes` (its calls are already covered transitively by the `CapabilityInvoker` match) and update the XML doc, or — if the intent is to allow `IAlarmSource` calls anywhere inside `AlarmSurfaceInvoker` regardless of lambda nesting — add an explicit "call site is lexically within the `AlarmSurfaceInvoker` type declaration" check rather than relying on a lambda-argument scan that never fires.
**Resolution:** _(open)_
### Analyzers-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:80,114-116` |
| Status | Open |
**Description:** `IsInsideWrapperLambda` is passed `context.Operation.SemanticModel` and returns `false` when that model is `null`. A `false` return means "not wrapped", so a null semantic model produces a false-positive diagnostic rather than silently skipping the call. For `RegisterOperationAction` the `SemanticModel` is non-null in normal compilation, so this is low-risk in practice, but the failure mode is the wrong direction — a tooling/IDE edge case where the model is unavailable would flag correct code. Separately, the analyzer has no defensive guard against partially-bound / malformed call sites: `method.ContainingType`, `method.ReturnType`, and `iface.GetMembers()` are dereferenced without null checks. `IInvocationOperation.TargetMethod` is non-null by contract and `ContainingType` is non-null for an ordinary method, so a hard crash is unlikely, but an analyzer that throws on malformed in-progress syntax degrades the IDE experience for the whole solution.
**Recommendation:** When `semanticModel is null` in `AnalyzeInvocation`, return early (skip the call) instead of letting `IsInsideWrapperLambda` report it as unwrapped, so unavailable semantics never produce a false positive.
**Resolution:** _(open)_
### Analyzers-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:95-112` |
| Status | Open |
**Description:** `ImplementsGuardedInterface` runs on every invocation operation in the compilation (every keystroke in the IDE). For each candidate it allocates via `AllInterfaces.Concat(new[] { method.ContainingType })`, builds a fully-qualified display string per interface and calls `string.Replace("global::", ...)`, then for matching interfaces iterates `iface.GetMembers().OfType<IMethodSymbol>()` calling `FindImplementationForInterfaceMember` per member. The `GuardedInterfaces` / `WrapperTypes` lookups are `string[].Contains` (linear scan) rather than a hash set. None of this is catastrophic — the interface sets are tiny — but the work is repeated for every invocation including the overwhelming majority that target non-guarded methods, and the FQN string formatting plus `Replace` allocation on the hot path is avoidable.
**Recommendation:** Move to `RegisterCompilationStartAction`: resolve the guarded interface and wrapper-type symbols once via `Compilation.GetTypeByMetadataName`, capture them, and compare invocation symbols by `SymbolEqualityComparer` identity. Replace the `string[]` membership checks with a `HashSet`. This also makes the analyzer correctly no-op in compilations that do not reference `Core.Abstractions`.
**Resolution:** _(open)_
### Analyzers-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:33-43` |
| Status | Open |
**Description:** `CapabilityInvoker`'s XML doc (`src/Core/.../Resilience/CapabilityInvoker.cs:15-17`) enumerates the routed capability surface as `IReadable`, `IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`, `IAlarmSource`, and all four `IHistoryProvider` reads — matching the analyzer's `GuardedInterfaces` set. However `IHistoryProvider` exposes five async methods, and two of them (`ReadAtTimeAsync`, `ReadEventsAsync`) are C# default-interface-method implementations. When a driver does not override a DIM and a caller invokes it through a concrete driver reference, `FindImplementationForInterfaceMember` returns the interface's own default method symbol; the second equality branch (`method.OriginalDefinition` == `member`) still catches the interface-typed-receiver case, so detection holds — but this DIM interaction is undocumented and untested, and a future driver that overrides one DIM but not the other creates an asymmetric guarded surface that nobody has verified.
**Recommendation:** Add explicit test cases (see Analyzers-006) for `IHistoryProvider` calls via both an interface-typed receiver and a concrete driver that (a) overrides and (b) inherits the default `ReadAtTimeAsync` / `ReadEventsAsync`. If a gap is found, handle DIM members explicitly. Add a short remark to the analyzer XML doc noting the default-interface-method consideration.
**Resolution:** _(open)_
### Analyzers-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers.Tests/UnwrappedCapabilityCallAnalyzerTests.cs` |
| Status | Resolved |
**Description:** The test suite exercises only 3 of the 7 guarded interfaces (`IReadable`, `IWritable`, `ITagDiscovery`) and one positive / one negative lambda case. Significant untested behaviour for an analyzer that gates a repo-wide resilience invariant:
- No test for `ISubscribable`, `IHostConnectivityProbe`, `IAlarmSource`, or `IHistoryProvider` — four of seven guarded interfaces, including the two (`IAlarmSource`, `IHistoryProvider`) with the most subtle wrapping story.
- No test that a synchronous guarded-type member is NOT flagged — `IHostConnectivityProbe.GetHostStatuses()` is explicitly called out in the source comment (lines 75-77) as something the `IsAsyncReturningType` filter must let through, yet there is no regression test pinning that.
- No test for a concrete driver class implementing the interface (the receiver is always the interface type `IReadable driver`); the `FindImplementationForInterfaceMember` branch of `ImplementsGuardedInterface` — the entire reason the source comment claims an unusually-named method implementing `IReadable.ReadAsync` still trips the rule — is never executed by a test.
- No test for `ExecuteWriteAsync` (only `ExecuteAsync` is covered) and no test for `AlarmSurfaceInvoker`.
- No test for nested lambdas or for the generated-code exclusion (`ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None)`).
- The `StubSources` constant omits `ISubscribable` / `IAlarmSource` / `IHistoryProvider` / `IHostConnectivityProbe` and `AlarmSurfaceInvoker` entirely, so those paths cannot be tested without extending it.
**Recommendation:** Extend `StubSources` with the remaining guarded interfaces and `AlarmSurfaceInvoker`, then add tests for: each remaining guarded interface (positive plus wrapped), a synchronous member not being flagged, a concrete driver-class receiver with a renamed implementing method, `ExecuteWriteAsync` wrapping, and a nested-lambda case.
**Resolution:** Resolved 2026-05-22 — Extended `StubSources` with `ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IHostConnectivityProbe`, and `AlarmSurfaceInvoker` stubs; added 14 new tests covering each missing guarded interface (positive + wrapped), synchronous member not flagged, concrete driver receiver, `ExecuteWriteAsync` wrapping, and nested-lambda cases (19 tests total, all passing).
### Analyzers-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:21-26` |
| Status | Open |
**Description:** The `<remarks>` block states the analyzer "matches by receiver-interface identity using Roslyn's semantic model, not by method name". This is accurate for the guarded-call detection (`ImplementsGuardedInterface` uses symbols), but the wrapper detection in `IsInsideWrapperLambda` is described in the same block as walking the syntax tree and checking enclosing invocations by containing type — and that detection is in fact looser than the prose implies (see Analyzers-001): it does not verify the lambda is bound to the resilience `callSite` parameter. The XML doc reads as if the wrapper match is precise. The `<remarks>` also notes the rule does not enforce the capability argument matches the method, but omits the more important current limitation — that a lambda in any argument position of a wrapper-typed call suppresses the diagnostic.
**Recommendation:** Tighten the `<remarks>` to state precisely what `IsInsideWrapperLambda` checks today (textual containment within a lambda argument of a `CapabilityInvoker` / `AlarmSurfaceInvoker`-typed invocation), and note the known limitation that it does not bind the lambda to the `callSite` parameter. Keep the doc in sync if Analyzers-001 is fixed.
**Resolution:** _(open)_
@@ -0,0 +1,271 @@
# Code Review — Client.CLI
| Field | Value |
|---|---|
| Module | `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 8 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Client.CLI-001, Client.CLI-002, Client.CLI-003 |
| 2 | OtOpcUa conventions | Client.CLI-004 |
| 3 | Concurrency & thread safety | Client.CLI-005 |
| 4 | Error handling & resilience | Client.CLI-006 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Client.CLI-007 |
| 7 | Design-document adherence | Client.CLI-008 |
| 8 | Code organization & conventions | Client.CLI-009 |
| 9 | Testing coverage | Client.CLI-010 |
| 10 | Documentation & comments | Client.CLI-008 |
## Findings
### Client.CLI-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `Commands/HistoryReadCommand.cs:73`, `Commands/HistoryReadCommand.cs:76` |
| Status | Resolved |
**Description:** The start and end options are parsed with `DateTime.Parse(StartTime)` with
no `IFormatProvider` or `DateTimeStyles`. Parsing therefore depends on the current OS
culture: the same `--start "03/04/2026"` resolves to March 4 on an en-US box and April 3
on an en-GB box. The CLI is documented as cross-platform and the value silently produces a
different (wrong) history window rather than failing. The doc claims "ISO 8601 or date
string" but ISO interpretation is not guaranteed without `DateTimeStyles.RoundtripKind` or
`CultureInfo.InvariantCulture`. A bare date string is also assumed to be local time, then
`.ToUniversalTime()` shifts it by the host offset, so the same input yields different
ranges on machines in different time zones.
**Recommendation:** Parse with `CultureInfo.InvariantCulture` and
`DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal` (or require explicit
ISO 8601 via `DateTimeOffset.Parse`), and document the expected format and timezone
assumption precisely.
**Resolution:** Resolved 2026-05-22 — `DateTime.Parse` replaced with `CultureInfo.InvariantCulture` + `DateTimeStyles.AssumeUniversal | AdjustToUniversal`; option descriptions updated to document ISO 8601 UTC format.
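A minimal sketch of the culture-independent parse described in this finding. `ParseUtc` is a hypothetical helper name, not taken from `HistoryReadCommand`; only the `DateTime.Parse` overload and the two `DateTimeStyles` flags come from the recommendation.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: parses with the invariant culture and treats input
// that carries no offset as UTC, so the result does not depend on the host
// culture or time zone.
static DateTime ParseUtc(string text) =>
    DateTime.Parse(
        text,
        CultureInfo.InvariantCulture,
        DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);

DateTime iso = ParseUtc("2026-03-04T00:00:00Z");
DateTime bare = ParseUtc("03/04/2026"); // invariant culture reads this as month/day/year

Console.WriteLine($"{iso:O} ({iso.Kind})");
Console.WriteLine($"{bare:O} ({bare.Kind})");
```

With these flags a bare date is read as midnight UTC rather than local midnight, which is the assumption the updated option descriptions would need to state.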
### Client.CLI-002
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `Commands/SubscribeCommand.cs:129-137` |
| Status | Open |
**Description:** The summary computes `neverWentBad` as every target whose node-id key is
absent from the `everBad` dictionary. A node that received no update at all is also absent
from `everBad`, so it is counted in `neverWentBad` and printed under the heading
"--- Nodes that NEVER received a bad-quality update (suspect) ---". The same node is also
listed separately under `never` ("never received an update at all"). Labeling a node that
produced zero notifications as a "suspect that never went bad" is misleading — it has not
been observed at all, which is a different (and arguably worse) condition than a node that
streamed only good values.
**Recommendation:** Exclude no-update nodes from the `neverWentBad` set, e.g.
`targets.Where(t => lastStatus.ContainsKey(key) && !everBad.ContainsKey(key))` with `key`
being the node-id key derived from `t`, so the "suspect" list only contains nodes that were
actually observed and never reported bad quality.
**Resolution:** _(open)_
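The bucketing proposed above could look roughly like the sketch below. The dictionaries are simplified stand-ins keyed directly by node-id string; the real `SubscribeCommand` bookkeeping types are not shown in this review and are assumed here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-ins for the command's bookkeeping (assumed shapes).
var targets = new[] { "ns=2;s=A", "ns=2;s=B", "ns=2;s=C" };
var lastStatus = new Dictionary<string, string>
{
    ["ns=2;s=A"] = "Good", // observed, only good values
    ["ns=2;s=B"] = "Bad",  // observed, went bad at least once
};                         // C never produced a notification
var everBad = new Dictionary<string, bool> { ["ns=2;s=B"] = true };

// Nodes that were never observed at all.
var never = targets.Where(t => !lastStatus.ContainsKey(t)).ToList();

// Nodes that were observed and never reported bad quality.
var neverWentBad = targets
    .Where(t => lastStatus.ContainsKey(t) && !everBad.ContainsKey(t))
    .ToList();

Console.WriteLine($"never updated:  {string.Join(", ", never)}");
Console.WriteLine($"never went bad: {string.Join(", ", neverWentBad)}");
```

Requiring presence in `lastStatus` keeps the two lists disjoint, so a silent node is reported only once.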
### Client.CLI-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `Commands/BrowseCommand.cs:29-30`, `Commands/SubscribeCommand.cs:20-27`, `Commands/AlarmsCommand.cs:28-29`, `Commands/HistoryReadCommand.cs:42-43` |
| Status | Open |
**Description:** Numeric command options accept any value with no range validation.
`--depth`, `--interval`, `--max-depth`, `--max`, and the history `--interval` can all be
supplied as `0` or a negative number. A negative `--depth`/`--max-depth` silently disables
recursion or under-traverses; a zero/negative sampling `--interval` is passed straight
through to `SubscribeAsync` and depends on the SDK/server to reject it; a negative `--max`
is forwarded to `HistoryReadRawAsync`. None of these produce a clear operator-facing error.
**Recommendation:** Validate option ranges at the start of `ExecuteAsync` and throw
`CliFx.Exceptions.CommandException` with an actionable message when a value is out of
range.
**Resolution:** _(open)_
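A sketch of the up-front range check, under stated assumptions: `RequireAtLeast` is a hypothetical helper, and `ArgumentOutOfRangeException` stands in for `CliFx.Exceptions.CommandException` so the example compiles without the CliFx package. The chosen minimums are illustrative, not taken from the commands.

```csharp
using System;

// Hypothetical helper. ArgumentOutOfRangeException is a stand-in; the real
// command would throw CliFx.Exceptions.CommandException with this message.
static void RequireAtLeast(string optionName, int value, int minimum)
{
    if (value < minimum)
    {
        throw new ArgumentOutOfRangeException(
            optionName, value, $"--{optionName} must be >= {minimum} (got {value}).");
    }
}

// Checks of this kind would sit at the top of ExecuteAsync.
int maxDepth = 3;
int intervalMs = 1000;
RequireAtLeast("max-depth", maxDepth, 0);
RequireAtLeast("interval", intervalMs, 1);

// An out-of-range value is rejected before any server call is made.
string rejected = "";
try { RequireAtLeast("interval", -5, 1); }
catch (ArgumentOutOfRangeException ex) { rejected = ex.ParamName ?? ""; }

Console.WriteLine($"rejected option: {rejected}");
```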
### Client.CLI-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `Commands/SubscribeCommand.cs:13-37` |
| Status | Open |
**Description:** `SubscribeCommand` is the only command in the module whose constructor
and all `[CommandOption]` properties have no XML doc comments. Every other command
(`ConnectCommand`, `ReadCommand`, `WriteCommand`, `BrowseCommand`, `AlarmsCommand`,
`HistoryReadCommand`, `RedundancyCommand`) and `CommandBase` carry `<summary>` docs on the
type, constructor, and options. The inconsistency is visible in IDE tooltips and breaks the
otherwise-uniform documentation convention of the module.
**Recommendation:** Add `<summary>` XML docs to the `SubscribeCommand` constructor and to
each of its option properties, matching the style used by the sibling commands.
**Resolution:** _(open)_
### Client.CLI-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `Commands/SubscribeCommand.cs:66-78`, `Commands/AlarmsCommand.cs:52-64` |
| Status | Resolved |
**Description:** The `DataChanged` and `AlarmEvent` handlers write to `console.Output`
(a `System.IO.TextWriter`) directly from the OPC UA SDK subscription/notification thread,
while the command main flow is awaiting `Task.Delay(Timeout.Infinite, ct)` and the summary
block also writes to the same `console.Output`. `TextWriter` instances are not guaranteed
thread-safe; concurrent `WriteLine` calls from the notification thread and the main thread
(a data-change notification arriving while the summary is being printed, or two
notifications from different SDK threads) can interleave or corrupt output. The handler
also calls the synchronous `WriteLine` without catching exceptions, so a write fault would
propagate into the SDK callback.
**Recommendation:** Serialize console writes from event handlers — funnel notifications
through a `Channel<T>` drained by the main thread, or guard every `console.Output` write
with a shared lock. At minimum, ensure handler exceptions cannot escape into the SDK
callback.
**Resolution:** Resolved 2026-05-22 — notification handlers in `SubscribeCommand` and `AlarmsCommand` now enqueue lines to an `UnboundedChannel<string>` via `TryWrite`; the main thread drains the channel via `ReadAllAsync`. Handlers are named local functions so they can be unsubscribed before the summary phase; all handler exceptions are swallowed to protect the SDK callback.
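The channel-based pattern in the resolution can be sketched as follows. This is an illustration, not the code from `SubscribeCommand`: `OnDataChanged` and the two producer tasks are stand-ins for the SDK notification threads, and only `System.Threading.Channels` APIs are used.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

var channel = Channel.CreateUnbounded<string>(
    new UnboundedChannelOptions { SingleReader = true });

// Stand-in for a notification handler: it only enqueues a line and never
// lets an exception escape into the caller (the SDK callback in the real code).
void OnDataChanged(string nodeId, object value)
{
    try { channel.Writer.TryWrite($"{nodeId} = {value}"); }
    catch { /* swallow: protect the callback thread */ }
}

// Simulate notifications arriving from two different threads.
var producers = new[]
{
    Task.Run(() => { for (var i = 0; i < 3; i++) OnDataChanged("ns=2;s=A", i); }),
    Task.Run(() => { for (var i = 0; i < 3; i++) OnDataChanged("ns=2;s=B", i); }),
};
await Task.WhenAll(producers);
channel.Writer.Complete();

// A single consumer drains the channel, so console writes never overlap.
var printed = new List<string>();
await foreach (var line in channel.Reader.ReadAllAsync())
{
    Console.WriteLine(line);
    printed.Add(line);
}
```

In a long-running command the drain loop would take the command's cancellation token instead of waiting for `Complete()`.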
### Client.CLI-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `Commands/HistoryReadCommand.cs:73`, `Commands/HistoryReadCommand.cs:76`, `Helpers/NodeIdParser.cs:39` |
| Status | Open |
**Description:** Operator input-format errors surface as raw .NET exceptions rather than
clean CLI errors. An unparseable start/end value throws `FormatException` straight out of
`DateTime.Parse`; an invalid node id throws `FormatException`/`ArgumentException` from
`NodeIdParser`. CliFx renders unhandled exceptions with a stack trace, which is noisy for a
user-input mistake. Other tooling in this module already distinguishes operator errors
(`ParseAggregateType` throws `ArgumentException` with a helpful message) but none of these
is converted to a `CliFx.Exceptions.CommandException` with a clean exit code.
**Recommendation:** Catch the predictable input-validation exceptions and rethrow as
`CommandException` with a concise message and a non-zero exit code, so malformed input
yields a one-line error instead of a stack trace.
**Resolution:** _(open)_
### Client.CLI-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `CommandBase.cs:112-123` |
| Status | Open |
**Description:** `ConfigureLogging` builds a new Serilog `LoggerConfiguration`, creates a
logger, and assigns it to the static `Log.Logger` without disposing the previously
assigned logger. For a single CLI invocation this leaks at most one logger and the process
exits shortly after, so impact is minimal — but `CommandBase` is also exercised repeatedly
in-process by the unit-test suite, where each `ExecuteAsync` replaces `Log.Logger` and
abandons the prior console sink without disposal. The pattern is incorrect:
`Log.CloseAndFlush()` (or disposing the prior logger) should run before reassignment.
**Recommendation:** Call `Log.CloseAndFlush()` before assigning a new `Log.Logger`, or
build the logger into a local `ILogger` the command owns and disposes, rather than mutating
global static state per command.
**Resolution:** _(open)_
### Client.CLI-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `docs/Client.CLI.md:158-217` |
| Status | Open |
**Description:** `docs/Client.CLI.md` is stale relative to the code at this commit.
(1) The `subscribe` command section documents only `-n` and `-i`, but the code
(`SubscribeCommand`) also exposes `-r/--recursive`, `--max-depth`, `-q/--quiet`,
`--duration`, and `--summary-file` — none are documented, and the documented Ctrl+C-only
lifecycle no longer matches `--duration` auto-exit.
(2) The `historyread` "Aggregate mapping" table lists six aggregates but the code
(`HistoryReadCommand.ParseAggregateType` and `AggregateType`) also supports
`StandardDeviation` (aliases `stddev`/`stdev`); the doc option table omits it while the
code option description includes it.
**Recommendation:** Regenerate the `subscribe` and `historyread` sections of
`docs/Client.CLI.md` from the current option set, including the five new subscribe flags
and the `StandardDeviation` aggregate row.
**Resolution:** _(open)_
### Client.CLI-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `Commands/SubscribeCommand.cs:66-165`, `Commands/AlarmsCommand.cs:52-91` |
| Status | Open |
**Description:** Both long-running commands attach an event handler
(`service.DataChanged += ...`, `service.AlarmEvent += ...`) with a lambda and never detach
it. Because the handler closes over `console`, the captured console and the closure remain
referenced by the service until the service is disposed in the `finally` block. In
practice the service is per-command and disposed at the end, so this does not leak across
commands — but it is a latent footgun: a handler can still fire between `UnsubscribeAsync`
/ `UnsubscribeAlarmsAsync` and `Dispose`, writing to a console that the command considers
finished (overlapping with Client.CLI-005). The cleanup unsubscribes the monitored items
but never the .NET event.
**Recommendation:** Detach the handler explicitly (`service.DataChanged -= handler`) after
unsubscribing, using a named local delegate so it can be removed, ensuring no notification
is processed after the command output phase ends.
**Resolution:** _(open)_
### Client.CLI-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Client/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests/SubscribeCommandTests.cs` |
| Status | Open |
**Description:** The new `SubscribeCommand` capabilities are largely untested. The four
`SubscribeCommandTests` cover only single-node subscribe, unsubscribe-on-cancel,
disconnect-in-finally, and the subscription message. There is no test for the `--recursive`
browse-and-collect path (`CollectVariablesAsync`), the `--duration` auto-exit path, the
summary classification logic (`good`/`bad`/`never`/`neverWentBad`, including the
mislabeling noted in Client.CLI-002), the `--quiet` flag, the `--summary-file` write, or
per-node subscribe-failure handling. The summary logic is the most behaviour-rich part of
the command and the part most likely to regress.
**Recommendation:** Add unit tests for recursive variable collection, the duration-based
exit, summary bucketing across good/bad/no-update nodes, and the `--summary-file` output.
The `FakeOpcUaClientService` already exposes `RaiseDataChanged`, so feeding good/bad values
and asserting the summary text is straightforward.
**Resolution:** _(open)_
@@ -0,0 +1,192 @@
# Code Review — Client.Shared
| Field | Value |
|---|---|
| Module | `src/Client/ZB.MOM.WW.OtOpcUa.Client.Shared` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Client.Shared-001, Client.Shared-002, Client.Shared-003 |
| 2 | OtOpcUa conventions | Client.Shared-004 |
| 3 | Concurrency & thread safety | Client.Shared-005, Client.Shared-006, Client.Shared-007 |
| 4 | Error handling & resilience | Client.Shared-008, Client.Shared-009 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Client.Shared-010 |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Client.Shared-011 |
| 10 | Documentation & comments | Client.Shared-009 |
## Findings
### Client.Shared-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientService.cs:552` |
| Status | Resolved |
**Description:** `OnAlarmEventNotification` returns early when `eventFields.EventFields` has fewer than 6 entries. The event filter built by `CreateAlarmEventFilter` always registers 13 select clauses, so a conforming server returns 13 fields. The `< 6` threshold is arbitrary and inconsistent: SourceName is index 2 and Severity index 5, but ConditionName (6), Retain (7), Acked/Active (8/9) and ConditionNodeId (12) are all needed for a usable alarm and are each guarded individually with `fields.Count > N`. A non-conforming server that returns a truncated list (or fewer fields than requested) makes the `< 6` early return silently drop the entire notification, including the EventId/SourceName/Severity that are present.
**Recommendation:** Drop the `< 6` early return (or lower it to `< 1`) and rely on the existing per-index `fields.Count > N` guards, which already default missing fields safely. If a hard floor is wanted, document why 6 and not 13.
**Resolution:** Resolved 2026-05-22 — lowered the early-return threshold to `< 1` (null or empty guard only); per-index `fields.Count > N` guards already default missing fields safely for all higher indices.
### Client.Shared-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientService.cs:351-355`, `OpcUaClientService.cs:373` |
| Status | Resolved |
**Description:** `GetRedundancyInfoAsync` performs unguarded unboxing casts on values read from the server: `(int)redundancySupportValue.Value` and `(byte)serviceLevelValue.Value`. Unlike the `ServerUriArray`/`ServerArray` reads below them, the `RedundancySupport` and `ServiceLevel` reads are not wrapped in try/catch. If the server returns the value boxed as a different numeric type than expected (e.g. `ServiceLevel` boxed as `int` instead of `byte`), or returns a null `Value` on a `Bad` DataValue, the cast throws `InvalidCastException`/`NullReferenceException` and the whole call fails instead of returning a sensible default.
**Recommendation:** Wrap the `RedundancySupport` and `ServiceLevel` reads in the same defensive pattern used for the array reads, using `Convert.ToInt32`/`Convert.ToByte` on the boxed value and falling back to `None`/`0` when the read status is bad or the value is null.
**Resolution:** Resolved 2026-05-22 — replaced direct casts with `StatusCode.IsGood` guard + `Convert.ToInt32`/`Convert.ToByte` coercion; falls back to `None`/`0` when status is bad or value is null.
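The difference between the direct unboxing cast and the `Convert`-based coercion can be shown in isolation. `ToByteOrDefault` is a hypothetical helper, and the `statusIsGood` flag stands in for the `StatusCode.IsGood` check on the OPC UA `DataValue`.

```csharp
using System;

// Hypothetical helper: coerce a boxed numeric value to byte, falling back
// when the read status is bad, the value is null, or it cannot be converted.
static byte ToByteOrDefault(object boxed, bool statusIsGood, byte fallback = 0)
{
    if (!statusIsGood || boxed == null) return fallback;
    try { return Convert.ToByte(boxed); }
    catch (Exception ex) when (ex is InvalidCastException
                               || ex is FormatException
                               || ex is OverflowException)
    {
        return fallback;
    }
}

object serviceLevel = 200; // boxed as int, not byte

// Unboxing to a type other than the boxed one throws InvalidCastException.
bool directCastThrew = false;
try { _ = (byte)serviceLevel; }
catch (InvalidCastException) { directCastThrew = true; }

byte coerced = ToByteOrDefault(serviceLevel, statusIsGood: true);
byte missing = ToByteOrDefault(null, statusIsGood: false);

Console.WriteLine($"direct cast threw: {directCastThrew}, coerced: {coerced}, missing: {missing}");
```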
### Client.Shared-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `Adapters/DefaultSessionAdapter.cs:76`, `Adapters/DefaultSessionAdapter.cs:273` |
| Status | Open |
**Description:** `WriteValueAsync` returns `response.Results[0]` and `CallMethodAsync` reads `result.Results[0]` without first checking the `Results` collection is non-empty. A malformed or service-level-faulted response (empty `Results` alongside a service fault) produces an out-of-range indexing exception (`ArgumentOutOfRangeException` for `List<T>`-based collections) rather than a meaningful OPC UA `StatusCode` or `ServiceResultException`.
**Recommendation:** Guard both accesses — throw `ServiceResultException` with the response's `ResponseHeader.ServiceResult` (or `BadUnexpectedError`) when `Results` is empty.
**Resolution:** _(open)_
### Client.Shared-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `Adapters/DefaultSessionAdapter.cs:228`, `Adapters/DefaultSessionAdapter.cs:121`, `Adapters/DefaultSessionAdapter.cs:172` |
| Status | Open |
**Description:** `CloseAsync`, `HistoryReadRawAsync`, and `HistoryReadAggregateAsync` are declared `async Task` but call the synchronous `Session.Close()` / `Session.HistoryRead(...)` APIs and contain no `await`. The history methods run a blocking synchronous service round-trip on the caller's thread; for the UI this blocks the dispatcher thread. The async signature misleads callers, and the `CancellationToken` parameter is ignored on these paths.
**Recommendation:** Use the stack's async overloads (`Session.HistoryReadAsync`, `Session.CloseAsync`) where available, or wrap the synchronous calls in `Task.Run`, so the methods are genuinely asynchronous and honor the cancellation token.
**Resolution:** _(open)_
### Client.Shared-005
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientService.cs:19`, `OpcUaClientService.cs:226-249`, `OpcUaClientService.cs:499-521` |
| Status | Resolved |
**Description:** `_activeDataSubscriptions` is a plain `Dictionary` mutated from at least three thread contexts with no synchronization: the caller thread (`SubscribeAsync`/`UnsubscribeAsync`), the keep-alive callback thread (`HandleKeepAliveFailureAsync` -> `ReplaySubscriptionsAsync`, invoked fire-and-forget from the OPC UA `KeepAlive` event), and `DisconnectAsync`. Concurrent `Add`/`Remove`/`Clear`/enumeration on a non-thread-safe `Dictionary` can corrupt its internal buckets, throw `InvalidOperationException`, or lose entries. A failover firing while the UI calls `SubscribeAsync` is a realistic trigger. The `_activeAlarmSubscription` nullable tuple has the same exposure.
**Recommendation:** Guard all access to `_activeDataSubscriptions` / `_activeAlarmSubscription` (and the `_session`/`_dataSubscription`/`_alarmSubscription` fields) with a single lock, or move subscription bookkeeping behind a `ConcurrentDictionary` plus a lock for the multi-field failover transition.
**Resolution:** Resolved 2026-05-22 — added a dedicated `_subscriptionLock` and wrapped every read/write of `_activeDataSubscriptions` and `_activeAlarmSubscription` (in Subscribe/Unsubscribe[Alarms]Async, Disconnect, Dispose, and the snapshot/clear/re-record steps of ReplaySubscriptionsAsync) inside it; awaited adapter calls run outside the lock to avoid holding it across I/O.
### Client.Shared-006
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientService.cs:97-100`, `OpcUaClientService.cs:432-497` |
| Status | Resolved |
**Description:** `HandleKeepAliveFailureAsync` is launched fire-and-forget (`_ = HandleKeepAliveFailureAsync()`) from every bad keep-alive callback. The only guard against re-entry is the non-atomic check `if (_state == Reconnecting || _state == Disconnected) return;` at the top. Between that read and the `TransitionState(Reconnecting, ...)` write a few lines later, a second keep-alive failure (the SDK raises `KeepAlive` repeatedly while a session is down) can pass the same guard, and two failover loops run concurrently — each disposing `_session`, nulling subscription fields, and racing to assign a new `_session`. The session created by the loser leaks, and `ReplaySubscriptionsAsync` can run twice creating duplicate monitored items.
**Recommendation:** Serialize failover with an `Interlocked.CompareExchange` flag or a `SemaphoreSlim(1,1)` so only one failover loop runs at a time; subsequent keep-alive failures during an in-flight failover should be ignored. Make the state transition atomic with the re-entry guard.
**Resolution:** Resolved 2026-05-22 — `HandleKeepAliveFailureAsync` now claims an atomic `_failoverInProgress` slot via `Interlocked.CompareExchange(ref _failoverInProgress, 1, 0)`; a re-entrant bad keep-alive sees `1` and returns immediately, so only one failover loop runs. The loop body moved to `RunFailoverAsync`, wrapped in try/finally that resets the flag with `Interlocked.Exchange`.
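A stripped-down sketch of the `Interlocked.CompareExchange` guard described in the resolution. The variable and method names mirror the finding, but the body is a stand-in: a `TaskCompletionSource` plays the role of the reconnect work so the overlap is deterministic.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

int failoverInProgress = 0; // 0 = idle, 1 = a failover loop is running
int failoverRuns = 0;
var reconnectGate = new TaskCompletionSource<bool>(); // stand-in for reconnect I/O

async Task HandleKeepAliveFailureAsync()
{
    // Atomically claim the slot; any caller that sees a non-zero value backs off.
    if (Interlocked.CompareExchange(ref failoverInProgress, 1, 0) != 0)
        return;

    try
    {
        Interlocked.Increment(ref failoverRuns);
        await reconnectGate.Task; // the real loop would reconnect and replay here
    }
    finally
    {
        Interlocked.Exchange(ref failoverInProgress, 0);
    }
}

// Five bad keep-alives arrive while the first failover is still in flight.
var calls = new Task[5];
for (var i = 0; i < calls.Length; i++) calls[i] = HandleKeepAliveFailureAsync();

reconnectGate.SetResult(true);
await Task.WhenAll(calls);

Console.WriteLine($"failover bodies executed: {failoverRuns}");
```

Only the first call runs the body; the other four return at the guard, and the `finally` releases the slot so a later failure can start a new failover.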
### Client.Shared-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientService.cs:581-622` |
| Status | Resolved |
**Description:** In the alarm fallback path, the `Task.Run` closure mutates the captured locals `activeState`, `ackedState`, `time`, and `capturedMessage`, then reads them when invoking `AlarmEvent`. Because the captured `_session` reference can be replaced by a concurrent failover (see Client.Shared-006), the supplemental `ReadValueAsync` calls may run against a session being disposed, throwing `ObjectDisposedException` — caught by the bare `catch`, after which the alarm is delivered with default (false/MinValue) states, silently misreporting it as inactive/unacknowledged. The notification callback also has no back-pressure: a burst of alarm events spawns an unbounded number of `Task.Run` continuations each doing 3-4 server round-trips.
**Recommendation:** Capture the session under the same lock proposed in Client.Shared-005 and skip the supplemental read if the session has changed or is disposed. Consider batching the four sequential `ReadValueAsync` calls into one `Read` request.
**Resolution:** Resolved 2026-05-22 — added a `ReferenceEquals(session, _session)` guard at the top of the `Task.Run` body to skip reads if the session was replaced by failover; separated `ObjectDisposedException` from the general catch to drop rather than deliver the stale alarm.
### Client.Shared-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `OpcUaClientService.cs:170-180`, `Helpers/ValueConverter.cs:15-31` |
| Status | Resolved |
**Description:** `WriteValueAsync` coerces a string input to the target type by reading the node's current value and inferring the type from `currentDataValue.Value`. When the node has never been written, or the read returns a `Bad` status with a null `Value`, `ValueConverter.ConvertValue` falls through to the `_ => rawValue` default and writes a raw `string` into, for example, an `Int32` node — the server then rejects it with `BadTypeMismatch`, surfacing as a confusing failure unrelated to the operator's input. Separately, `ConvertValue` uses `bool.Parse`, which accepts only `true`/`false` — operator input of `1`/`0` throws `FormatException` that propagates raw to the caller. The read-before-write also doubles the round-trip cost of every string write.
**Recommendation:** Inspect `currentDataValue.StatusCode` before trusting `Value`; when the type cannot be inferred, surface a clear error rather than writing a mistyped value. Make boolean parsing accept `1`/`0`/`yes`/`no`, and wrap parse failures in a descriptive exception naming the node and target type.
**Resolution:** Resolved 2026-05-22 — `WriteValueAsync` now checks `StatusCode.IsGood` and non-null `Value` before calling `ConvertValue`, throwing a descriptive `InvalidOperationException` on bad reads; `ValueConverter` now uses a `ParseBool` helper accepting `true/false/1/0/yes/no` (case-insensitive) and wraps all parse/overflow failures in a `FormatException` with the target type and input value in the message.
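A lenient boolean parser along the lines the resolution describes might look like this. The accepted spellings come from the resolution; the body and the error message are assumptions, not the actual `ValueConverter.ParseBool`.

```csharp
using System;

// Assumed shape of a ParseBool helper: case-insensitive, accepts
// true/false, 1/0 and yes/no, and fails with a descriptive message.
static bool ParseBool(string input)
{
    switch (input.Trim().ToLowerInvariant())
    {
        case "true":
        case "1":
        case "yes":
            return true;
        case "false":
        case "0":
        case "no":
            return false;
        default:
            throw new FormatException(
                $"Cannot convert '{input}' to Boolean; expected true/false, 1/0, or yes/no.");
    }
}

bool fromYes = ParseBool("YES");
bool fromZero = ParseBool("0");

bool rejected = false;
try { ParseBool("maybe"); }
catch (FormatException) { rejected = true; }

Console.WriteLine($"YES -> {fromYes}, 0 -> {fromZero}, 'maybe' rejected: {rejected}");
```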
### Client.Shared-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience / Documentation & comments |
| Location | `OpcUaClientService.cs:302-322` |
| Status | Open |
**Description:** `AcknowledgeAlarmAsync` is typed `Task<StatusCode>` and its XML doc implies the returned code reports the ack outcome, but the method unconditionally returns `StatusCodes.Good`. The actual failure path is `DefaultSessionAdapter.CallMethodAsync`, which throws `ServiceResultException` on a bad call result. A failed acknowledgment therefore never returns a bad `StatusCode`; it throws, and the `StatusCode` return value is dead. Callers writing `if (StatusCode.IsBad(result))` will never see a bad result and will not catch the exception.
**Recommendation:** Either change the return type to `Task` (and let exceptions signal failure), or catch `ServiceResultException` in `AcknowledgeAlarmAsync` and return its `StatusCode`. Update the XML doc to match whichever is chosen.
**Resolution:** _(open)_
### Client.Shared-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `Models/ConnectionSettings.cs:48`, `OpcUaClientService.cs:408-417` |
| Status | Open |
**Description:** `ConnectionSettings.CertificateStorePath` is initialized to `ClientStoragePaths.GetPkiPath()` as a property initializer, so every `ConnectionSettings` instantiation runs `Environment.GetFolderPath` + `Path.Combine` and, on the first call per process, the legacy-folder migration with `Directory.Exists`/`Directory.Move` filesystem IO. `ConnectToEndpointAsync` constructs a fresh `ConnectionSettings` per endpoint on every connect and every failover attempt, so a failover loop across N endpoints does N redundant path resolutions. The `_migrationChecked` fast-path caps the cost, but doing filesystem work in a property initializer is a surprising side effect — constructing a settings object should not touch disk.
**Recommendation:** Make `CertificateStorePath` default to `string.Empty` and resolve `ClientStoragePaths.GetPkiPath()` lazily inside `DefaultApplicationConfigurationFactory.CreateAsync` only when the path is unset.
**Resolution:** _(open)_
### Client.Shared-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Client/ZB.MOM.WW.OtOpcUa.Client.Shared.Tests/OpcUaClientServiceTests.cs` |
| Status | Open |
**Description:** The test suite is solid for the happy paths, connection lifecycle, and single-failover behavior. Gaps relative to the findings above: (a) no test exercises concurrent `SubscribeAsync`/failover to expose the `_activeDataSubscriptions` race (Client.Shared-005) or re-entrant keep-alive failures (Client.Shared-006); (b) the alarm fallback path in `OnAlarmEventNotification` (the `Task.Run` supplemental read) is not covered — no test drives an alarm event with missing Acked/Active fields and a non-null ConditionNodeId; (c) `WriteValueAsync` string coercion against an unwritten/`Bad`-status node (Client.Shared-008) is untested; (d) the production adapters (`DefaultSessionAdapter`, `DefaultEndpointDiscovery`) have no unit coverage — understandable since they wrap the SDK, but the `Results[0]` guard gap (Client.Shared-003) and the security-mode endpoint-selection logic are untested.
**Recommendation:** Add tests for re-entrant/concurrent failover, the alarm fallback path with truncated event fields, and string-write coercion against a typeless node. Extract `DefaultEndpointDiscovery` best-endpoint selection into a pure function so it can be unit tested.
**Resolution:** _(open)_
@@ -0,0 +1,296 @@
# Code Review - Client.UI
| Field | Value |
|---|---|
| Module | `src/Client/ZB.MOM.WW.OtOpcUa.Client.UI` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 6 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Client.UI-001, Client.UI-002 |
| 2 | OtOpcUa conventions | Client.UI-003, Client.UI-004 |
| 3 | Concurrency & thread safety | Client.UI-005 |
| 4 | Error handling & resilience | Client.UI-006 |
| 5 | Security | Client.UI-007 |
| 6 | Performance & resource management | Client.UI-008 |
| 7 | Design-document adherence | Client.UI-009 |
| 8 | Code organization & conventions | Client.UI-010 |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | Client.UI-011 |
## Findings
### Client.UI-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `ViewModels/HistoryViewModel.cs:76`, `ViewModels/HistoryViewModel.cs:77` |
| Status | Resolved |
**Description:** `ReadHistoryAsync` runs as a `RelayCommand` body, which is invoked
on the UI thread, so the bare `IsLoading = true` at line 76 happens to land on the
right thread today. But `Results.Clear()` on the very next line is wrapped in
`_dispatcher.Post(...)`, and the `finally` block also sets `IsLoading` through the
dispatcher (`_dispatcher.Post(() => IsLoading = false)` at line 121). The two
`IsLoading` writes use inconsistent dispatch paths. Because the `Post` in the
`finally` is queued behind the result-population `Post` while the synchronous
line-76 write is not, the loading-indicator updates are not guaranteed to be
ordered relative to the grid population, and the pattern is fragile if the command
is ever invoked off the UI thread (a future caller or test harness).
**Recommendation:** Route the line-76 `IsLoading = true` through `_dispatcher.Post`
for consistency with the rest of the method, or set both `IsLoading` writes
synchronously and only dispatch the `ObservableCollection` mutations.
**Resolution:** Resolved 2026-05-22 — Routed the `IsLoading = true` write through `_dispatcher.Post` to make both `IsLoading` assignments consistent with all other UI state mutations in the method.
### Client.UI-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `ViewModels/MainWindowViewModel.cs:255`, `ViewModels/MainWindowViewModel.cs:333` |
| Status | Resolved |
**Description:** `ConnectAsync` calls `await BrowseTree.LoadRootsAsync()` and
`ViewHistoryForSelectedNode` calls `History.SelectedNodeId = ...` by dereferencing
the nullable child view-model properties (`BrowseTreeViewModel?`,
`HistoryViewModel?`) without a null check or `!` operator, while the surrounding
code (lines 258-266) does guard `Subscriptions` and `Alarms` with `!= null`.
`InitializeService()` does assign all five child VMs before these lines run, so a
real NRE is unlikely on the current call path, but the inconsistent guarding masks
intent and the nullable-reference compiler flow analysis cannot prove
`InitializeService()` set the field, so this either produces a CS8602 warning that
is being ignored or relies on warnings being suppressed. A future refactor that
makes `InitializeService()` conditionally skip a VM would introduce a silent crash.
**Recommendation:** Make the guarding consistent: either guard all five child VMs
uniformly, or have `InitializeService()` return non-null references the caller uses
directly so the compiler can prove non-nullness.
**Resolution:** Resolved 2026-05-22 — Added `if (BrowseTree != null)` and `if (History != null)` guards at both dereference sites to match the guarding style already used for `Subscriptions` and `Alarms`.
### Client.UI-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `ZB.MOM.WW.OtOpcUa.Client.UI.csproj:20-21`, `Program.cs:14-20` |
| Status | Open |
**Description:** The csproj references `Serilog` and `Serilog.Sinks.Console`, and
`docs/Client.UI.md` lists Serilog as the logging technology, but no source file in
the module uses Serilog. `Program.BuildAvaloniaApp()` uses Avalonia's
`LogToTrace()` and there is no logger configuration, no log calls, and no rolling
file sink. `CLAUDE.md` mandates "Serilog with rolling daily file sink" as the
logging library preference. The references are dead weight and the documented
logging behaviour does not exist.
**Recommendation:** Either wire up Serilog (a console sink at minimum, ideally the
rolling daily file sink the project standard calls for) and route Avalonia logging
through it, or drop the unused `Serilog` package references and correct
`docs/Client.UI.md`.
**Resolution:** _(open)_
### Client.UI-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `Views/MainWindow.axaml.cs:125-138` |
| Status | Open |
**Description:** `OnBrowseCertPathClicked` uses `OpenFolderDialog`, which is
obsolete in Avalonia 11.x (the version pinned in the csproj). The supported
replacement is the `StorageProvider` API
(`StorageProvider.OpenFolderPickerAsync`). Using the obsolete type produces a
compiler obsoletion warning and the API is scheduled for removal in a future
Avalonia major version.
**Recommendation:** Migrate the folder chooser to
`TopLevel.GetTopLevel(this).StorageProvider.OpenFolderPickerAsync(...)`.
**Resolution:** _(open)_
### Client.UI-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `ViewModels/MainWindowViewModel.cs:286-304`, `ViewModels/MainWindowViewModel.cs:155-189` |
| Status | Resolved |
**Description:** `SubscriptionsViewModel` and `AlarmsViewModel` attach handlers to
the long-lived `_service` events (`DataChanged`, `AlarmEvent`) in their
constructors and detach them only via `Teardown()`. `Teardown()` is called from
`DisconnectAsync` (operator-initiated disconnect), but it is NOT called from the
`OnConnectionStateChanged` partial method that handles the `Disconnected` state;
that path only calls `Clear()`. When the connection drops server-side (session
lost, network failure) the service raises `ConnectionStateChanged(Disconnected)`
without `DisconnectAsync` ever running, so the alarm/data event handlers remain
attached to a dead service. They are not re-attached on the next connect because
`InitializeService()` early-returns when `_service != null` and the same VM
instances are reused, so there is no handler leak per reconnect, but a late or
buffered `DataChanged`/`AlarmEvent` callback fired during teardown will still mutate
`ObservableCollection`s, and the asymmetry between the two disconnect paths is a
latent correctness hazard.
**Recommendation:** Make the disconnect handling symmetric: call
`Subscriptions?.Teardown()` / `Alarms?.Teardown()` (or otherwise quiesce the event
handlers) from the `Disconnected` branch of the `OnConnectionStateChanged` partial
method, not only from `DisconnectAsync`.
**Resolution:** Resolved 2026-05-22 — Added `Teardown()` calls to the `Disconnected` branch and added `Reattach()` methods (idempotent remove+add) called from the `Connected` branch to restore handlers after a server-side drop + reconnect.
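The idempotent remove+add pattern described in the resolution can be sketched as follows. This is a minimal illustration, not the module's code: `FakeService` and `SubscriptionsSketch` are hypothetical stand-ins, and only the `Reattach()`/`Teardown()` names come from the finding.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the long-lived service; only the event shape matters here.
public sealed class FakeService
{
    public event EventHandler<string>? DataChanged;
    public void Raise(string value) => DataChanged?.Invoke(this, value);
}

// Hypothetical view model showing symmetric attach/detach.
public sealed class SubscriptionsSketch
{
    private readonly FakeService _service;
    public List<string> Received { get; } = new();

    public SubscriptionsSketch(FakeService service)
    {
        _service = service;
        Reattach();
    }

    // Remove-then-add: removing a handler that is not attached is a no-op,
    // so repeated calls leave exactly one subscription.
    public void Reattach()
    {
        _service.DataChanged -= OnDataChanged;
        _service.DataChanged += OnDataChanged;
    }

    // Detach so late callbacks no longer reach this view model.
    public void Teardown() => _service.DataChanged -= OnDataChanged;

    private void OnDataChanged(object? sender, string value) => Received.Add(value);
}
```

Calling `Reattach()` from the `Connected` branch and `Teardown()` from the `Disconnected` branch keeps both disconnect paths symmetric without tracking an "is attached" flag.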
### Client.UI-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `ViewModels/MainWindowViewModel.cs:244-252`, `ViewModels/AlarmsViewModel.cs:88-112`, `ViewModels/SubscriptionsViewModel.cs:79-94` |
| Status | Open |
**Description:** Many catch blocks swallow exceptions silently with an empty body
and only a comment (`// Redundancy info not available`, `// Subscribe failed`,
`// Subscription failed; no item added`, and others). When a subscribe,
alarm-subscribe, or redundancy read fails, the operator gets no feedback at all: no
status message, no log entry (compounded by Client.UI-003: there is no logger). A
failed `AddSubscriptionAsync` simply leaves the node un-subscribed with no
indication why. This makes field diagnosis of a misconfigured server or a
permission denial effectively impossible from the UI.
**Recommendation:** Surface failures to the operator: at minimum set a status
message or write the exception to a log. Distinguish "feature not supported"
(condition refresh) from "operation failed" so genuine errors are not hidden.
**Resolution:** _(open)_
### Client.UI-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `Services/UserSettings.cs:22-23`, `Services/JsonSettingsService.cs:38-50`, `ViewModels/MainWindowViewModel.cs:393-408` |
| Status | Resolved |
**Description:** The OPC UA `UserName`-token password is persisted in cleartext.
`UserSettings.Password` is a plain `string`, `JsonSettingsService.Save` serializes
the whole settings object to `settings.json` under `LocalApplicationData`, and
`SaveSettings()` is invoked after every successful connect and on window close. Any
process or user able to read the current user's profile directory can recover the
server credentials. `docs/Client.UI.md` documents that "All connection parameters"
are persisted but does not flag the password among them.
**Recommendation:** Do not persist the password in cleartext. Options: omit it from
the persisted model entirely (re-prompt each launch); encrypt it at rest with
`ProtectedData` (DPAPI) on Windows or an equivalent OS keystore on other platforms;
or store only a non-reversible reference. At minimum, document the cleartext
storage as a known limitation.
**Resolution:** Resolved 2026-05-22 — Removed `Password` from `UserSettings` and stopped writing/reading it in `SaveSettings`/`LoadSettings`; the operator is re-prompted each launch.
### Client.UI-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `ViewModels/MainWindowViewModel.cs:18`, `ViewModels/MainWindowViewModel.cs:125-148`, `App.axaml.cs:18-32` |
| Status | Resolved |
**Description:** `IOpcUaClientService` is declared `IDisposable`
(`IOpcUaClientService.cs:10`), and the concrete service owns an OPC UA session plus
SDK resources. `MainWindowViewModel` holds `_service` for the lifetime of the app
but never calls `_service.Dispose()`: not on window close, not on disconnect, not
anywhere. `DisconnectAsync` calls `DisconnectAsync()` on the service but leaves the
object undisposed, and there is no `IDisposable` implementation on
`MainWindowViewModel` itself. The OPC UA SDK session, certificate validator, and
any background reconnect timers are leaked until process exit. The
`ConnectionStateChanged` handler attached at line 130 is also never detached.
**Recommendation:** Make `MainWindowViewModel` implement `IDisposable`, detach the
`ConnectionStateChanged` handler, and dispose `_service` from `MainWindow.OnClosing`
(alongside the existing `SaveSettings()` call).
**Resolution:** Resolved 2026-05-22 — Added `IDisposable` to `MainWindowViewModel` with a `Dispose()` that detaches `ConnectionStateChanged`, calls `Teardown()` on child VMs, and calls `_service.Dispose()`; wired `Dispose()` into `MainWindow.OnClosing`.
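A minimal sketch of the disposal shape the resolution describes, under stated assumptions: `IClientServiceSketch`, `CountingService`, and `MainWindowViewModelSketch` are hypothetical types invented for illustration, and the real view model also tears down its child view models, which is omitted here.

```csharp
using System;

// Hypothetical, reduced version of the service contract.
public interface IClientServiceSketch : IDisposable
{
    event EventHandler? ConnectionStateChanged;
}

// Hypothetical fake that records how often it was disposed.
public sealed class CountingService : IClientServiceSketch
{
    public event EventHandler? ConnectionStateChanged;
    public int DisposeCount { get; private set; }
    public bool HasSubscribers => ConnectionStateChanged != null;
    public void Dispose() => DisposeCount++;
}

public sealed class MainWindowViewModelSketch : IDisposable
{
    private readonly IClientServiceSketch _service;
    private bool _disposed;

    public MainWindowViewModelSketch(IClientServiceSketch service)
    {
        _service = service;
        _service.ConnectionStateChanged += OnConnectionStateChanged;
    }

    public void Dispose()
    {
        if (_disposed) return; // a second call is a no-op
        _disposed = true;
        _service.ConnectionStateChanged -= OnConnectionStateChanged; // detach first
        _service.Dispose();                                          // then release the service
    }

    private void OnConnectionStateChanged(object? sender, EventArgs e) { }
}
```

Detaching the handler before disposing the service means a state-change event raised during disposal cannot call back into a half-disposed view model.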
### Client.UI-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `ViewModels/HistoryViewModel.cs:44-54` |
| Status | Open |
**Description:** `HistoryViewModel.AggregateTypes` exposes eight entries: `null`
(Raw) plus Average, Minimum, Maximum, Count, Start, End, and `StandardDeviation`.
`docs/Client.UI.md` ("Query Options" table) lists only "Raw (default), Average,
Minimum, Maximum, Count, Start, End" and omits `StandardDeviation`. The doc is
stale relative to the code.
**Recommendation:** Update the "Aggregate" row in `docs/Client.UI.md` to include
Standard Deviation.
**Resolution:** _(open)_
### Client.UI-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `Controls/DateTimeRangePicker.axaml.cs:33-37`, `Controls/DateTimeRangePicker.axaml.cs:70-80` |
| Status | Open |
**Description:** `DateTimeRangePicker` declares `MinDateTimeProperty` /
`MaxDateTimeProperty` styled properties with public CLR accessors, but neither is
read anywhere in the control. `TryParseDateTime`, `OnStartLostFocus`, and
`OnEndLostFocus` never clamp or reject input against the min/max bounds, and no
XAML binds them. The properties are dead API surface that implies a range
constraint the control does not enforce.
**Recommendation:** Either implement min/max validation in the `LostFocus` parse
path (turn out-of-range input red, as invalid input already is) or remove the two
unused styled properties.
**Resolution:** _(open)_
### Client.UI-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `Views/MainWindow.axaml:81`, `Services/JsonSettingsService.cs:11-15` |
| Status | Open |
**Description:** The certificate-store-path `TextBox` watermark reads
`(default: AppData/LmxOpcUaClient/pki)`, referencing the legacy pre-task-#208
folder name. Per `CLAUDE.md` / `docs/Client.UI.md` the canonical path is now
`{LocalAppData}/OtOpcUaClient/`, and `ClientStoragePaths` migrates the old
`LmxOpcUaClient/` folder forward. The watermark shows operators an obsolete path
that no longer matches where settings and the PKI store actually live.
**Recommendation:** Update the watermark to reference `OtOpcUaClient/pki`, or bind
it to `ClientStoragePaths.GetPkiPath()` so it cannot drift again.
**Resolution:** _(open)_
@@ -0,0 +1,192 @@
# Code Review — Configuration
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Configuration-001, Configuration-002, Configuration-003 |
| 2 | OtOpcUa conventions | Configuration-004 |
| 3 | Concurrency & thread safety | Configuration-005 |
| 4 | Error handling & resilience | Configuration-006, Configuration-007 |
| 5 | Security | Configuration-008, Configuration-009, Configuration-010 |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Configuration-011 |
| 10 | Documentation & comments | No issues found |
## Findings
### Configuration-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/20260417215224_StoredProcedures.cs:282` |
| Status | Resolved |
**Description:** `sp_PublishGeneration` invokes `EXEC dbo.sp_ValidateDraft @DraftGenerationId = @DraftGenerationId;` and then continues unconditionally to the reservation MERGE and the `Status='Published'` update. `sp_ValidateDraft` signals every failure with `RAISERROR(..., 16, 1)` followed by `RETURN`. A severity-16 `RAISERROR` is not a batch-aborting error and `SET XACT_ABORT ON` does not abort the transaction for it, so control returns to `sp_PublishGeneration`, which publishes the draft even though validation rejected it (cross-cluster namespace binding, dangling tag FKs, duplicate external identifiers, EquipmentUuid immutability all pass through). Pre-publish validation is effectively bypassed.
**Recommendation:** Wrap the `EXEC dbo.sp_ValidateDraft` in `BEGIN TRY ... END TRY BEGIN CATCH ROLLBACK; THROW; END CATCH` so the validation `RAISERROR` propagates and aborts the publish, or have `sp_ValidateDraft` return a result-set/output parameter that `sp_PublishGeneration` inspects and explicitly rolls back on. Add a regression test that publishes a draft with a known violation and asserts it stays `Draft`.
**Resolution:** Resolved 2026-05-22 — wrapped the `EXEC dbo.sp_ValidateDraft` call in `sp_PublishGeneration` in a `BEGIN TRY ... BEGIN CATCH ROLLBACK; THROW; END CATCH` block so a validation `RAISERROR` rolls back the publish transaction and re-raises instead of being silently ignored; added DB-backed regression test `Publish_aborts_when_ValidateDraft_rejects_the_draft`.
### Configuration-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/20260417215224_StoredProcedures.cs:325` |
| Status | Resolved |
**Description:** `sp_RollbackToGeneration` opens its own `BEGIN TRANSACTION`, clones rows into a new Draft, then `EXEC dbo.sp_PublishGeneration`, which itself runs `BEGIN TRANSACTION` (nesting `@@TRANCOUNT` to 2) and on its failure paths executes a bare `ROLLBACK`. A bare `ROLLBACK` rolls back to the outermost transaction and sets `@@TRANCOUNT` to 0, so when `sp_RollbackToGeneration` later reaches its own `COMMIT` it runs with no open transaction and raises error 3902. The rollback clone is silently discarded and the caller sees a confusing secondary error instead of the real publish failure.
**Recommendation:** Make `sp_PublishGeneration` transaction-nesting aware: capture `@@TRANCOUNT` on entry, only `BEGIN TRANSACTION` when zero (otherwise `SAVE TRANSACTION`), and only `COMMIT`/`ROLLBACK` the level it owns. Alternatively factor the publish body into an inner proc that assumes an ambient transaction.
**Resolution:** Resolved 2026-05-22 — made `sp_PublishGeneration` transaction-nesting aware: captures `@@TRANCOUNT` on entry, issues `BEGIN TRANSACTION` when zero or `SAVE TRANSACTION sp_PublishGeneration` when nested, and uses `ROLLBACK TRANSACTION sp_PublishGeneration` (savepoint rollback) on all failure paths in the nested case so the caller's outer transaction is not wiped; also wrapped `EXEC dbo.sp_ValidateDraft` in `BEGIN TRY ... END CATCH` so validation errors propagate correctly.
### Configuration-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs:73` |
| Status | Resolved |
**Description:** `ValidatePathLength` computes path length with hard-coded constants — it always charges 64 chars for Enterprise+Site (`32 + 32 + ...`) regardless of the cluster's actual values. This over-rejects: a short Enterprise/Site is penalised by up to 64 unused chars, so a legitimately under-200-char path can fail `PathTooLong`. The check also silently `continue`s when an equipment's `UnsLineId`/`UnsAreaId` does not resolve, so an orphaned-line path is never length-checked.
**Recommendation:** Pass the actual `Enterprise` and `Site` strings into the validator (e.g. on `DraftSnapshot`, or as parameters alongside `ValidateClusterTopology`) and compute the real length. If the cluster row cannot be supplied, document the check as a conservative upper bound or emit a lower-severity warning rather than a hard error.
**Resolution:** Resolved 2026-05-22 — added nullable `Enterprise` and `Site` properties to `DraftSnapshot`; `ValidatePathLength` uses actual lengths when set and falls back to the conservative 32-char upper bound per segment with a comment explaining the trade-off; `DraftValidationService` now loads the cluster row and populates both properties; added `PathLength_uses_actual_Enterprise_Site_when_provided` and `PathLength_conservative_fallback_when_Enterprise_Site_absent` unit tests.
### Configuration-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Enums/NodePermissions.cs:8`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/OtOpcUaConfigDbContext.cs:417` |
| Status | Open |
**Description:** `NodePermissions` is declared `[Flags] enum ... : uint`, while its XML doc and `NodeAcl.PermissionFlags`' doc both say "stored as int", and `ConfigureNodeAcl` uses `HasConversion<int>()`, a `uint`-to-`int` conversion. Only bits 0-11 are used today, but the underlying-type/storage-type mismatch is a latent trap: a future bit-31 flag yields a `uint` value that overflows `int` and the conversion round-trip would corrupt it.
**Recommendation:** Change the enum underlying type to `int` (consistent with the docs and the conversion). No high bit is in use, so this is the smaller change.
**Resolution:** _(open)_
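The overflow hazard can be illustrated with plain C# casts. This sketch shows only the language-level behaviour of a `uint`-to-`int` cast for a bit-31 value. `PermissionsSketch` and its `HighBit` member are hypothetical, and how the EF Core value converter itself behaves is not shown and is not assumed here.

```csharp
using System;

[Flags]
public enum PermissionsSketch : uint
{
    None = 0,
    Read = 1u << 0,
    HighBit = 1u << 31, // hypothetical future flag occupying the top bit
}

public static class FlagStorageDemo
{
    // Reinterprets the bits without range checking; bit 31 becomes the sign bit.
    public static int ToStoredUnchecked(PermissionsSketch p) => unchecked((int)(uint)p);

    // Range-checked conversion; throws OverflowException above int.MaxValue.
    public static int ToStoredChecked(PermissionsSketch p) => checked((int)(uint)p);
}
```

With the enum declared as `int` instead, a bit-31 member would have to be written as a negative constant, which makes the hazard visible at the declaration rather than at storage time.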
### Configuration-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/LiteDbConfigCache.cs:50` |
| Status | Open |
**Description:** `PutAsync` performs a non-atomic find-then-insert/update. Two concurrent `PutAsync` calls for the same `(ClusterId, GenerationId)` can both observe `existing is null` and both `Insert`, producing two rows for one generation. The constructor's `EnsureIndex` calls are non-unique, so the storage layer does not prevent the duplicate, and `PruneOldGenerationsAsync`'s `keepLatest` accounting is then off.
**Recommendation:** Declare a unique index on `(ClusterId, GenerationId)` and treat the duplicate-key exception as an idempotent no-op, or guard `PutAsync` with an instance `SemaphoreSlim`/lock. Document the concurrency contract on `ILocalConfigCache`.
**Resolution:** _(open)_
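The `SemaphoreSlim` option from the recommendation could look roughly like this. It is a sketch under stated assumptions: `GuardedCacheSketch` is a hypothetical in-memory stand-in that uses a plain list in place of the non-unique LiteDB collection, and the `long` generation id and `string` payload types are guesses.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class GuardedCacheSketch
{
    // One writer at a time makes the find-then-insert/update sequence atomic.
    private readonly SemaphoreSlim _gate = new(1, 1);
    private readonly List<Row> _rows = new(); // stand-in for storage without a unique index

    public int RowCount => _rows.Count;

    public async Task PutAsync(string clusterId, long generationId, string payload,
                               CancellationToken ct = default)
    {
        await _gate.WaitAsync(ct);
        try
        {
            var existing = _rows.Find(r => r.ClusterId == clusterId && r.GenerationId == generationId);
            if (existing is null)
                _rows.Add(new Row(clusterId, generationId, payload));
            else
                existing.Payload = payload;
        }
        finally
        {
            _gate.Release();
        }
    }

    private sealed class Row
    {
        public Row(string clusterId, long generationId, string payload)
        {
            ClusterId = clusterId;
            GenerationId = generationId;
            Payload = payload;
        }

        public string ClusterId { get; }
        public long GenerationId { get; }
        public string Payload { get; set; }
    }
}
```

An in-process gate only protects a single cache instance; a unique index is the stronger guarantee if more than one instance or process can open the same file.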
### Configuration-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/ResilientConfigReader.cs:79` |
| Status | Resolved |
**Description:** The fallback `catch` filters on `ex is not OperationCanceledException`. A SQL command timeout surfaced by ADO.NET as a `TaskCanceledException` (derives from `OperationCanceledException`) is then treated as caller cancellation and propagates instead of falling back to the sealed cache — the opposite of the documented "fallback on any exception including timeout". The retry `ShouldHandle` predicate has the same shape, so command-timeout cancellations are also not retried consistently.
**Recommendation:** Distinguish caller cancellation from command-timeout cancellation explicitly: inspect `cancellationToken.IsCancellationRequested` to decide whether an `OperationCanceledException` is a genuine cancel (rethrow) or a timeout (fall back). Add unit tests for both a `TimeoutRejectedException` path and a command-timeout `TaskCanceledException` path asserting cache fallback occurs.
**Resolution:** Resolved 2026-05-22 — changed the fallback `catch` filter to `ex is not OperationCanceledException || !cancellationToken.IsCancellationRequested` so a command-timeout `TaskCanceledException` (caller token not cancelled) triggers cache fallback while genuine caller cancellation still propagates; changed the retry `ShouldHandle` predicate to `Handle<Exception>()` (handles all exceptions, relying on Polly's own cancellation-token check to stop retrying on genuine cancellation); added three unit tests: `CommandTimeout_TaskCanceledException_FallsBackToCache`, `PollyTimeout_TimeoutRejectedException_FallsBackToCache`, and `CallerCancellation_Propagates_NotFallback`.
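The filter from the resolution, isolated into a minimal sketch. `FallbackSketch.ReadAsync` and its `readFromCache` parameter are hypothetical names and the retry pipeline is omitted. Only the `when` clause mirrors the expression quoted above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FallbackSketch
{
    public static async Task<T> ReadAsync<T>(
        Func<CancellationToken, Task<T>> centralFetch,
        Func<T> readFromCache,
        CancellationToken cancellationToken)
    {
        try
        {
            return await centralFetch(cancellationToken);
        }
        // Fall back unless this is an OperationCanceledException AND the caller's
        // token was actually cancelled; a timeout-style cancellation still falls back.
        catch (Exception ex) when (ex is not OperationCanceledException
                                   || !cancellationToken.IsCancellationRequested)
        {
            return readFromCache();
        }
    }
}
```

The distinction rests entirely on the caller's token: a `TaskCanceledException` raised while that token is still live is treated as a failure of the central read, not as a request to stop.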
### Configuration-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Apply/GenerationApplier.cs:44` |
| Status | Open |
**Description:** `ApplyPass` wraps each callback in `catch (Exception ex)`. This swallows `OperationCanceledException` — a cancellation during a callback is recorded as just another entity error string and the applier keeps walking the remaining passes instead of stopping. It also masks fatal exceptions. The applier continues applying Added/Modified passes even after a Removed callback failed, leaving a partially-applied runtime state.
**Recommendation:** Rethrow `OperationCanceledException` rather than recording it as an entity error; call `ct.ThrowIfCancellationRequested()` between passes. Document or enforce whether a failed Removed pass should abort before the Added/Modified passes run.
**Resolution:** _(open)_
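One possible shape for the recommended behaviour, sketched with hypothetical names (`ApplyPassSketch.RunAsync` and the callback signature are assumptions, not the applier's real API): cancellation is checked before each callback and rethrown if it surfaces from one, while other exceptions are still recorded per entity.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ApplyPassSketch
{
    public static async Task<List<string>> RunAsync(
        IEnumerable<Func<CancellationToken, Task>> callbacks, CancellationToken ct)
    {
        var errors = new List<string>();
        foreach (var callback in callbacks)
        {
            ct.ThrowIfCancellationRequested(); // stop between callbacks once cancelled
            try
            {
                await callback(ct);
            }
            catch (OperationCanceledException)
            {
                throw; // cancellation ends the walk instead of becoming an error string
            }
            catch (Exception ex)
            {
                errors.Add(ex.Message); // ordinary failures are recorded and the walk continues
            }
        }
        return errors;
    }
}
```

Whether a failed Removed pass should abort the later passes is a separate policy decision that this sketch deliberately leaves open.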
### Configuration-008
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/20260417215224_StoredProcedures.cs:150`, `:373`, `:468` |
| Status | Resolved |
**Description:** Three stored procedures build `ConfigAuditLog.DetailsJson` by raw string concatenation of caller-supplied `nvarchar` parameters: `sp_RegisterNodeGenerationApplied` (`@Status`), `sp_RollbackToGeneration` (`@TargetGenerationId`), `sp_ReleaseExternalIdReservation` (`@Kind`, `@Value`). A value with a double-quote or backslash produces malformed JSON; combined with the `CK_ConfigAuditLog_DetailsJson_IsJson` check constraint, the `INSERT` fails the constraint and aborts the surrounding publish/rollback transaction (denial of operation). It is also a JSON-injection vector that can silently rewrite the audit record's shape.
**Recommendation:** Build the JSON with a safe constructor (`FOR JSON PATH, WITHOUT_ARRAY_WRAPPER` or `JSON_OBJECT(...)` on SQL Server 2022+) so values are properly escaped, or run each interpolated value through `STRING_ESCAPE(@Value, 'json')`. Add tests with quote/backslash-containing inputs.
**Resolution:** Resolved 2026-05-22 — routed every caller-supplied string interpolated into `DetailsJson` through `STRING_ESCAPE(@x, 'json')` (`@Status` in `sp_RegisterNodeGenerationApplied`; `@Kind`/`@Value` in `sp_ReleaseExternalIdReservation`) and emitted `sp_RollbackToGeneration`'s `@TargetGenerationId` as a bare JSON number via explicit `CONVERT(nvarchar(20), CONVERT(bigint, ...))`; added DB-backed regression tests `RegisterNodeGenerationApplied_escapes_quotes_in_audit_DetailsJson` and `ReleaseReservation_escapes_quotes_in_audit_DetailsJson` that round-trip quote/backslash inputs through `ISJSON`/`JSON_VALUE`.
### Configuration-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/DesignTimeDbContextFactory.cs:14` |
| Status | Resolved |
**Description:** `DefaultConnectionString` embeds a plaintext `sa` password with `User Id=sa` directly in source, checked into the repository. Although used only at design time (`dotnet ef`), a checked-in `sa` credential normalises committing DB passwords and, if live for the shared dev SQL Server, grants `sa` to anyone with repo access. `TrustServerCertificate=True` plus `Encrypt=False` additionally disables transport protection for that connection.
**Recommendation:** Drop the embedded credential. Fall back to integrated auth (`Trusted_Connection=True`) or fail fast with a message instructing the developer to set `OTOPCUA_CONFIG_CONNECTION`. Rotate the dev `sa` password if this value is live.
**Resolution:** Resolved 2026-05-22 — removed the embedded `sa` password and `DefaultConnectionString` constant entirely; `CreateDbContext` now throws `InvalidOperationException` with a clear setup message when `OTOPCUA_CONFIG_CONNECTION` is not set, rather than silently falling back to a hardcoded credential; added XML-doc example showing the recommended integrated-auth connection string.
### Configuration-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/ResilientConfigReader.cs:81` |
| Status | Open |
**Description:** On central-DB read failure the warning log records the full exception object. Callers pass arbitrary `centralFetch` delegates; if any delegate closes over a connection string, an exception thrown from it (or a `SqlException` carrying server/credential context) is logged verbatim. There is no scrubbing of connection-string fragments before logging, against the project's no-secret-logging rule.
**Recommendation:** Log `ex.GetType().Name` and `ex.Message` for SQL failures rather than the full exception, or run exception messages through a connection-string scrubber before they reach the log sink.
**Resolution:** _(open)_
### Configuration-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Apply/GenerationApplier.cs:7`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs:60` |
| Status | Open |
**Description:** The companion test project covers the cache, schema compliance, stored procedures, and `DraftValidator` well, but two flagged behaviours are not pinned: (a) `GenerationApplier` ordering/cancellation when a Removed callback fails — no test asserts the Added/Modified passes still run or that cancellation aborts; (b) `ValidatePathLength`'s constant 32+32 approximation — no test exercises a long Enterprise/Site. The publish-bypasses-validation bug (Configuration-001) is also untested against the live SQL fixture.
**Recommendation:** Add `GenerationApplierTests` cases for a throwing callback (assert error recorded, assert cancellation propagates) and a `DraftValidatorTests` path-length boundary case. Add a `StoredProceduresTests` case that publishes an invalid draft and asserts it stays `Draft`.
**Resolution:** _(open)_
@@ -0,0 +1,156 @@
# Code Review — Core.Abstractions
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.Abstractions-001, Core.Abstractions-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.Abstractions-003, Core.Abstractions-004 |
| 4 | Error handling & resilience | Core.Abstractions-005 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | Core.Abstractions-006 |
| 9 | Testing coverage | Core.Abstractions-007 |
| 10 | Documentation & comments | Core.Abstractions-008 |
## Findings
### Core.Abstractions-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/PollGroupEngine.cs:112` |
| Status | Resolved |
**Description:** `PollOnceAsync` detects a change with `!Equals(lastSeen?.Value, current.Value)`. `object.Equals` falls back to reference equality for reference types that do not override it — including `T[]` array values. The capability interfaces explicitly support 1-D array attributes (`DriverAttributeInfo.IsArray`, `ValueRank=1`), and a driver's batch reader produces a fresh array instance on every poll. As a result every poll of an array-valued tag is treated as a change, so `OnDataChange` fires on every tick regardless of whether the array contents actually changed. This produces spurious data-change notifications and unnecessary OPC UA monitored-item publishes for any poll-based driver (Modbus, S7, AB CIP, FOCAS) that exposes array tags.
**Recommendation:** Compare array values structurally — e.g. when both `lastSeen?.Value` and `current.Value` are arrays, compare with `StructuralComparisons.StructuralEqualityComparer.Equals` (or element-wise) — instead of relying on `object.Equals`. Add a test covering an array-valued tag whose contents are unchanged across polls.
**Resolution:** Resolved 2026-05-22 — introduced `ValuesAreDifferent` helper in `PollGroupEngine` that uses `StructuralComparisons.StructuralEqualityComparer` for `Array` values, falling back to `object.Equals` for scalars; added `Array_valued_tag_unchanged_contents_raises_only_once` and `Array_valued_tag_changed_contents_raises_event` tests.
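A minimal sketch of a structural change check along the lines the resolution describes. The helper name `ValuesAreDifferent` comes from the resolution text, but the body and the containing `ChangeDetectionSketch` class are assumptions and may differ from the actual implementation.

```csharp
using System;
using System.Collections;

public static class ChangeDetectionSketch
{
    public static bool ValuesAreDifferent(object? previous, object? current)
    {
        // Arrays do not override Equals, so compare them element by element.
        if (previous is Array a && current is Array b)
            return !StructuralComparisons.StructuralEqualityComparer.Equals(a, b);

        // Scalars, strings, and nulls: the static object.Equals is null-safe.
        return !Equals(previous, current);
    }
}
```

For example, two distinct `int[]` instances holding `1, 2, 3` compare as unchanged here, whereas `object.Equals` alone would report a change on every poll.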
### Core.Abstractions-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/PollGroupEngine.cs:105-109` |
| Status | Resolved |
**Description:** `PollOnceAsync` iterates `state.TagReferences` and indexes the reader's result with `snapshots[i]`, assuming the driver-supplied `_reader` delegate returns exactly one snapshot per input reference in input order. The contract is documented (ctor XML doc: "snapshots MUST be returned in the same order as the input references"), but it is never validated. A reader that returns a shorter list — a plausible driver bug, or a partial result on a backend error — throws `ArgumentOutOfRangeException` from the indexer. That exception escapes `PollOnceAsync`, is swallowed by the catch-all in `PollLoopAsync` (line 99), and the subscription then silently produces no further `OnDataChange` callbacks for the rest of its lifetime with no diagnostic. The failure mode is a permanently stalled subscription that looks healthy.
**Recommendation:** Validate `snapshots.Count == state.TagReferences.Count` at the top of `PollOnceAsync` and throw a descriptive exception (or skip the tick with a logged diagnostic) so the contract violation is visible rather than silently degrading. Consider surfacing repeated reader-contract failures through a callback the driver can route to its health surface.
**Resolution:** Resolved 2026-05-22 — added count-guard at the top of `PollOnceAsync` that throws `InvalidOperationException` with a descriptive message when the reader returns the wrong number of snapshots; added `Reader_short_result_list_raises_descriptive_exception_and_loop_continues` test verifying the loop survives contract violations and resumes delivering events once the reader recovers.
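A count guard of the kind described above might look like the sketch below. The `ReaderContract` class, the method name, and the message text are illustrative assumptions; only the exception type comes from the resolution.

```csharp
using System;
using System.Collections.Generic;

public static class ReaderContract
{
    // Hypothetical guard: fails loudly when the reader does not return
    // exactly one snapshot per input reference.
    public static void EnsureOneSnapshotPerReference<TSnapshot>(
        IReadOnlyList<string> tagReferences,
        IReadOnlyList<TSnapshot> snapshots)
    {
        if (snapshots.Count != tagReferences.Count)
        {
            throw new InvalidOperationException(
                $"Reader returned {snapshots.Count} snapshot(s) for {tagReferences.Count} tag reference(s); " +
                "exactly one snapshot per reference, in input order, is required.");
        }
    }
}
```

Calling the guard before the indexing loop turns an opaque `ArgumentOutOfRangeException` into a message that names the contract that was broken.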
### Core.Abstractions-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/PollGroupEngine.cs:64,121-130` |
| Status | Resolved |
**Description:** `Subscribe` starts the poll loop with a fire-and-forget `Task.Run` and keeps no reference to the returned `Task`. Neither `Unsubscribe` nor `DisposeAsync` awaits the loop's completion — they only cancel the `CancellationTokenSource` and dispose it. Two consequences:
1. After `DisposeAsync`/`Unsubscribe` returns, a poll already in flight inside `PollOnceAsync` can still complete and invoke the `_onChange` callback. A driver that disposes the engine during shutdown can therefore receive a data-change callback after it considers the engine torn down, with no way to know the engine is gone.
2. `Unsubscribe`/`DisposeAsync` call `state.Cts.Dispose()` immediately while the loop may still be inside `Task.Delay(state.Interval, ct)`. Cancelling-then-disposing a CTS while a consumer still touches the token can race; `Task.Delay` on a disposed token can throw `ObjectDisposedException` rather than `OperationCanceledException`, which the `Task.Delay` await in `PollLoopAsync` does not catch (it catches only `OperationCanceledException`).
**Recommendation:** Track each loop `Task` in `SubscriptionState` and await it (with a timeout) in `Unsubscribe`/`DisposeAsync` before disposing the CTS, so disposal is deterministic and no callback can fire after teardown. At minimum, defer `Cts.Dispose()` until the loop task has observed cancellation, or wrap the `Task.Delay` await to also tolerate `ObjectDisposedException`.
**Resolution:** Resolved 2026-05-22 — stored the loop `Task` in `SubscriptionState.LoopTask`; `Unsubscribe` calls `StopState` which cancels then awaits the task (5 s timeout) before disposing the CTS; `DisposeAsync` cancels all loops in parallel then awaits them all via `Task.WhenAll` with a 5 s timeout before disposing CTSs, making teardown deterministic and preventing post-disposal callbacks.
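The cancel, await, then dispose ordering from the resolution can be sketched as below. `PollLoopHandle` is a hypothetical type for illustration, not `SubscriptionState`; the 5 s timeout mirrors the resolution text, and `Task.WaitAsync` assumes .NET 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class PollLoopHandle : IAsyncDisposable
{
    private readonly CancellationTokenSource _cts = new();
    private readonly Task _loopTask; // kept so teardown can await the loop

    public PollLoopHandle(Func<CancellationToken, Task> pollOnce, TimeSpan interval)
    {
        _loopTask = Task.Run(() => RunAsync(pollOnce, interval, _cts.Token));
    }

    private static async Task RunAsync(
        Func<CancellationToken, Task> pollOnce, TimeSpan interval, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            try
            {
                await pollOnce(ct).ConfigureAwait(false);
                await Task.Delay(interval, ct).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                return; // cancellation observed; exit cleanly
            }
        }
    }

    public async ValueTask DisposeAsync()
    {
        _cts.Cancel();
        try
        {
            // Wait for the loop to observe cancellation before touching the CTS.
            await _loopTask.WaitAsync(TimeSpan.FromSeconds(5)).ConfigureAwait(false);
        }
        catch (TimeoutException)
        {
            // The loop did not stop in time; a real implementation would log this.
        }
        _cts.Dispose();
    }
}
```

Because the CTS is disposed only after the loop task completes (or the timeout elapses), a poll that is in flight at teardown finishes before `DisposeAsync` returns, so no callback fires afterwards in the normal case.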
### Core.Abstractions-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverTypeRegistry.cs:23-40` |
| Status | Open |
**Description:** `Register` performs a check-then-act sequence (`snapshot.ContainsKey` then build `next` then `Interlocked.Exchange`) that is not atomic. Two threads registering concurrently can both pass the duplicate check and both build a `next` dictionary; the second `Interlocked.Exchange` then wins and silently discards the first registration, defeating the documented "registered only once" guarantee. The class XML doc states registration happens single-threaded at startup, so this is not a live defect — but the use of `Interlocked.Exchange` for the swap implies the type is fully thread-safe for writers, which it is not. The mismatch between the implementation's apparent intent and its actual guarantee is a maintenance hazard.
**Recommendation:** Either guard `Register` with a `lock` so the check-build-swap is atomic, or strengthen the XML `Thread-safety` remark to state explicitly that concurrent `Register` calls are unsupported and only reader/writer concurrency is safe.
**Resolution:** _(open)_
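The first option in the recommendation (a `lock` around the check-build-swap) could look roughly like this. `CopyOnWriteRegistry<TValue>` is a hypothetical stand-in, not the real `DriverTypeRegistry`; it only illustrates how writers can be serialized while readers stay lock-free on an immutable snapshot.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
using System.Threading;

public sealed class CopyOnWriteRegistry<TValue>
{
    private readonly object _writeLock = new();
    private Dictionary<string, TValue> _snapshot = new(StringComparer.Ordinal);

    public void Register(string key, TValue value)
    {
        // Serialize writers so the duplicate check and the swap are one atomic step.
        lock (_writeLock)
        {
            if (_snapshot.ContainsKey(key))
                throw new InvalidOperationException($"'{key}' is already registered.");

            // Build a fresh dictionary; published snapshots are never mutated.
            var next = new Dictionary<string, TValue>(_snapshot, StringComparer.Ordinal)
            {
                [key] = value
            };
            Volatile.Write(ref _snapshot, next);
        }
    }

    // Readers take no lock; they see either the old or the new snapshot.
    public bool TryGet(string key, [MaybeNullWhen(false)] out TValue value) =>
        Volatile.Read(ref _snapshot).TryGetValue(key, out value);
}
```

With the lock in place, two concurrent registrations of the same key cannot both pass the duplicate check, so exactly one succeeds and the other throws.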
### Core.Abstractions-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/PollGroupEngine.cs:90,99` |
| Status | Open |
**Description:** Both the initial-poll and steady-state catch blocks use a bare `catch { }` that swallows every exception type, including non-transient programmer errors such as `NullReferenceException` and `ArgumentOutOfRangeException` (see Core.Abstractions-002). The XML remark says "transient poll error — loop continues, driver health surface logs it", but the engine never actually notifies the driver — there is no callback or event for a caught exception, so the driver's health surface has nothing to log. A persistently failing reader produces a silently spinning loop with zero observability from inside this module.
**Recommendation:** Narrow the catch to the exception types a reader is expected to throw (or at least exclude obviously-fatal ones), and add an optional `Action<Exception>` error callback (or raise an event) so the owning driver can record poll failures on its health surface as the doc claims happens.
**Resolution:** _(open)_
### Core.Abstractions-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs:63,84-86`, `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/Historian/IHistorianDataSource.cs:30,63` |
| Status | Open |
**Description:** The two history-read surfaces use inconsistent integer types for the same "maximum rows" concept. `IHistoryProvider.ReadRawAsync` and `IHistorianDataSource.ReadRawAsync` take `uint maxValuesPerNode`, but `ReadEventsAsync` (on both interfaces) takes `int maxEvents`. The OPC UA `HistoryRead` service request fields are unsigned, and a negative `maxEvents` has no defined meaning. Mixing `int` and `uint` for the same parameter role across sibling methods forces every caller and implementer to reason about the inconsistency and risks accidental sign issues at the boundary.
**Recommendation:** Standardize on `uint` for all max-rows parameters across both `IHistoryProvider` and `IHistorianDataSource` (or document explicitly why `maxEvents` differs).
**Resolution:** _(open)_
### Core.Abstractions-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions.Tests/PollGroupEngineTests.cs` |
| Status | Open |
**Description:** `PollGroupEngine` is the only behavioural (non-DTO) type in the module and its tests, while solid for the happy paths, miss two paths that this review identifies as defect-prone: (a) no test exercises an array-valued tag whose contents are unchanged across polls (would catch Core.Abstractions-001), and (b) no test exercises a reader that returns a snapshot list shorter than the input references (would catch Core.Abstractions-002). The `Reader_exception_does_not_crash_loop` test only covers a reader that throws before producing any result. `DataValueSnapshot` change-detection semantics for reference-typed values are therefore unverified.
**Recommendation:** Add tests for the unchanged-array case and the short-result-list case once Core.Abstractions-001/002 are addressed, so the intended contract is locked down.
**Resolution:** _(open)_
### Core.Abstractions-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverHealth.cs:9`, `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs:39-43,65-69` |
| Status | Open |
**Description:** Two XML-doc inaccuracies:
1. `DriverHealth.LastError` is documented as "Most recent error message; null when state is Healthy." The `DriverState` enum also defines `Degraded`, `Reconnecting`, and `Faulted` states, all of which carry an error; and a driver in `Healthy` state may legitimately retain the last error from a previously-recovered failure. The "null when Healthy" claim is stronger than the type enforces and than callers should rely on.
2. `IHistoryProvider.ReadAtTimeAsync` and `ReadEventsAsync` are C# default interface methods whose `<remarks>` say "Default implementation throws". This is accurate, but the sibling `IHistorianDataSource` declares the same methods as required (non-default) members — the asymmetry between the two history surfaces is undocumented and could surprise an implementer who assumes parity.
**Recommendation:** Reword `DriverHealth.LastError` to "Most recent error message; may be null when no error has been recorded" without tying nullness to a specific state. Add a one-line note on `IHistoryProvider`/`IHistorianDataSource` explaining why one surface uses default methods and the other does not.
**Resolution:** _(open)_
@@ -0,0 +1,192 @@
# Code Review — Core.AlarmHistorian
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 2 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.AlarmHistorian-001, Core.AlarmHistorian-002 |
| 2 | OtOpcUa conventions | Core.AlarmHistorian-003 |
| 3 | Concurrency & thread safety | Core.AlarmHistorian-004, Core.AlarmHistorian-005 |
| 4 | Error handling & resilience | Core.AlarmHistorian-006, Core.AlarmHistorian-007 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Core.AlarmHistorian-008 |
| 7 | Design-document adherence | Core.AlarmHistorian-009 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Core.AlarmHistorian-010 |
| 10 | Documentation & comments | Core.AlarmHistorian-011 |
## Findings
### Core.AlarmHistorian-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:255-278` |
| Status | Resolved |
**Description:** `ReadBatch` builds two parallel lists, `rowIds` and `events`, that `DrainOnceAsync` later indexes together (`rowIds[i]` paired with `outcomes[i]`, where `outcomes` is 1:1 with `events`). But `rowIds.Add(reader.GetInt64(0))` runs unconditionally for every row, while `events.Add(evt)` is guarded by `if (evt is not null)`. If `JsonSerializer.Deserialize<AlarmHistorianEvent>` returns `null` for any row (for example a payload holding the JSON literal `null`), `rowIds` gains an entry but `events` does not. The writer then returns one outcome per event, so `outcomes.Count == events.Count` and the `outcomes.Count != events.Count` guard passes; the per-row loop then applies each outcome to `rowIds[i]`, so every row from the skipped index onward is mapped to the wrong event's outcome. An `Ack` can delete a row whose event was never sent to the historian (silent alarm-event data loss), and a `PermanentFail` can dead-letter an unrelated good row. The corrupt row itself is never advanced and is re-read on every drain forever, permanently stalling the queue head.
**Recommendation:** Keep `rowIds` and `events` strictly aligned. Either skip the `rowId` when deserialization returns `null`, or — better — treat a `null`/failed deserialization as an immediate dead-letter for that specific `RowId` (it can never succeed) and exclude it from the batch passed to the writer. Carry the `rowId` inside a single list of `(long RowId, AlarmHistorianEvent Event)` tuples so the two can never drift.
**Resolution:** Resolved 2026-05-22 — `ReadBatch` now returns a single list of `QueueRow(long RowId, AlarmHistorianEvent? Event)` records so a rowId can never drift from its event; `DrainOnceAsync` immediately dead-letters rows whose payload is null/un-deserializable (also catching `JsonException`) and forwards only well-formed events to the writer, mapping outcomes by `liveRows[i].RowId`. Regression tests `Drain_with_corrupt_payload_row_deadletters_it_and_keeps_good_rows_aligned` and `Drain_with_corrupt_head_row_does_not_stall_queue` added.
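The single-list approach from the resolution can be illustrated with the sketch below. The `AlarmEvent` record is a simplified placeholder for `AlarmHistorianEvent`, and `BatchPartitioner` is a hypothetical helper; the real `ReadBatch`/`DrainOnceAsync` code is not shown in this review, so treat the shape as an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Simplified placeholder for the real event type.
public sealed record AlarmEvent(string AlarmId, string Message);

// RowId and event travel together, so they cannot drift apart.
public sealed record QueueRow(long RowId, AlarmEvent? Event);

public static class BatchPartitioner
{
    // A payload that fails to deserialize (or deserializes to null) yields a
    // row with Event == null instead of being silently skipped.
    public static QueueRow Parse(long rowId, string payloadJson)
    {
        try
        {
            return new QueueRow(rowId, JsonSerializer.Deserialize<AlarmEvent>(payloadJson));
        }
        catch (JsonException)
        {
            return new QueueRow(rowId, null);
        }
    }

    // Splits a batch into rows that can be sent and row ids to dead-letter.
    public static (List<QueueRow> Live, List<long> DeadLetterIds) Partition(IEnumerable<QueueRow> rows)
    {
        var live = new List<QueueRow>();
        var deadLetterIds = new List<long>();
        foreach (var row in rows)
        {
            if (row.Event is null) deadLetterIds.Add(row.RowId);
            else live.Add(row);
        }
        return (live, deadLetterIds);
    }
}
```

Writer outcomes are then applied by `live[i].RowId`, so a bad row in the middle of a batch can no longer shift the outcomes of the rows behind it.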
### Core.AlarmHistorian-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:99-105,386-388` |
| Status | Resolved |
**Description:** The class computes an exponential-backoff value (`_backoffIndex`, `BumpBackoff`, `CurrentBackoff`, the `BackoffLadder`) and the class doc-comment states "Drain runs on a shared `Timer`. Exponential backoff on `RetryPlease`: 1s → 2s → 5s → 15s → 60s cap." However `StartDrainLoop` creates the `Timer` with a fixed `tickInterval` for both due-time and period and never reschedules it. `CurrentBackoff` is computed but never consulted by the timer, so the drain loop keeps hammering the historian at the fixed cadence regardless of `BackingOff` state. The documented backoff behavior does not exist for the production drain path — it is only observable via the `CurrentBackoff` property in tests.
**Recommendation:** Make the drain loop honor the backoff. Either switch to a self-rescheduling one-shot timer that sets its next due-time to `max(tickInterval, CurrentBackoff)` after each `DrainOnceAsync`, or have `DrainOnceAsync` skip the writer call while still inside the backoff window (track `_nextEligibleDrainUtc`). Update the doc-comment if the design intentionally changes.
**Resolution:** Resolved 2026-05-22 — `StartDrainLoop` now arms a self-rescheduling one-shot `Timer`; `RescheduleDrain` sets the next due-time to `max(tickInterval, CurrentBackoff)` while `_drainState` is `BackingOff` so a historian outage genuinely slows the cadence down the ladder. Class doc-comment updated. Regression tests `StartDrainLoop_honors_backoff_and_slows_cadence_under_retry` and `StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy` added.
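The due-time calculation for a self-rescheduling one-shot timer can be sketched as a small pure helper. `DrainBackoff` is a hypothetical class; the ladder values come from the doc-comment quoted above, while the starting index and the bump/reset semantics are assumptions.

```csharp
using System;

public sealed class DrainBackoff
{
    // Ladder quoted in the class doc-comment: 1s, 2s, 5s, 15s, 60s cap.
    private static readonly TimeSpan[] Ladder =
    {
        TimeSpan.FromSeconds(1),
        TimeSpan.FromSeconds(2),
        TimeSpan.FromSeconds(5),
        TimeSpan.FromSeconds(15),
        TimeSpan.FromSeconds(60),
    };

    private int _index; // assumed to start at the first rung

    public TimeSpan Current => Ladder[_index];

    // Move one rung up the ladder, saturating at the cap.
    public void Bump() => _index = Math.Min(_index + 1, Ladder.Length - 1);

    // Return to the first rung after a successful drain.
    public void Reset() => _index = 0;

    // Next due-time for a one-shot timer: max(tickInterval, Current) while
    // backing off, otherwise the steady tick interval.
    public TimeSpan NextDueTime(TimeSpan tickInterval, bool backingOff) =>
        backingOff && Current > tickInterval ? Current : tickInterval;
}
```

After each drain, the one-shot timer would be re-armed with something like `timer.Change(backoff.NextDueTime(tick, backingOff), Timeout.InfiniteTimeSpan)`, so the cadence slows down the ladder during an outage and returns to the tick interval once the writer recovers.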
### Core.AlarmHistorian-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | OtOpcUa conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:107-127,218-243,246-253` |
| Status | Resolved |
**Description:** `EnqueueAsync` is declared `async`-shaped (`Task EnqueueAsync(...)`) and the `IAlarmHistorianSink` contract explicitly states "the sink MUST NOT block the emitting thread … `EnqueueAsync` returns as soon as the queue row is committed." But the implementation does fully synchronous, blocking SQLite I/O (`conn.Open()`, `EnforceCapacity`, `cmd.ExecuteNonQuery()`) on the caller's thread and only then returns `Task.CompletedTask`. Under SQLite write contention with the drain worker this blocks the alarm-emitting thread for the full lock-wait. The same synchronous-work-behind-an-async-or-status-API pattern applies to `GetStatus` (called from the Admin UI / `/healthz` request thread) and `RetryDeadLettered`. The `cancellationToken` parameter of `EnqueueAsync` is accepted and ignored.
**Recommendation:** Either make the I/O genuinely asynchronous (`await conn.OpenAsync(ct)`, `await cmd.ExecuteNonQueryAsync(ct)`; `Microsoft.Data.Sqlite` supports the async surface), or change `EnqueueAsync` to an in-memory hand-off (e.g. a `Channel`) drained by a background writer so the emitting thread truly never touches the database. At minimum honor the `cancellationToken` parameter.
**Resolution:** Resolved 2026-05-22 — `EnqueueAsync` now uses `OpenAsync` / `ExecuteNonQueryAsync` / `ExecuteScalarAsync` throughout (capacity check included); `ApplyPragmasAsync` handles the WAL/busy-timeout PRAGMA on the async path; `cancellationToken` is threaded through every await so cancellation is honoured.
### Core.AlarmHistorian-004
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:90,112,176,259` |
| Status | Resolved |
**Description:** Every operation opens a brand-new `SqliteConnection` from the bare connection string `Data Source={databasePath}` — no `busy_timeout` / `Pragma`, no shared cache. SQLite serializes writers with a file lock; when `EnqueueAsync` (emitting thread) and `DrainOnceAsync` (timer thread) collide, the loser gets an immediate `SQLITE_BUSY` exception because the default busy timeout is 0. In `DrainOnceAsync` the `BeginTransaction()` / `Commit()` block can fail mid-drain with `SQLITE_BUSY`; the exception escapes the `try` (it is not the writer-call `try`), the `finally` releases the gate, and the row outcomes are lost / partially applied. The class doc-comment claims `DrainOnceAsync` is "Safe to call from multiple threads" but the concurrent enqueue-vs-drain case is not actually safe against busy errors.
**Recommendation:** Configure a non-zero busy timeout — `SqliteConnectionStringBuilder { DataSource = databasePath, DefaultTimeout = 5 }` plus `PRAGMA busy_timeout=5000` on open. Strongly consider WAL journal mode (`PRAGMA journal_mode=WAL`) so readers and the writer do not block each other. Reuse a single long-lived write connection guarded by `_drainGate` rather than opening/closing per call.
**Resolution:** Resolved 2026-05-22 — the connection string is now built via `SqliteConnectionStringBuilder` with `DefaultTimeout = 5`, and every connection is opened through a new `OpenConnection` helper that applies `PRAGMA busy_timeout=5000` and `PRAGMA journal_mode=WAL` so an enqueue/drain lock collision waits the lock out instead of throwing `SQLITE_BUSY`. All eight call sites switched to the helper. Regression test `Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy` added.
### Core.AlarmHistorian-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:66-71,141-143,199,386-388` |
| Status | Resolved |
**Description:** The mutable status fields `_lastDrainUtc`, `_lastSuccessUtc`, `_lastError`, `_drainState`, and `_backoffIndex` are written by the drain timer thread inside `DrainOnceAsync` and read concurrently by `GetStatus()` / `CurrentBackoff` on Admin-UI / health-check threads with no memory barrier (no `lock`, no `volatile`, no `Interlocked`). `DateTime?` is not guaranteed to be written atomically, so a reader can observe a stale or torn value. This is a diagnostics surface, so the impact is limited, but a torn `DateTime?` read is still a genuine correctness hazard.
**Recommendation:** Guard the status fields with a small lock, or make the scalars `volatile` where the type permits and snapshot `DateTime?` values under a lock. Take the snapshot atomically in `GetStatus()`.
**Resolution:** Resolved 2026-05-22 — added `_statusLock` object; all writes to `_lastDrainUtc`, `_lastSuccessUtc`, `_lastError`, `_drainState`, and `_evictedCount` (new) now happen inside `lock (_statusLock)` blocks; `GetStatus()` snapshots all fields atomically under the same lock. Regression test `GetStatus_snapshot_is_consistent_under_concurrent_drain` added.
### Core.AlarmHistorian-006
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:103,135-216` |
| Status | Resolved |
**Description:** `StartDrainLoop` launches the drain with `new Timer(_ => _ = DrainOnceAsync(CancellationToken.None), ...)`. The returned `Task` is discarded (`_ =`), so any exception thrown by `DrainOnceAsync` is an unobserved task exception — never logged, never surfaced. Several paths in `DrainOnceAsync` can throw: the `outcomes.Count != events.Count` guard (`InvalidOperationException`), `JsonSerializer.Deserialize` on a malformed payload, `PurgeAgedDeadLetters` / `ReadBatch` / the commit block hitting `SQLITE_BUSY` or a schema error. When any of these throw, the drain silently stops making progress for that tick, `_drainState` is left stale (still `Draining`), and an operator watching the Admin UI sees no error. A persistently failing condition produces a silent, permanently stalled queue.
**Recommendation:** Wrap the timer callback body in a `try/catch` that logs the exception via `_logger.Error`, records it into `_lastError`, and resets `_drainState` so the diagnostics surface reflects the failure. Do not discard the `Task` without an attached continuation that observes faults.
**Resolution:** Resolved 2026-05-22 — the timer no longer discards the drain `Task`. A dedicated `DrainTimerCallback` `await`s `DrainOnceAsync` inside a `try/catch` that logs the fault via `_logger.Error`, records it into `_lastError`, and sets `_drainState = BackingOff` so the failure is visible on the `GetStatus` surface; a `finally` always re-arms the timer so a faulting tick can never permanently stall the queue. Regression test `StartDrainLoop_records_drain_fault_and_keeps_running` added.
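The guarded callback described in the resolution follows a try/catch/finally shape along these lines. `GuardedDrainTick` is a hypothetical wrapper, not the actual `DrainTimerCallback`; logging and the `_drainState` update are reduced to a single `LastError` property for brevity, and that property is not synchronized here (see Core.AlarmHistorian-005).

```csharp
using System;
using System.Threading.Tasks;

public sealed class GuardedDrainTick
{
    private readonly Func<Task> _drainOnce;
    private readonly Action _rearm;

    public GuardedDrainTick(Func<Task> drainOnce, Action rearm)
    {
        _drainOnce = drainOnce;
        _rearm = rearm;
    }

    // Last fault message; a real sink would also log it and update its drain state.
    public string? LastError { get; private set; }

    public async Task RunAsync()
    {
        try
        {
            // Awaiting (instead of discarding) the task makes faults observable.
            await _drainOnce().ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            LastError = ex.Message;
        }
        finally
        {
            // Always re-arm so a faulting tick cannot stall the queue.
            _rearm();
        }
    }
}
```

The timer callback would invoke `RunAsync()`; because every exception is caught inside it, discarding the returned task at the call site no longer hides faults.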
### Core.AlarmHistorian-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:172-174` |
| Status | Resolved |
**Description:** When the writer returns a wrong-cardinality result, the code throws `InvalidOperationException` after `WriteBatchAsync` has already succeeded. The events were potentially delivered to the historian, but no rows are deleted or dead-lettered, `_drainState` is left at `Draining`, and the backoff is not bumped. Combined with Core.AlarmHistorian-006 the exception is then swallowed. On the next drain the same batch is re-sent — if the writer actually delivered the events the first time, this produces duplicate historian rows; if it is a deterministic writer bug the queue stalls forever.
**Recommendation:** Treat a cardinality mismatch as a transient batch failure: log it, set `_lastError`, bump backoff, set `_drainState = BackingOff`, and return without throwing — mirroring the writer-exception path at lines 162-170. A deterministic writer contract violation should also raise an operator-visible alert rather than silently looping.
**Resolution:** Resolved 2026-05-22 — the `throw InvalidOperationException` replaced with log-and-backoff: mismatch is recorded into `_lastError`, `_drainState = BackingOff`, backoff is bumped, and the method returns without applying any outcomes — rows stay queued for the next drain attempt. Regression test `Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows` added.
### Core.AlarmHistorian-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:107-127,255-278` |
| Status | Open |
**Description:** Each `EnqueueAsync` (one per alarm transition — a hot path on a busy plant) opens a connection, runs `EnforceCapacity` (a `COUNT(*)` over the queue table on every single enqueue), serializes JSON, inserts, and closes the connection. The unconditional `COUNT(*)` on every enqueue is an avoidable scan; the open/close churn defeats connection pooling benefits and adds lock-acquisition overhead per event. `DrainOnceAsync` similarly opens three separate connections per tick (`PurgeAgedDeadLetters`, `ReadBatch`, the transaction block).
**Recommendation:** Reuse a single pooled write connection. Replace the per-enqueue `COUNT(*)` with a periodic capacity check (every Nth enqueue, or piggy-backed on the drain tick), or maintain an in-memory approximate counter. Combine the drain-tick connections into one.
**Resolution:** _(open)_
### Core.AlarmHistorian-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:317-347` |
| Status | Resolved |
**Description:** `docs/AlarmTracking.md` and the `IAlarmHistorianSink` contract present the SQLite queue as the durability guarantee — "Durably enqueue the event", "operator acks never block on the historian being reachable". But `EnforceCapacity` silently deletes the oldest non-dead-lettered (not-yet-sent) rows when the queue reaches `DefaultCapacity` (1,000,000). Those are alarm-event records that were accepted as durably queued and are then dropped before ever reaching the historian — silent alarm-history data loss under sustained historian outage. The only signal is a `WARN` log line. Neither `docs/AlarmTracking.md` nor the sink's XML doc mentions that the durability guarantee is bounded, and there is no metric/dead-letter trail for evicted rows.
**Recommendation:** At minimum document the bounded-durability behavior in `docs/AlarmTracking.md` and the `IAlarmHistorianSink` summary. Better: surface evicted-row counts in `HistorianSinkStatus` (a dedicated counter) so the loss is operator-visible, and consider routing overflow to the dead-letter table instead of hard-deleting it so the records survive for post-mortem within the retention window.
**Resolution:** Resolved 2026-05-22 — added `EvictedCount` (default 0) to `HistorianSinkStatus` with full param-tag documentation; `EnforceCapacity` and `EnforceCapacityAsync` now increment `_evictedCount` (guarded by `_statusLock`) and include the lifetime total in the WARN log; `docs/AlarmTracking.md` documents the bounded-durability caveat and the `EvictedCount` surface. Regression test `Capacity_eviction_increments_evicted_count_on_status` added.
### Core.AlarmHistorian-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests/SqliteStoreAndForwardSinkTests.cs` |
| Status | Resolved |
**Description:** The test suite covers the happy paths well (Ack/Retry/PermanentFail, capacity eviction, retention purge, ctor validation) but leaves critical paths untested: (a) no test exercises a corrupt / `null`-deserializing `PayloadJson` row, so the `rowIds`/`events` misalignment bug (Core.AlarmHistorian-001) was not caught; (b) no test for `StartDrainLoop` actually running on the timer, nor for the backoff being honored by the schedule (Core.AlarmHistorian-002); (c) no concurrency test running `EnqueueAsync` and `DrainOnceAsync` in parallel, which is the exact scenario that triggers `SQLITE_BUSY` (Core.AlarmHistorian-004); (d) no test for the `outcomes.Count != events.Count` cardinality-mismatch branch (Core.AlarmHistorian-007).
**Recommendation:** Add tests for: a corrupt payload row (insert raw bad JSON via a direct SQLite write, then drain and assert the correct row is dead-lettered and others are unaffected); a `FakeWriter` returning a wrong-length outcome list; a parallel enqueue/drain stress test; and the timer-driven `StartDrainLoop` path.
**Resolution:** Resolved 2026-05-22 — (a) `Drain_with_corrupt_payload_row_deadletters_it_and_keeps_good_rows_aligned` and `Drain_with_corrupt_head_row_does_not_stall_queue` cover corrupt payloads; (b) `StartDrainLoop_honors_backoff_and_slows_cadence_under_retry`, `StartDrainLoop_keeps_steady_cadence_when_writer_is_healthy`, and `StartDrainLoop_records_drain_fault_and_keeps_running` cover the timer-driven path; (c) `Concurrent_enqueue_and_drain_do_not_throw_sqlite_busy` covers the concurrent stress path; (d) `Writer_returning_wrong_cardinality_outcomes_sets_backing_off_and_keeps_rows` covers the cardinality-mismatch branch. Additionally `Capacity_eviction_increments_evicted_count_on_status` and `GetStatus_snapshot_is_consistent_under_concurrent_drain` cover -009 and -005 respectively.
### Core.AlarmHistorian-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs:5-9,76`, `AlarmHistorianEvent.cs:20` |
| Status | Open |
**Description:** Several doc-comments reference the retired v1 architecture. The `IAlarmHistorianSink` summary says ingestion "routes through Galaxy.Host's pipe" and `IAlarmHistorianWriter` says "Stream G wires this to the Galaxy.Host IPC client", but `docs/AlarmTracking.md` and `CLAUDE.md` state the legacy `Galaxy.Host` project was retired in PR 7.2 and the write path is now the Wonderware historian sidecar (`WonderwareHistorianClient`). `AlarmHistorianEvent.cs:20` likewise says "the Galaxy.Host handler maps to the historian's enum on the wire." These stale references will mislead a reader about where the writer is actually hosted.
**Recommendation:** Update the doc-comments to refer to the Wonderware historian sidecar / `WonderwareHistorianClient` (`IAlarmHistorianWriter` implementation) instead of `Galaxy.Host`, consistent with `docs/AlarmTracking.md`'s "Historian write-back" section.
**Resolution:** _(open)_
@@ -0,0 +1,210 @@
# Code Review — Core.ScriptedAlarms
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 6 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.ScriptedAlarms-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.ScriptedAlarms-001, Core.ScriptedAlarms-004, Core.ScriptedAlarms-005, Core.ScriptedAlarms-006 |
| 4 | Error handling & resilience | Core.ScriptedAlarms-007 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Core.ScriptedAlarms-008, Core.ScriptedAlarms-009 |
| 7 | Design-document adherence | Core.ScriptedAlarms-010 |
| 8 | Code organization & conventions | Core.ScriptedAlarms-011 |
| 9 | Testing coverage | Core.ScriptedAlarms-012 |
| 10 | Documentation & comments | Core.ScriptedAlarms-003 |
## Findings
### Core.ScriptedAlarms-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:175`, `ScriptedAlarmEngine.cs:178`, `ScriptedAlarmEngine.cs:73`, `ScriptedAlarmEngine.cs:368` |
| Status | Resolved |
**Description:** `_alarms` is a plain `Dictionary<string, AlarmState>` (line 42). Every mutation of it (`LoadAsync`, `ApplyAsync`, `ReevaluateAsync`, `ShelvingCheckAsync`) correctly happens under the `_evalGate` semaphore, but four read paths touch it with no synchronisation: `GetState` (line 175), `GetAllStates` (lines 178-179), the `LoadedAlarmIds` property (line 73), and `RunShelvingCheck` (line 368, `_alarms.Keys.ToArray()`). `RunShelvingCheck` fires from a `Timer` thread-pool callback and can run concurrently with an `ApplyAsync`/`ReevaluateAsync` that is reassigning a dictionary entry. `Dictionary` is not safe for concurrent reads while another thread writes: even a value reassignment can be observed mid-rehash and throw `InvalidOperationException` or return torn state. `GetState`/`GetAllStates` are documented as being used by the Admin UI status page, so these reads come from arbitrary request threads.
**Recommendation:** Either switch `_alarms` to `ConcurrentDictionary<string, AlarmState>` (entry reassignment via `_alarms[id] = ...` is already the only write shape, which a `ConcurrentDictionary` supports atomically), or acquire `_evalGate` in every reader. A `ConcurrentDictionary` is the lighter change and matches `_valueCache`, which is already concurrent.
**Resolution:** Resolved 2026-05-22 — switched `_alarms` to `ConcurrentDictionary<string, AlarmState>` so the four unguarded read paths are safe against concurrent under-gate entry reassignment; added a concurrency regression test.
### Core.ScriptedAlarms-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `ScriptedAlarmEngine.cs:162`, `ScriptedAlarmEngine.cs:90` |
| Status | Resolved |
**Description:** `LoadAsync` is written to be re-callable — it begins by calling `UnsubscribeFromUpstream()`, `_alarms.Clear()`, and `_alarmsReferencing.Clear()` (lines 90-92), which only makes sense if a reload is supported. But at line 162 it unconditionally assigns `_shelvingTimer = new Timer(...)` without disposing the timer created by a previous `LoadAsync` call. A second `LoadAsync` therefore leaks the old `Timer` and leaves two timers running concurrently against the same `_alarms`/`_evalGate`. The old timer's `RunShelvingCheck` keeps firing forever.
**Recommendation:** Dispose any existing `_shelvingTimer` before reassigning it, e.g. `_shelvingTimer?.Dispose();` immediately before line 162, inside the `_evalGate` critical section. If reload is genuinely not supported, instead guard `LoadAsync` against a second call and document it as one-shot.
**Resolution:** Resolved 2026-05-22 — added `_shelvingTimer?.Dispose()` before the timer reassignment in `LoadAsync` so a second load call does not leak the previous timer.
### Core.ScriptedAlarms-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107` |
| Status | Open |
**Description:** `docs/ScriptedAlarms.md` (Composition step 3) and the `OnUpstreamChange` comment ("Fire-and-forget so driver-side dispatch isn't blocked", lines 225-226) describe the `OnEvent` emission path as non-blocking / fire-and-forget. In the code, `EmitEvent` invokes `OnEvent?.Invoke(this, evt)` **synchronously while `_evalGate` is held** (called from `EvaluatePredicateToStateAsync` line 305 and `ApplyAsync` line 217, both inside the gate). A slow subscriber blocks the single evaluation gate for all alarms; a subscriber that re-enters the engine (e.g. calls `AcknowledgeAsync`) deadlocks because `_evalGate` is a non-reentrant `SemaphoreSlim(1,1)`. The behaviour is defensible (the historian sink is non-blocking, per the doc), but the comments/doc are misleading about where the work happens and the re-entrancy hazard is undocumented.
**Recommendation:** Either move `EmitEvent` outside the `_evalGate` critical section (collect emissions during the locked section and raise them after `Release()`), or document explicitly on `OnEvent` that handlers run under the engine lock, must be fast, and must never call back into the engine.
**Resolution:** _(open)_
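The first option in the recommendation could look roughly like the sketch below: emissions are collected while the gate is held and raised only after `Release()`. `DeferredEmitSketch` is a hypothetical reduction; the `string` payload and `Value` property are placeholders for the engine's event and state types.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class DeferredEmitSketch
{
    private readonly SemaphoreSlim _evalGate = new(1, 1);

    public event EventHandler<string>? OnEvent;
    public int Value { get; private set; }

    public async Task ApplyAsync(int newValue)
    {
        var pending = new List<string>();

        await _evalGate.WaitAsync();
        try
        {
            Value = newValue;
            pending.Add($"value changed to {newValue}"); // record, do not raise yet
        }
        finally
        {
            _evalGate.Release();
        }

        // Raised outside the gate: a slow handler no longer stalls evaluation,
        // and a handler that calls back into ApplyAsync can acquire the gate.
        foreach (var evt in pending)
        {
            try { OnEvent?.Invoke(this, evt); }
            catch (Exception) { /* isolate subscriber faults; a real engine would log here */ }
        }
    }
}
```

The trade-off is ordering: once emission happens outside the gate, two concurrent callers can deliver their events in a different order from the order in which the state changed, so subscribers that need strict ordering would want a sequence number or a single dispatch queue.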
### Core.ScriptedAlarms-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:138-143`, `ScriptedAlarmEngine.cs:227-234` |
| Status | Resolved |
**Description:** During `LoadAsync`, `_upstream.SubscribeTag(path, OnUpstreamChange)` is called inside the `_evalGate` critical section (line 142). If an upstream implementation delivers an initial value synchronously from inside `SubscribeTag` (a common pattern, and the `ITagUpstreamSource` contract does not forbid it), the observer callback `OnUpstreamChange` runs on the calling thread, schedules `ReevaluateAsync`, which calls `_evalGate.WaitAsync`. That does not deadlock (the re-evaluation task simply waits asynchronously until `LoadAsync` releases the gate), but it can cause a re-evaluation to run against a half-initialised `_alarms`/index, and the value written to `_valueCache` on line 141 may be immediately overwritten by the subscription's synchronous push with no defined ordering. The cold-start guard partly masks this, but the ordering between the seed read (line 141) and the subscription push is unspecified and may seed a stale value.
**Recommendation:** Subscribe to all upstream tags after the seed reads and after `_loaded = true`, or capture the subscription's first push into the cache and treat `SubscribeTag` as the single source of truth (drop the separate `ReadTag` seed). Document the expected `ITagUpstreamSource` delivery semantics (does `SubscribeTag` push an initial value?).
**Resolution:** Resolved 2026-05-22 — split the seed/subscribe loop: `ReadTag` seeds `_valueCache`, persisted-state restore runs, `_loaded = true` is set, then `SubscribeTag` is called; any synchronous initial push now arrives after `_alarms` is fully initialised and correctly queues behind the gate.
### Core.ScriptedAlarms-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:365-369`, `ScriptedAlarmEngine.cs:416-424` |
| Status | Resolved |
**Description:** `Dispose` sets `_disposed = true`, disposes `_shelvingTimer`, and clears `_alarms`. A `RunShelvingCheck` callback already in flight on a thread-pool thread can have passed its `if (_disposed) return;` check (line 367) before `Dispose` ran, then proceed into `ShelvingCheckAsync`, which awaits `_evalGate` and mutates `_alarms` — concurrently with `Dispose`'s `_alarms.Clear()` at line 422 (which runs outside `_evalGate`). `Timer.Dispose()` does not wait for the running callback to finish. The result is a possible `InvalidOperationException` from a dictionary mutated during enumeration, or a save of stale state to the store after dispose. The same applies to a `ReevaluateAsync` in flight from a late upstream push.
**Recommendation:** Use `Timer.Dispose(WaitHandle)` (or `DisposeAsync`) to wait for the callback to drain, and perform `_alarms.Clear()` under `_evalGate` (or simply drop the clear — the object is being discarded). Also have `ShelvingCheckAsync`/`ReevaluateAsync` re-check `_disposed` after acquiring the gate before mutating/saving.
**Resolution:** Resolved 2026-05-22 — added `_disposed` re-checks in `ReevaluateAsync` and `ShelvingCheckAsync` after acquiring `_evalGate` so late callbacks bail out cleanly; dropped the unsynchronised `_alarms.Clear()` from `Dispose` since the object is being discarded and the clear raced concurrent reads.
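The re-check described in this resolution follows a common shape, sketched here under assumptions: `GateRecheckSketch` and its `Mutations` counter are hypothetical, and the real methods do considerably more work inside the gate.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class GateRecheckSketch : IDisposable
{
    private readonly SemaphoreSlim _evalGate = new(1, 1);
    private volatile bool _disposed;

    public int Mutations { get; private set; }

    // Returns true only if the check actually ran against live state.
    public async Task<bool> ShelvingCheckAsync()
    {
        if (_disposed) return false;      // cheap early exit

        await _evalGate.WaitAsync();
        try
        {
            // Dispose may have run while this call was waiting for the gate.
            if (_disposed) return false;

            Mutations++;                  // placeholder for the mutate-and-save work
            return true;
        }
        finally
        {
            _evalGate.Release();
        }
    }

    // The gate itself is deliberately left undisposed in this sketch:
    // a late callback may still be waiting on it.
    public void Dispose() => _disposed = true;
}
```

The first check is only an optimisation; the second one, taken after the gate is acquired, is what stops a callback that was already in flight from mutating or saving after disposal.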
### Core.ScriptedAlarms-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369` |
| Status | Open |
**Description:** `OnUpstreamChange` and `RunShelvingCheck` both launch fire-and-forget tasks (`_ = ReevaluateAsync(...)`, `_ = ShelvingCheckAsync(...)`) with `CancellationToken.None`. There is no tracking of these in-flight tasks, so `Dispose` cannot await them and a server shutdown can race a still-running re-evaluation that writes to the (possibly disposed) store. Combined with finding 005, an upstream push arriving during shutdown produces an unobserved background task touching torn state.
**Recommendation:** Track outstanding background tasks (or use a single serialised worker / `Channel`), and link them to a `CancellationTokenSource` that `Dispose` cancels and drains. At minimum, await the in-flight work in `Dispose`.
**Resolution:** _(open)_
### Core.ScriptedAlarms-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `ScriptedAlarmEngine.cs:216`, `ScriptedAlarmEngine.cs:251`, `ScriptedAlarmEngine.cs:154`, `ScriptedAlarmEngine.cs:387` |
| Status | Resolved |
**Description:** Every state mutation calls `await _store.SaveAsync(...)` and relies on it succeeding. If the production SQL-backed `IAlarmStateStore` (Stream E) throws — transient SQL outage, deadlock, timeout — the exception propagates: in `ApplyAsync` it surfaces to the Part 9 method caller *after* the in-memory `_alarms` entry was already updated (line 215 runs before the save on line 216), leaving the in-memory state and the persisted state divergent; in `ReevaluateAsync`/`ShelvingCheckAsync` it is caught and logged, but again the in-memory `_alarms` entry was already advanced (lines 250/386) so the persisted store silently falls behind the live state. After a restart, startup recovery reloads the stale persisted state and operators can see a re-raised or re-ackable alarm. The docs claim "the store's view is always consistent with the in-memory state" (`docs/ScriptedAlarms.md` State persistence) — that invariant is not actually enforced.
**Recommendation:** Save before committing the in-memory update, or roll back the in-memory entry if `SaveAsync` fails, so the two never diverge. Classify transient store failures and retry, and surface a hard error/health-degraded signal if persistence is permanently failing rather than silently logging and continuing.
**Resolution:** Resolved 2026-05-22 — reordered `SaveAsync`/`_alarms[id]=` in `ApplyAsync`, `ReevaluateAsync`, and `ShelvingCheckAsync` so persistence happens before the in-memory update; a store failure now leaves both views at the prior state rather than diverging.
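The save-before-commit ordering can be illustrated with a small sketch. `IStateStoreSketch`, `SaveFirstSketch`, and `FlakyStoreSketch` are hypothetical names; the real `IAlarmStateStore` signature is not shown in this review, and string states stand in for `AlarmState`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical store contract, assumed for illustration only.
public interface IStateStoreSketch
{
    Task SaveAsync(string alarmId, string state);
}

public sealed class SaveFirstSketch
{
    private readonly ConcurrentDictionary<string, string> _alarms = new();
    private readonly IStateStoreSketch _store;

    public SaveFirstSketch(IStateStoreSketch store) => _store = store;

    public async Task ApplyAsync(string alarmId, string nextState)
    {
        // Persist first: if this throws, the in-memory entry is never touched.
        await _store.SaveAsync(alarmId, nextState);
        _alarms[alarmId] = nextState;
    }

    public string? Get(string alarmId) =>
        _alarms.TryGetValue(alarmId, out var state) ? state : null;
}

// Test double that can be switched into a failing mode.
public sealed class FlakyStoreSketch : IStateStoreSketch
{
    public bool Fail { get; set; }

    public Task SaveAsync(string alarmId, string state) =>
        Fail
            ? Task.FromException(new InvalidOperationException("store unavailable"))
            : Task.CompletedTask;
}
```

With this ordering a failed save leaves both views at the prior state. It does not address the second half of the recommendation (retry classification and a health-degraded signal), which would sit around the `SaveAsync` call.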
### Core.ScriptedAlarms-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `Part9StateMachine.cs:261-268` |
| Status | Open |
**Description:** `AppendComment` copies the entire existing comment list into a new `List` on every audit-producing transition (ack, confirm, shelve, unshelve, enable, disable, add-comment, auto-unshelve). The `Comments` list is append-only and unbounded — for a long-lived alarm that is acknowledged/commented hundreds of times, every transition is an O(n) copy and the full history is also re-serialised to the store on every `SaveAsync`. Over a multi-month uptime this is a slowly growing per-transition cost.
**Recommendation:** Acceptable for now given audit requirements, but consider an immutable persistent list / `ImmutableList<AlarmComment>` to make append O(log n), or have the store persist comments incrementally (append-only audit table) rather than rewriting the whole collection each save. At minimum, note the unbounded-growth characteristic in the design doc.
**Resolution:** _(open)_
### Core.ScriptedAlarms-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` |
| Status | Open |
**Description:** `BuildReadCache` allocates a fresh `Dictionary<string, DataValueSnapshot>` on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. `AlarmPredicateContext` is also newly constructed each evaluation (line 281).
**Recommendation:** Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under `_evalGate`, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in `docs/v2/Galaxy.Performance.md` if alarm evaluation is ever soak-tested.
**Resolution:** _(open)_
### Core.ScriptedAlarms-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47` |
| Status | Open |
**Description:** Quality handling is inconsistent across the three places that inspect a `DataValueSnapshot.StatusCode`. `AreInputsReady` (engine, line 333) treats only outright Bad (bit 31) as not-ready, so an Uncertain-quality input is fed to the predicate. `MessageTemplate.Resolve` (line 47) rejects *any* non-zero status code — including Uncertain — and renders `{?}`. `AlarmPredicateContext.GetTag` returns `BadNodeIdUnknown` (`0x80340000`) for a missing path. The net effect: an Uncertain-quality tag is considered good enough to drive an alarm *activation* decision but not good enough to print in the alarm *message*. `docs/ScriptedAlarms.md` ("Fallback rules") only documents the message-template behaviour and does not mention that predicate evaluation accepts Uncertain. The two policies should be reconciled and documented.
**Recommendation:** Decide one quality policy for "is this input usable" and apply it in both `AreInputsReady` and the message resolver, or explicitly document why predicate evaluation and message rendering treat Uncertain differently. Add the predicate-side Uncertain rule to `docs/ScriptedAlarms.md`.
**Resolution:** _(open)_
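One way to reconcile the two policies is a single shared helper that both `AreInputsReady` and the message resolver call. The sketch below assumes the standard OPC UA status-code layout, where the top two bits carry severity (`00` Good, `01` Uncertain, `10` Bad); `QualitySketch` and `QualityPolicySketch` are hypothetical names, and whether Uncertain counts as usable is left as an explicit parameter because that decision is still open.

```csharp
using System;

public enum QualitySketch { Good, Uncertain, Bad }

public static class QualityPolicySketch
{
    // Severity lives in the two most significant bits of an OPC UA status code.
    public static QualitySketch Classify(uint statusCode) => (statusCode >> 30) switch
    {
        0u => QualitySketch.Good,
        1u => QualitySketch.Uncertain,
        _ => QualitySketch.Bad, // 0b10 is Bad; the reserved 0b11 pattern is treated as Bad here
    };

    // Single decision point for "is this input usable", shared by both call sites.
    public static bool IsUsable(uint statusCode, bool acceptUncertain) =>
        Classify(statusCode) switch
        {
            QualitySketch.Good => true,
            QualitySketch.Uncertain => acceptUncertain,
            _ => false,
        };
}
```

Passing the same `acceptUncertain` value from both the predicate guard and the template resolver would make the activation decision and the rendered message agree by construction.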
### Core.ScriptedAlarms-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `Part9StateMachine.cs:275` |
| Status | Open |
**Description:** `TransitionResult.NoOp(state, reason)` takes a `reason` string parameter that is documented in the calling code as a diagnostic ("disabled — predicate result ignored", "already acknowledged", etc.) but the factory method silently discards it — it just returns `new(state, EmissionKind.None)`, identical to `None(state)`. Every call site that passes a carefully-worded reason string is doing dead work, and the comments in `Part9StateMachine` and the class-level remarks claim disabled/no-op transitions "produce ... a diagnostic log line", which they do not.
**Recommendation:** Either propagate the reason (add it to `TransitionResult` and have the engine log it at debug level when emission is `None` for a no-op), or remove the unused `reason` parameter and collapse `NoOp` into `None`. Update the `Part9StateMachine` remarks that promise a diagnostic log line.
**Resolution:** _(open)_
### Core.ScriptedAlarms-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs` |
| Status | Resolved |
**Description:** Several engine behaviours central to the module have no test coverage: (1) the 5-second shelving timer / timed-shelve auto-expiry through the *engine* — only the pure `Part9StateMachine.ApplyShelvingCheck` is tested, never `ScriptedAlarmEngine` driving the timer with an injectable clock; (2) `ConfirmAsync`, `TimedShelveAsync`, `UnshelveAsync`, `EnableAsync` engine methods (only `Acknowledge`, `OneShotShelve`, `Disable`, `AddComment` are exercised); (3) `OnEvent` subscriber-throws isolation (`EmitEvent` catch on line 357); (4) `IAlarmStateStore.SaveAsync` failure handling (finding 007); (5) re-entrant `LoadAsync` and the timer leak (finding 002); (6) the cold-start `AreInputsReady` guard with Bad / null / Uncertain inputs. The `clock` and `scriptTimeout` constructor parameters exist specifically to make timer/timeout tests deterministic but no test uses them.
**Recommendation:** Add engine-level tests that inject a controllable `Func<DateTime>` clock to drive `RunShelvingCheck`, cover the remaining Part 9 engine methods end-to-end, assert subscriber-exception isolation, and add a store-failure fake to lock in the chosen persistence-failure semantics from finding 007.
**Resolution:** Resolved 2026-05-22 — added 8 new engine-level tests covering all 6 gap areas: injectable-clock timed-shelve expiry via `RunShelvingCheckForTest`, `ConfirmAsync`/`TimedShelveAsync`/`UnshelveAsync`/`EnableAsync` end-to-end, subscriber-exception isolation, store-failure invariant, second-`LoadAsync` timer-leak regression, and `AreInputsReady` Bad/Uncertain guard; exposed `RunShelvingCheckForTest()` internal hook on the engine.
@@ -0,0 +1,338 @@
# Code Review — Core.Scripting
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.Scripting-004, Core.Scripting-005 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.Scripting-006 |
| 4 | Error handling & resilience | Core.Scripting-007 |
| 5 | Security | Core.Scripting-001, Core.Scripting-002, Core.Scripting-003 |
| 6 | Performance & resource management | Core.Scripting-008 |
| 7 | Design-document adherence | Core.Scripting-009 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Core.Scripting-010, Core.Scripting-011 |
| 10 | Documentation & comments | No issues found |
## Findings
### Core.Scripting-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Security |
| Location | `ForbiddenTypeAnalyzer.cs:45`, `ScriptSandbox.cs:54` |
| Status | Resolved |
**Description:** `System.Environment` lives in the allowed `System` namespace (it is in
`System.Private.CoreLib`, which is allow-listed for primitives) and is not on the
forbidden-namespace deny-list. Nothing prevents an operator-authored script from calling
`System.Environment.Exit(0)` or `System.Environment.FailFast("...")`. Both terminate the
host process immediately. Because scripted-alarm predicates and virtual-tag scripts run
in-process in the main OPC UA server (decision: "Scripting engine runs in the main .NET 10
server process"), a single malicious or buggy predicate brings down the entire server —
an outage affecting every connected client and every driver. `ScriptSandboxTests` only
pins the *read* path (`Environment.GetEnvironmentVariable`) as an accepted compromise; the
process-killing members are not considered. The whole-process kill far exceeds the
"read-only process state" justification the test comments rely on.
**Recommendation:** The forbidden surface must be member-granular, not namespace-granular,
for types in allowed namespaces. Add an explicit forbidden-member deny-list to
`ForbiddenTypeAnalyzer` covering at minimum `System.Environment.Exit`,
`System.Environment.FailFast`, `System.AppDomain`, `System.GC` (e.g. `GC.Collect`,
`GC.AddMemoryPressure`), and `System.Activator.CreateInstance` (a reflection-equivalent
escape). Reject these in `CheckSymbol` by resolved method symbol, with a test for each.
**Resolution:** Resolved 2026-05-22 — added a type-granular `ForbiddenFullTypeNames`
deny-list (`System.Environment`, `System.AppDomain`, `System.GC`, `System.Activator`) to
`ForbiddenTypeAnalyzer`; `CheckSymbol` now rejects any resolved type symbol whose
fully-qualified name matches, alongside the existing namespace-prefix check, so dangerous
`System`-namespace process-control types are blocked at compile while legitimate `System`
types (Math, String, …) stay usable. Regression tests added in `ScriptSandboxTests` for
`Environment.Exit` / `Environment.FailFast` / `Environment.GetEnvironmentVariable` /
`AppDomain` / `GC.Collect` / `Activator.CreateInstance`.
### Core.Scripting-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `ForbiddenTypeAnalyzer.cs:70` |
| Status | Resolved |
**Description:** The syntax walker only inspects four node kinds:
`ObjectCreationExpressionSyntax`, `InvocationExpressionSyntax` with a member-access target,
`MemberAccessExpressionSyntax`, and bare `IdentifierNameSyntax`. It never visits
`TypeOfExpressionSyntax`, generic type-argument lists (`GenericNameSyntax` /
`TypeArgumentListSyntax`), cast expressions (`CastExpressionSyntax`), `is`/`as` type
patterns, `default(T)` expressions, array-creation element types, or the declared types
of `using` and local variable declarations. A script such as `typeof(System.IO.File)`,
`new System.Collections.Generic.List<System.IO.FileInfo>()`,
`(System.IO.Stream)null`, or `default(System.Reflection.Assembly)` references a forbidden
type without ever producing a node the walker examines, so the forbidden-type check is
bypassed. The Phase 7 plan A.6 explicitly calls out `typeof` as a sandbox-escape attempt
that "must fail at compile" — it currently does not.
**Recommendation:** Walk every `TypeSyntax` node (handle `TypeOfExpressionSyntax`,
`CastExpressionSyntax`, generic argument lists, and the type operand of
`IsPatternExpression` / binary `as`). The simplest robust fix is to enumerate all
`DescendantNodes()` and, for any node, resolve both `GetSymbolInfo` and `GetTypeInfo`,
then check the resolved type plus every type argument. Add tests covering `typeof`,
generic arguments, casts, and `default(T)` with forbidden types.
**Resolution:** Resolved 2026-05-22 — `ForbiddenTypeAnalyzer.Analyze` now runs a second
pass that resolves `GetTypeInfo` on every `TypeSyntax` node and recursively unwraps array
element types and generic type arguments, so forbidden types named via `typeof`, generic
arguments (`List<FileInfo>`), casts, `is`/`as` patterns, `default(T)`, array-creation
element types, and explicitly-typed local declarations are all rejected at compile. The
original member/call node-kind switch is kept (deliberately narrow to avoid flagging
inherited members such as `typeof(int).Name`), and a span+type dedupe prevents duplicate
rejections from the two passes. Regression tests added in `ScriptSandboxTests` for each
node form plus over-block guards for allowed generics/`typeof`.
### Core.Scripting-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `TimedScriptEvaluator.cs:9`, `ScriptSandbox.cs:30` |
| Status | Resolved |
**Description:** There is no bound on memory a script may allocate or on the number of
threads/tasks a script may spawn. The class docs acknowledge unbounded memory as "a budget
concern" deferred to v3, but in-process execution means a script doing
`new byte[int.MaxValue]` repeatedly (or `Enumerable.Range(0,int.MaxValue).ToList()` — LINQ
is allow-listed) can drive the whole server to `OutOfMemoryException`, an outage. The
timeout does not help: the allocation can exhaust memory well before 250 ms elapses, and
the orphaned thread-pool thread documented in `TimedScriptEvaluator` keeps the allocation
rooted. `System.Threading.Tasks` is not on the deny-list, so a script can also
`Task.Run` an unbounded fan-out of background work that outlives the timeout entirely.
**Recommendation:** At minimum, document this as a known accepted risk in
`docs/ScriptedAlarms.md` / `docs/VirtualTags.md` rather than only in a code comment, and
add the `System.Threading.Tasks` namespace, which contains `Task` and `Parallel`, to the
forbidden list (scripts are synchronous predicates with no legitimate need to start
background tasks). For memory, gate
script authoring behind an Admin permission and treat the test-harness preview as the
control point, or track an explicit v3 issue for out-of-process execution. Record the
decision so it is not silently lost.
**Resolution:** Resolved 2026-05-22 — added `System.Threading.Tasks` to `ForbiddenNamespacePrefixes` (blocking `Task.Run` / `Parallel` fan-out); documented the unbounded-memory accepted risk and the `Task` denial rationale in `docs/VirtualTags.md` (new "Known resource limits" subsection) and cross-referenced from `docs/ScriptedAlarms.md`.
### Core.Scripting-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `DependencyExtractor.cs:73` |
| Status | Resolved |
**Description:** The walker matches tag-access calls purely by spelling — any
`InvocationExpressionSyntax` whose member name is `GetTag` or `SetVirtualTag` is treated as
a `ScriptContext` tag access, regardless of the receiver. A script that defines a local
type with a `GetTag(string)` method and calls `other.GetTag("X")`, or calls
`this.GetTag(...)` on a script-defined helper, has spurious dependencies harvested (or, if
the path argument is not a string literal, spurious rejections raised). The XML remarks claim "as long
as it's not on the ctx instance, the extractor doesn't pick it up", but the code does not
check that the receiver is the `ctx` identifier — it accepts any member access with the
matching name. The `DependencyExtractorTests.Ignores_non_ctx_method_named_GetTag` test
passes only because the helper there is a *free* function (not member-access form); a
member-access call to a non-ctx `GetTag` is untested and would be misattributed.
**Recommendation:** In `VisitInvocationExpression`, additionally require that
`member.Expression` is an `IdentifierNameSyntax` with `Identifier.ValueText == "ctx"`
(matching the `ScriptGlobals<TContext>.ctx` field name). Add a test for
`someOtherObject.GetTag("X")` asserting it is ignored.
**Resolution:** Resolved 2026-05-22 — `VisitInvocationExpression` now additionally checks that `member.Expression` is an `IdentifierNameSyntax` with `ValueText == "ctx"` before treating the call as a dependency; test `Ignores_member_access_GetTag_on_non_ctx_receiver` added to `DependencyExtractorTests`.
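A reduced sketch of the receiver check, assuming the `Microsoft.CodeAnalysis.CSharp` package is referenced. `CtxReceiverSketch.ExtractReads` is a hypothetical helper that uses a plain `DescendantNodes()` scan rather than the extractor's visitor, and it only handles `GetTag`.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class CtxReceiverSketch
{
    // Collects literal tag paths from calls of the exact form ctx.GetTag("...").
    public static IReadOnlyList<string> ExtractReads(string source)
    {
        var root = CSharpSyntaxTree.ParseText(source).GetRoot();
        var paths = new List<string>();

        foreach (var call in root.DescendantNodes().OfType<InvocationExpressionSyntax>())
        {
            // Must be member-access form: <receiver>.GetTag(...)
            if (call.Expression is not MemberAccessExpressionSyntax member) continue;
            if (member.Name.Identifier.ValueText != "GetTag") continue;

            // The receiver must be the bare identifier `ctx`; anything else is ignored.
            if (member.Expression is not IdentifierNameSyntax { Identifier.ValueText: "ctx" }) continue;

            var args = call.ArgumentList.Arguments;
            if (args.Count == 1
                && args[0].Expression is LiteralExpressionSyntax literal
                && literal.IsKind(SyntaxKind.StringLiteralExpression))
            {
                paths.Add(literal.Token.ValueText);
            }
        }

        return paths;
    }
}
```

This is a purely syntactic check, so a script that shadows `ctx` with its own local would still match; a semantic-model lookup of the receiver symbol would close that gap at the cost of needing a compilation.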
### Core.Scripting-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `DependencyExtractor.cs:97` |
| Status | Open |
**Description:** A raw string literal (the triple-quoted form) passed as the tag path
tokenizes as `SingleLineRawStringLiteralToken` /
`MultiLineRawStringLiteralToken`, not `StringLiteralToken`. The check
`literal.Token.IsKind(SyntaxKind.StringLiteralToken)` therefore rejects an
otherwise-static raw-string path as a non-literal "dynamic path", producing a misleading
rejection message. This is an edge case (operators rarely write raw strings for tag
paths) but the error text would confuse anyone who does.
**Recommendation:** Accept all string-literal token kinds — check
`literal.IsKind(SyntaxKind.StringLiteralExpression)` on the expression node, or include
the raw-string token kinds, so a static raw string is harvested rather than rejected.
**Resolution:** _(open)_
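The expression-kind check from the recommendation can be sketched as follows, assuming the `Microsoft.CodeAnalysis.CSharp` package at a version that understands raw string literals. `StaticPathSketch.TryGetStaticPath` is a hypothetical helper that parses a single argument expression in isolation.

```csharp
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

public static class StaticPathSketch
{
    // Accepts regular, verbatim, and raw string literals by testing the
    // expression kind instead of one specific token kind.
    public static bool TryGetStaticPath(string argumentText, out string path)
    {
        var expression = SyntaxFactory.ParseExpression(argumentText);

        if (expression is LiteralExpressionSyntax literal
            && literal.IsKind(SyntaxKind.StringLiteralExpression))
        {
            path = literal.Token.ValueText; // decoded content, without quotes
            return true;
        }

        path = string.Empty;
        return false;
    }
}
```

Anything that is not a single string literal, such as a concatenation or an interpolated string, still falls through to the "dynamic path" rejection, which keeps the existing static-analysis guarantee intact.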
### Core.Scripting-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `CompiledScriptCache.cs:55` |
| Status | Open |
**Description:** On a failed compile the `catch` block calls
`_cache.TryRemove(key, out _)` without a value comparison. If two threads race a miss for
the same bad source, both observe the same faulted `Lazy` and throw, and both call
`TryRemove(key)`. If a concurrent retry re-adds a new `Lazy` for that key between the two
removals, the second unconditional `TryRemove` could evict the in-flight retry entry. The
window is small and the consequence is only a redundant recompile, so severity is Low —
but the removal should be key+value scoped for correctness.
**Recommendation:** Use the `ConcurrentDictionary.TryRemove(KeyValuePair<,>)` overload to
remove only the specific faulted `Lazy` instance, so a concurrently re-added entry is not
evicted.
**Resolution:** _(open)_
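A sketch of the key+value scoped removal, using the `ConcurrentDictionary<TKey,TValue>.TryRemove(KeyValuePair<TKey,TValue>)` overload available since .NET 5. `LazyCacheSketch` is a hypothetical reduction of the cache in which an `int` stands in for the compiled evaluator.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class LazyCacheSketch
{
    private readonly ConcurrentDictionary<string, Lazy<int>> _cache = new();

    public int Count => _cache.Count;

    public int GetOrCompile(string key, Func<string, int> compile)
    {
        var lazy = _cache.GetOrAdd(key, k => new Lazy<int>(() => compile(k)));
        try
        {
            return lazy.Value;
        }
        catch
        {
            // Remove the entry only if it is still this exact faulted Lazy.
            // A fresh Lazy re-added by a concurrent retry compares unequal
            // (reference equality) and is left in place.
            _cache.TryRemove(new KeyValuePair<string, Lazy<int>>(key, lazy));
            throw;
        }
    }
}
```

The comparison uses the default equality comparer for `Lazy<int>`, which is reference equality, so the removal is scoped to the specific instance this caller observed.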
### Core.Scripting-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `TimedScriptEvaluator.cs:60` |
| Status | Resolved |
**Description:** `RunAsync` wraps the inner run in `Task.Run(...)` and then awaits
`WaitAsync(Timeout, ct)`. If the caller-supplied `ct` cancels at roughly the same time the
timeout elapses, the order in which `WaitAsync` observes the timeout vs. the cancellation
is non-deterministic, so the same shutdown can sometimes surface as
`ScriptTimeoutException` and sometimes as `OperationCanceledException`. The class docs
assert "the caller's cancel wins" as a hard guarantee that the virtual-tag engine shutdown
path depends on to avoid misclassifying shutdown as a script fault — but the
implementation does not guarantee it when both fire close together.
**Recommendation:** After catching `TimeoutException`, check `ct.IsCancellationRequested`
and throw `OperationCanceledException(ct)` instead of `ScriptTimeoutException` when the
caller's token is cancelled, so caller cancellation deterministically wins regardless of
race ordering.
**Resolution:** Resolved 2026-05-22 — in the `catch (TimeoutException)` handler, `ct.IsCancellationRequested` is now checked and `OperationCanceledException(ct)` thrown before `ScriptTimeoutException`, so caller cancellation deterministically wins regardless of race ordering; regression test `Caller_cancellation_wins_even_when_timeout_fires_first` added to `TimedScriptEvaluatorTests`.
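The ordering described in this resolution can be sketched as below. `TimedRunSketch` and `ScriptTimeoutSketchException` are hypothetical stand-ins for `TimedScriptEvaluator` and `ScriptTimeoutException`; the script is modelled as a plain `Func<T>`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the module's ScriptTimeoutException.
public sealed class ScriptTimeoutSketchException : Exception
{
    public ScriptTimeoutSketchException(TimeSpan timeout)
        : base($"Script exceeded {timeout.TotalMilliseconds} ms") { }
}

public static class TimedRunSketch
{
    public static async Task<T> RunAsync<T>(Func<T> script, TimeSpan timeout, CancellationToken ct)
    {
        try
        {
            // WaitAsync abandons the wait on timeout or cancellation;
            // the script itself keeps running on its thread-pool thread.
            return await Task.Run(script, CancellationToken.None).WaitAsync(timeout, ct);
        }
        catch (TimeoutException)
        {
            // If the caller cancelled at about the same time, report cancellation,
            // so shutdown is never misclassified as a script fault.
            ct.ThrowIfCancellationRequested();
            throw new ScriptTimeoutSketchException(timeout);
        }
    }
}
```

Whichever of the two conditions `WaitAsync` happens to observe first, a cancelled caller token now always surfaces as `OperationCanceledException`.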
### Core.Scripting-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `CompiledScriptCache.cs:34`, `ScriptEvaluator.cs:34` |
| Status | Open |
**Description:** `CompiledScriptCache` has no capacity bound (acknowledged in the class
remarks) and no eviction. Each cached `ScriptEvaluator` holds a Roslyn `ScriptRunner<T>`
delegate, which keeps the dynamically emitted script assembly loaded for the process
lifetime — emitted assemblies in the default `AssemblyLoadContext` cannot be unloaded.
`Clear()` drops the dictionary entries but does **not** unload the emitted assemblies;
they leak. Across many config-generation publishes (each `Clear()` followed by recompiling
every script), the process accumulates dead script assemblies. For the expected "low
thousands" of scripts this is benign, but a long-running server with frequent publishes
will see steady managed-memory growth that never returns.
**Recommendation:** Document the per-publish assembly accretion as a known limitation, or
compile scripts into a collectible `AssemblyLoadContext` so `Clear()` can unload prior
generations. At minimum add a note to `docs/ScriptedAlarms.md` so operators with
high-publish-frequency deployments are aware.
**Resolution:** _(open)_
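The collectible-context option might be shaped like the sketch below. It assumes scripts are emitted to a byte array first (for example via a Roslyn compilation's emit step) and then loaded; the current `ScriptRunner<T>` path does not work this way, so `ScriptGenerationSketch` is a hypothetical design, not a drop-in change.

```csharp
using System.IO;
using System.Reflection;
using System.Runtime.Loader;

public sealed class ScriptGenerationSketch
{
    private AssemblyLoadContext _current = NewContext();

    public bool IsCollectible => _current.IsCollectible;

    private static AssemblyLoadContext NewContext() =>
        new AssemblyLoadContext("scripts", isCollectible: true);

    // Loads one emitted script assembly into the current generation's context.
    public Assembly Load(byte[] emittedAssembly)
    {
        using var stream = new MemoryStream(emittedAssembly);
        return _current.LoadFromStream(stream);
    }

    // Starts a new generation and requests unload of the previous one.
    public void Clear()
    {
        var previous = _current;
        _current = NewContext();

        // Unload only initiates collection: the assemblies are released once
        // nothing references their types or delegates and the GC has run.
        previous.Unload();
    }
}
```

Unloading is cooperative, so any cached delegate or instance from the old generation keeps it alive; `Clear()` would need to drop those references before the memory can return.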
### Core.Scripting-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `ForbiddenTypeAnalyzer.cs:45` |
| Status | Open |
**Description:** The Phase 7 plan decision #6
(`docs/v2/implementation/phase-7-scripting-and-alarming.md`) enumerates the forbidden
surface as "No HttpClient / File / Process / reflection". `ForbiddenTypeAnalyzer` actually
denies a broader set — `System.Threading.Thread`, `System.Runtime.InteropServices`, and
`Microsoft.Win32` (registry) — which is sensible hardening but is undocumented in the plan
and in `docs/ScriptedAlarms.md` (which defers sandbox rules to `VirtualTags.md`). An
operator reading the design docs cannot predict that a registry or interop reference will
be rejected. Conversely the plan does not record the `System.Environment` /
`System.Diagnostics` decisions. The code and the design document have drifted.
**Recommendation:** Update the plan's decision #6 (or `docs/VirtualTags.md`) to list the
authoritative deny-list exactly as `ForbiddenTypeAnalyzer.ForbiddenNamespacePrefixes`
defines it, including the `System.Environment` allowed-compromise, so the docs match the
code.
**Resolution:** _(open)_
### Core.Scripting-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/ScriptSandboxTests.cs:54` |
| Status | Resolved |
**Description:** The sandbox-escape test suite covers only the four obvious vectors
(File / Http / Process / Reflection) as direct member-access calls. It does not test:
`typeof(forbidden)`, generic type arguments (`List<FileInfo>`), cast expressions to
forbidden types, `System.Environment.Exit` / `FailFast`, `System.Threading.Thread`,
`System.Runtime.InteropServices`, `Microsoft.Win32` registry access, `Activator`, or
`System.AppDomain`. Given that the analyzer is the sole security boundary for in-process
untrusted-script execution, the gaps in Core.Scripting-001 and Core.Scripting-002 went
undetected precisely because no test exercises those forms. The Phase 7 plan A.6 mandated
"sandbox escape tests" but the implemented set is materially narrower than the threat
surface.
**Recommendation:** Add a parameterised escape-test covering every node form in
Core.Scripting-002 and every forbidden namespace/member in Core.Scripting-001. Each must
assert a `ScriptSandboxViolationException` (or `CompilationErrorException`) at compile.
**Resolution:** Resolved 2026-05-22 — added `ScriptSandboxTests` cases for `System.Threading.Thread`, `System.Threading.Tasks.Task.Run`, `System.Runtime.InteropServices.Marshal`, and `Microsoft.Win32.Registry` (the four namespace-deny-list vectors that had no test); the 001/002 vectors (Environment.Exit/FailFast/AppDomain/GC/Activator, typeof, generics, cast, default(T), is/as, array element, declared variable) were already covered by the -001/-002 resolution commits. All 79 tests pass.
### Core.Scripting-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/` |
| Status | Open |
**Description:** Two source files have no direct test coverage: `ScriptContext`
(`Deadband` static helper is exercised only indirectly through `ScriptSandboxTests`, and
not for its boundary `tolerance` behaviour) and `ScriptSandbox.Build` itself (the
`ArgumentNullException` / `ArgumentException` guards on `contextType` at
`ScriptSandbox.cs:45-48` are never asserted). `ScriptLogCompanionSink` and
`ScriptLoggerFactory` have tests, but there is no test that a script's `ctx.Logger` Error
emission surfaces via the companion sink end-to-end (factory + sink integration is
untested). These are minor gaps but leave guard clauses and the logging integration
unverified.
**Recommendation:** Add unit tests for `ScriptSandbox.Build` argument validation, for
`ScriptContext.Deadband` at and around the tolerance boundary, and an end-to-end test that
a script logging at Error level produces both a `scripts-*.log` event and a companion
Warning event.
**Resolution:** _(open)_
@@ -0,0 +1,359 @@
# Code Review - Core.VirtualTags
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 7 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.VirtualTags-001, Core.VirtualTags-002, Core.VirtualTags-003, Core.VirtualTags-004 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.VirtualTags-005, Core.VirtualTags-006 |
| 4 | Error handling & resilience | Core.VirtualTags-007 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Core.VirtualTags-008, Core.VirtualTags-009 |
| 7 | Design-document adherence | Core.VirtualTags-001, Core.VirtualTags-010 |
| 8 | Code organization & conventions | Core.VirtualTags-011 |
| 9 | Testing coverage | Core.VirtualTags-012 |
| 10 | Documentation & comments | Core.VirtualTags-010, Core.VirtualTags-013 |
## Findings
### Core.VirtualTags-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:306` |
| Status | Resolved |
**Description:** `OnScriptSetVirtualTag` updates `_valueCache`, notifies observers, and
records history for the written path, but it does not schedule a cascade for tags that
depend on the written path. `docs/VirtualTags.md` (VirtualTagContext section) explicitly
states `SetVirtualTag(path, value)` "routes through the engine's `OnScriptSetVirtualTag`
callback so cross-tag writes still participate in change-trigger cascades." They do not.
A script that writes `ctx.SetVirtualTag("Target", x)` updates Target's cached value, but
any virtual tag whose script reads Target via `ctx.GetTag("Target")` and is
`ChangeTriggered = true` is never re-evaluated. Downstream virtual tags go stale until
some unrelated trigger fires. The existing test
`SetVirtualTag_within_script_updates_target_and_triggers_observers` only asserts the
target itself updates and never exercises a tag depending on the target, so the gap is
not caught.
**Recommendation:** Either (a) launch a fire-and-forget `CascadeAsync(path, ...)` from
`OnScriptSetVirtualTag` (note `EvaluateInternalAsync` acquires the non-reentrant
`_evalGate`, so the cascade must be scheduled, not invoked inline while the gate is
held), or (b) if cascading from a script write is intentionally unsupported, correct the
documentation and `VirtualTagContext` XML doc to say so. Decide deliberately and make
code and docs agree.
**Resolution:** Resolved 2026-05-22 — `OnScriptSetVirtualTag` now launches a fire-and-forget `CascadeAsync(path, ...)` after updating the cache, mirroring `OnUpstreamChange`, so change-triggered dependents of a script-written tag are re-evaluated; added a regression test.
### Core.VirtualTags-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:237` |
| Status | Resolved |
**Description:** The cold-start guard `if (!AreInputsReady(ctxCache)) return;` silently
abandons the evaluation when any input is null or Bad-quality. For a chained virtual tag
(C depends on B depends on driver tag A), if A is still Bad at startup, B is skipped --
leaving B's `_valueCache` entry absent. When C evaluates, `BuildReadCache` falls through
to `_upstream.ReadTag("B")` for the missing virtual path, which returns BadNodeIdUnknown
quality, so C is also skipped. That is acceptable for cold start, but the same guard
means a virtual tag that legitimately consumes a Bad-quality upstream (e.g. a script
written to detect comms loss and emit a fallback) can never run -- it is permanently
frozen at its prior value with no diagnostic. The tag also never transitions to a Bad
quality of its own, so an OPC UA client cannot distinguish "not yet computed" from
"computing fine."
**Recommendation:** Make the cold-start behaviour explicit: when inputs are not ready,
publish a Bad-quality snapshot (e.g. BadWaitingForInitialData, 0x80320000) for the tag
rather than returning with no state change, so clients see a defined quality. If
operators need scripts that handle Bad upstreams, consider a per-definition opt-out of
the readiness guard.
**Resolution:** Resolved 2026-05-22 — cold-start guard now publishes `BadWaitingForInitialData` (0x80320000) and notifies observers instead of silently returning, so OPC UA clients see a defined quality rather than a stale prior value.
### Core.VirtualTags-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:117-120` |
| Status | Resolved |
**Description:** The upstream-subscription loop in `Load` iterates
`definitions.SelectMany(d => _tags[d.Path].Reads)`. If `definitions` contains two rows
with the same Path, the first registers `_tags[Path]` and the second overwrites it, but
`definitions` still has two entries -- `_tags[d.Path]` is indexed by the second row for
both iterations, so the first row's distinct upstream reads are silently dropped. More
importantly, a duplicate Path in the input list is never rejected at all:
`_tags[def.Path] = ...` and `_graph.Add(def.Path, ...)` both overwrite without warning,
so one of two operator-authored tags with a colliding UNS path vanishes with no error.
`Load` is documented as throwing an aggregated error for every problem; a duplicate path
should be in that set.
**Recommendation:** Detect duplicate Path values while iterating `definitions` and add
them to `compileFailures` (or a dedicated rejection list) so the aggregated
`InvalidOperationException` reports them. Separately, iterate `_tags.Values` rather than
`definitions.SelectMany(d => _tags[d.Path]...)` when collecting upstream paths so the
collection is keyed off the registered set, not the raw input list.
**Resolution:** Resolved 2026-05-22 — `Load` now tracks seen paths and adds a duplicate-path entry to `compileFailures`; the upstream-subscription loop iterates `_tags.Values` instead of the raw `definitions` list so it is keyed off the registered set.
### Core.VirtualTags-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:349` |
| Status | Open |
**Description:** `CoerceResult`'s switch has a default arm (`_ => raw`) that returns the
script's raw return value uncoerced for any `DriverDataType` not in the explicit list
(e.g. an array type, Byte, or a future enum member). The resulting `DataValueSnapshot`
then carries a value whose CLR type does not match the node's declared OPC UA data type,
which the node manager will surface as a wire-level type mismatch or a silently wrong
value. The doc claims a mismatch surfaces as BadTypeMismatch, but an unhandled
`DriverDataType` bypasses coercion entirely.
**Recommendation:** Make the default arm explicit -- either throw / return null (which
the outer pipeline maps to BadInternalError) for an unsupported `DriverDataType`, or
document precisely which `DriverDataType` values `CoerceResult` supports and validate at
`Load` time that no definition declares an unsupported type.
**Resolution:** _(open)_
### Core.VirtualTags-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagSource.cs:50-64` |
| Status | Resolved |
**Description:** `SubscribeAsync` registers the per-path engine observers first (lines
52-56), then in a second loop reads the current value and fires the initial-data
callback (lines 60-64). Between those two loops an upstream change can cascade and the
engine can invoke the just-registered observer with a new value. The OPC UA client then
receives the real change event followed by the initial-data event carrying the older
`engine.Read(path)` snapshot -- out-of-order delivery, and the client's last-known value
ends up stale.
**Recommendation:** Capture the current snapshot and fire the initial-data callback for
each path before registering the change observer for that path (or hold a per-handle
lock spanning both so no engine callback interleaves). The initial value must be
delivered before any subsequent change for that path.
**Resolution:** Resolved 2026-05-22 — `SubscribeAsync` now fires the initial-data callback per path before registering the change observer for that path, eliminating the out-of-order delivery race.
### Core.VirtualTags-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:177-182`, `:395-401` |
| Status | Open |
**Description:** `Subscribe` does `_observers.GetOrAdd(path, _ => [])` then
`lock (list) { list.Add(observer); }`. When `Unsub.Dispose` removes the last observer,
the now-empty List is left in `_observers` and the dictionary entry is never removed.
For a long-running server with churning OPC UA subscriptions this is an unbounded (if
slow) growth of empty lists. There is also a benign-but-real race: a thread can call
`GetOrAdd` and obtain a list reference that another thread's `Dispose` is about to leave
empty in the map -- not a correctness bug today because the list object is still valid,
but it makes any future "prune empty entries" logic racy.
**Recommendation:** Either accept the unbounded map and document it, or have
`Unsub.Dispose` remove the dictionary entry when the list becomes empty under the same
lock, re-checking emptiness inside the lock to avoid dropping a concurrently-added
observer.
**Resolution:** _(open)_
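One way the second option could look, as a minimal sketch rather than the engine's actual code: `ObserverRegistry`, `PathCount`, and the `Action<double>` observer shape are hypothetical stand-ins for the `_observers` map described above. The assumption is that a list is only ever removed from the map while its own lock is held, which lets `Subscribe` detect a pruned list and retry.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical stand-in for the engine's observer map; names are illustrative only.
public sealed class ObserverRegistry
{
    private readonly ConcurrentDictionary<string, List<Action<double>>> _observers = new();

    // Number of paths that currently have a list in the map (for diagnostics and tests).
    public int PathCount => _observers.Count;

    public IDisposable Subscribe(string path, Action<double> observer)
    {
        while (true)
        {
            var list = _observers.GetOrAdd(path, _ => new List<Action<double>>());
            lock (list)
            {
                // Entries are removed only while holding the list's lock, so if the map
                // still points at this list here, it cannot be pruned before we add to it.
                if (_observers.TryGetValue(path, out var current) && ReferenceEquals(current, list))
                {
                    list.Add(observer);
                    return new Unsub(this, path, list, observer);
                }
            }
            // The list was pruned between GetOrAdd and the lock; retry with a fresh one.
        }
    }

    private sealed class Unsub : IDisposable
    {
        private readonly ObserverRegistry _owner;
        private readonly string _path;
        private readonly List<Action<double>> _list;
        private readonly Action<double> _observer;
        private bool _disposed;

        public Unsub(ObserverRegistry owner, string path, List<Action<double>> list, Action<double> observer)
        {
            _owner = owner;
            _path = path;
            _list = list;
            _observer = observer;
        }

        public void Dispose()
        {
            lock (_list)
            {
                if (_disposed) return;
                _disposed = true;
                _list.Remove(_observer);
                if (_list.Count == 0)
                {
                    // Removes the entry only if the key still maps to this exact list.
                    _owner._observers.TryRemove(
                        new KeyValuePair<string, List<Action<double>>>(_path, _list));
                }
            }
        }
    }
}
```

The emptiness check and the removal happen under the same lock that guards `Add`, so a concurrent subscriber either adds before the check (and the entry survives) or observes that the list is no longer mapped and retries.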
### Core.VirtualTags-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/TimerTriggerScheduler.cs:58` |
| Status | Open |
**Description:** `Tick` calls
`_engine.EvaluateOneAsync(p, _cts.Token).GetAwaiter().GetResult()`, blocking the
`System.Threading.Timer` callback thread (a thread-pool thread) for the full duration of
the evaluation. Because `EvaluateInternalAsync` serialises all tags through `_evalGate`,
a timer tick that races a long change-trigger cascade blocks until the cascade drains.
With multiple interval groups, several timer callbacks can each pin a thread-pool thread
waiting on the same gate. A group of N tags can take N times the script timeout while
holding a pool thread, and under timer re-entrancy (a tick firing again before the prior
finished) this compounds.
**Recommendation:** Make `Tick` async-aware -- store the returned Task and skip a tick
if the previous one for that group is still running (a per-group "in flight" flag),
rather than blocking synchronously. At minimum, document the blocking behaviour and the
expected upper bound on group evaluation time relative to the interval.
**Resolution:** _(open)_
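The per-group "in flight" flag could be sketched as follows. This is an assumption-laden illustration, not `TimerTriggerScheduler` itself: `NonOverlappingTick`, `SkippedTicks`, and the `evaluateGroup` delegate are hypothetical names standing in for one interval group and its evaluation loop.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper for one interval group; not the scheduler's real type.
public sealed class NonOverlappingTick
{
    private readonly Func<CancellationToken, Task> _evaluateGroup;
    private int _inFlight; // 0 = idle, 1 = an evaluation is running
    private int _skipped;

    public NonOverlappingTick(Func<CancellationToken, Task> evaluateGroup)
    {
        _evaluateGroup = evaluateGroup ?? throw new ArgumentNullException(nameof(evaluateGroup));
    }

    // How many ticks were dropped because the previous run had not finished.
    public int SkippedTicks => Volatile.Read(ref _skipped);

    // Called from the timer callback. Returns immediately in both cases; the returned
    // task completes when the started run finishes (or at once for a skipped tick).
    public Task Tick(CancellationToken ct)
    {
        if (Interlocked.CompareExchange(ref _inFlight, 1, 0) != 0)
        {
            Interlocked.Increment(ref _skipped);
            return Task.CompletedTask;
        }
        return RunAsync(ct);
    }

    private async Task RunAsync(CancellationToken ct)
    {
        try
        {
            await _evaluateGroup(ct).ConfigureAwait(false);
        }
        catch (OperationCanceledException)
        {
            // Shutdown is expected; other exceptions surface on the returned task.
        }
        finally
        {
            Volatile.Write(ref _inFlight, 0);
        }
    }
}
```

A timer callback would call `Tick` without waiting on the result, so the pool thread is released while the evaluation awaits the gate. The caller still needs to observe or log faults on the returned task.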
### Core.VirtualTags-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/DependencyGraph.cs:81-115` |
| Status | Resolved |
**Description:** `TransitiveDependentsInOrder` calls `TopologicalSort()` (a full O(V+E)
Kahn pass plus a Dictionary rank build) on every invocation, and it is invoked from
`CascadeAsync` on every upstream change event (`OnUpstreamChange`). On a large graph with
high-rate upstream tags this re-sorts the entire dependency graph on every protocol-rate
delta -- pure waste, since the topological order is immutable between `Load` calls. The
DFS that collects dependents is itself fine; only the repeated sort is the cost.
**Recommendation:** Compute the topological order (and the rank dictionary) once at the
end of `Load` and cache it on `DependencyGraph` (invalidated by `Add` / `Clear`).
`TransitiveDependentsInOrder` then reuses the cached rank map. This turns a per-event
O(V+E) cost into an O(closure) cost.
**Resolution:** Resolved 2026-05-22 — `DependencyGraph` now caches the topological rank dictionary (invalidated by `Add`/`Clear`) via `GetOrBuildRank()`; `TransitiveDependentsInOrder` reuses it, reducing per-change-event cost from O(V+E) to O(closure).
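The caching idea can be illustrated with a small stand-alone graph. This is a sketch under assumptions, not the `DependencyGraph` implementation: `RankCachingGraph` and `SortCount` are hypothetical, the graph is append-only for brevity, and the rank map is rebuilt lazily on first use after `Add`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, append-only stand-in for a dependency graph with a cached rank map.
public sealed class RankCachingGraph
{
    private readonly Dictionary<string, HashSet<string>> _dependsOn = new();  // node -> its inputs
    private readonly Dictionary<string, List<string>> _dependents = new();    // input -> nodes reading it
    private Dictionary<string, int>? _rank;                                   // null = needs rebuild

    // Number of full topological sorts performed (exposed so the caching is observable).
    public int SortCount { get; private set; }

    public void Add(string node, IEnumerable<string> dependencies)
    {
        var deps = new HashSet<string>(dependencies);
        _dependsOn.Add(node, deps); // sketch simplification: a duplicate node throws
        foreach (var d in deps)
        {
            if (!_dependents.TryGetValue(d, out var readers))
                _dependents[d] = readers = new List<string>();
            readers.Add(node);
        }
        _rank = null; // any structural change invalidates the cached order
    }

    public IReadOnlyList<string> TransitiveDependentsInOrder(string changed)
    {
        var rank = _rank ??= BuildRank();
        var seen = new HashSet<string>();
        var stack = new Stack<string>();
        stack.Push(changed);
        while (stack.Count > 0)
        {
            if (!_dependents.TryGetValue(stack.Pop(), out var readers)) continue;
            foreach (var r in readers)
                if (seen.Add(r)) stack.Push(r);
        }
        // Only the affected closure is ordered, using the cached ranks.
        return seen.OrderBy(n => rank[n]).ToList();
    }

    // Kahn's algorithm over the whole graph; runs once per structural change.
    private Dictionary<string, int> BuildRank()
    {
        SortCount++;
        var inDegree = new Dictionary<string, int>();
        foreach (var (node, deps) in _dependsOn)
        {
            inDegree[node] = deps.Count;
            foreach (var d in deps) inDegree.TryAdd(d, 0); // leaf inputs have no inputs of their own
        }
        var ready = new Queue<string>(inDegree.Where(kv => kv.Value == 0).Select(kv => kv.Key));
        var rank = new Dictionary<string, int>();
        while (ready.Count > 0)
        {
            var n = ready.Dequeue();
            rank[n] = rank.Count;
            if (!_dependents.TryGetValue(n, out var readers)) continue;
            foreach (var r in readers)
                if (--inDegree[r] == 0) ready.Enqueue(r);
        }
        if (rank.Count != inDegree.Count)
            throw new InvalidOperationException("Dependency cycle detected.");
        return rank;
    }
}
```

Repeated calls between structural changes pay only for the DFS and the sort of the affected closure; the full Kahn pass runs again only after `Add`.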
### Core.VirtualTags-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/DependencyGraph.cs:64-65`, `:72-73` |
| Status | Open |
**Description:** `DirectDependencies` and `DirectDependents` allocate a fresh empty
`HashSet<string>` on every call for an unregistered node. `DirectDependents` is called
inside the `TopologicalSort` Kahn loop and the `CascadeAsync` DFS, so for a graph with
many leaf driver tags this allocates a throwaway set per leaf per sort. Minor, but it is
on the change-cascade path.
**Recommendation:** Return a shared static empty set for the miss case instead of
allocating each time.
**Resolution:** _(open)_
### Core.VirtualTags-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/ITagUpstreamSource.cs:18`, `VirtualTagContext.cs:30`, `VirtualTagDefinition.cs:28` |
| Status | Open |
**Description:** Several XML docs reference component names that do not exist in the
codebase. `ITagUpstreamSource` XML doc says the subscription path "feeds the engine's
ChangeTriggerDispatcher" -- there is no ChangeTriggerDispatcher; the actual path is
`OnUpstreamChange` then `CascadeAsync`. `VirtualTagDefinition`'s TimerInterval and
`VirtualTagContext` docs reference an EvaluationPipeline that likewise does not exist;
the real type is `EvaluateInternalAsync` inside `VirtualTagEngine`. Stale type names in
XML docs mislead maintainers searching for the named component.
**Recommendation:** Update the XML docs to name the real types (`VirtualTagEngine`,
`CascadeAsync`, `EvaluateInternalAsync`) or drop the specific name in favour of a
behavioural description.
**Resolution:** _(open)_
### Core.VirtualTags-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:404-409` |
| Status | Open |
**Description:** `VirtualTagState` records a Writes set (the `ctx.SetVirtualTag` targets
extracted by `DependencyExtractor`), but nothing in the engine reads it -- it is captured
at `Load` and never used. Declared write targets are not validated against the registered
tag set at publish time (a script writing to a non-existent virtual path is only caught
at runtime by `OnScriptSetVirtualTag`'s warning-and-drop), and they do not contribute to
the dependency graph. Either the field is dead state or an intended publish-time
validation is missing.
**Recommendation:** Use Writes to validate at `Load` that every `ctx.SetVirtualTag`
target resolves to a registered virtual tag (adding an entry to `compileFailures` on a
miss), so an operator typo is caught at publish rather than silently dropped at runtime.
If validation is deliberately deferred, remove the unused field or comment why it is
retained.
**Resolution:** _(open)_
### Core.VirtualTags-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags.Tests/` |
| Status | Resolved |
**Description:** Several behaviours of the engine have no test coverage:
(1) the cold-start `AreInputsReady` guard -- no test exercises an upstream that is
null/Bad at evaluation time and asserts the resulting tag state (see
Core.VirtualTags-002);
(2) `ctx.SetVirtualTag` cascading to a dependent of the written tag -- the existing test
only checks the written tag itself, so the gap in Core.VirtualTags-001 is invisible to
the suite;
(3) the `OnScriptSetVirtualTag` warning path for a write to a non-registered path;
(4) `EvaluateOneAsync` throwing `ArgumentException` for an unregistered path;
(5) `CoerceResult` failure mapping to BadInternalError (only the success coercion
double-to-int32 is tested);
(6) duplicate Path values in a `Load` definition list (see Core.VirtualTags-003);
(7) `Read`/`Subscribe`/`EvaluateOneAsync` calls before `Load` (the `EnsureLoaded` guard).
**Recommendation:** Add unit tests for each path above. Items (1), (2), and (6) directly
correspond to open correctness findings and would have caught them.
**Resolution:** Resolved 2026-05-22 — added 9 unit tests covering all 7 gaps: `AreInputsReady` guard publishes `BadWaitingForInitialData` and recovers; `SetVirtualTag` cascade to dependent; write to non-registered path; `EvaluateOneAsync` before `Load` and for unregistered path; `CoerceResult` failure maps to `BadInternalError`; duplicate-path rejection; `Read`/`Subscribe` before `Load`.
### Core.VirtualTags-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/DependencyGraph.cs:266-270` |
| Status | Open |
**Description:** `DependencyCycleException.BuildMessage` renders each cycle as
`string.Join(" -> ", c) + " -> " + c[0]`, presenting the SCC member list as a traversable
edge path that loops back to its first element. Tarjan's algorithm returns the members of
a strongly-connected component in stack-pop order, which is not guaranteed to be a valid
edge sequence -- for an SCC larger than 2 nodes the printed "A -> B -> C -> A" may list
edges that do not exist. The message can therefore mislead an operator debugging a cycle
into looking for an edge that is not in their config.
**Recommendation:** Either label the output as "cycle members" (a set, not an ordered
path) rather than rendering arrows, or reconstruct an actual cycle path within the SCC
(a single DFS back-edge walk) before formatting.
**Resolution:** _(open)_
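For the second option, a real cycle can be recovered from the SCC members with a breadth-first walk restricted to the component. The sketch below is illustrative only: `CyclePath`, `FindCycle`, and the `successors` delegate are hypothetical, and the delegate is assumed to return edges in the same direction the message prints its arrows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; not part of DependencyCycleException.
public static class CyclePath
{
    // Returns nodes n0, n1, ..., nk such that every consecutive pair is a real edge
    // and nk has an edge back to n0. Uses BFS, so the cycle through n0 is a shortest one.
    public static IReadOnlyList<string> FindCycle(
        IReadOnlyCollection<string> sccMembers,
        Func<string, IEnumerable<string>> successors)
    {
        var members = new HashSet<string>(sccMembers);
        var start = sccMembers.First();
        var parent = new Dictionary<string, string>();
        var visited = new HashSet<string> { start };
        var queue = new Queue<string>();
        queue.Enqueue(start);

        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var next in successors(node))
            {
                if (!members.Contains(next)) continue; // stay inside the component
                if (next == start)
                {
                    // Walk the parent chain back to the start to recover the path.
                    var path = new List<string>();
                    for (var cur = node; cur != start; cur = parent[cur]) path.Add(cur);
                    path.Add(start);
                    path.Reverse();
                    return path;
                }
                if (visited.Add(next))
                {
                    parent[next] = node;
                    queue.Enqueue(next);
                }
            }
        }
        throw new ArgumentException("Members do not form a cycle.", nameof(sccMembers));
    }

    public static string Format(IReadOnlyList<string> cycle) =>
        string.Join(" -> ", cycle) + " -> " + cycle[0];
}
```

With members listed as A, B, C but edges A to C, C to B, and B to A, joining the member list directly would print an A to B edge that does not exist, whereas the walk yields A, C, B.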
@@ -0,0 +1,207 @@
# Code Review — Core
| Field | Value |
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 6 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core-001, Core-002, Core-003 |
| 2 | OtOpcUa conventions | Core-004 |
| 3 | Concurrency & thread safety | Core-005, Core-006 |
| 4 | Error handling & resilience | Core-007, Core-008 |
| 5 | Security | Core-002 |
| 6 | Performance & resource management | Core-009 |
| 7 | Design-document adherence | Core-002, Core-003 |
| 8 | Code organization & conventions | Core-010 |
| 9 | Testing coverage | Core-011 |
| 10 | Documentation & comments | Core-012 |
## Findings
### Core-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/UserAuthorizationState.cs:50-68` |
| Status | Resolved |
**Description:** `NeedsRefresh` can never return `true` with the default field values. `AuthCacheMaxStaleness` defaults to 5 minutes and `MembershipFreshnessInterval` defaults to 15 minutes. `NeedsRefresh(utcNow)` is defined as `!IsStale(utcNow) && elapsed > MembershipFreshnessInterval`, i.e. it needs `elapsed > 15 min` AND `elapsed <= 5 min` simultaneously, which no value of `elapsed` can satisfy. The session crosses the staleness ceiling (5 min) and fails closed long before it ever reaches the 15-minute freshness boundary that is supposed to signal "kick off an async re-resolution while still serving cached memberships." Decisions #151 and #152 in `docs/v2/implementation/phase-6-2-authorization-runtime.md` intend the freshness window (re-resolve) to be the inner trigger and the staleness ceiling to be the outer hard limit; with these defaults the ordering is inverted, so the "refresh while warm" path is dead code and every long-lived session hard-fails authorization after 5 minutes.
**Recommendation:** Either swap the defaults so `MembershipFreshnessInterval` (e.g. 5 min) is strictly less than `AuthCacheMaxStaleness` (e.g. 15 min) — matching the doc's stated intent — or, if the 5/15 values are correct, redefine which window is the refresh trigger and which is the fail-closed ceiling. Add a unit test asserting `NeedsRefresh` returns `true` for at least one point in time with the production defaults.
**Resolution:** Resolved 2026-05-22 — swapped the defaults so `MembershipFreshnessInterval` is 5 min and `AuthCacheMaxStaleness` is 15 min (freshness = inner re-resolve trigger, staleness = outer fail-closed ceiling); added a `NeedsRefresh_FiresWithin_ProductionDefault_Windows` regression test.
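The window arithmetic can be checked in isolation. The record below is a hypothetical reduction of `UserAuthorizationState` to its two intervals, assuming `IsStale` means `elapsed > AuthCacheMaxStaleness` as the description implies; `AuthWindows` and `HasReachableRefreshWindow` are illustrative names.

```csharp
using System;

// Hypothetical reduction of the two timing windows; not the production type.
public sealed record AuthWindows(TimeSpan MembershipFreshnessInterval, TimeSpan AuthCacheMaxStaleness)
{
    // Outer hard limit: past this, the session fails closed.
    public bool IsStale(TimeSpan elapsed) => elapsed > AuthCacheMaxStaleness;

    // Inner trigger: re-resolve memberships while still serving the cached ones.
    public bool NeedsRefresh(TimeSpan elapsed) =>
        !IsStale(elapsed) && elapsed > MembershipFreshnessInterval;

    // The refresh window is non-empty only if freshness is strictly inside staleness.
    public bool HasReachableRefreshWindow => MembershipFreshnessInterval < AuthCacheMaxStaleness;
}
```

With freshness at 15 minutes and staleness at 5 minutes, `NeedsRefresh` is false for every elapsed time. With the values swapped, it is true for any elapsed time above 5 and up to 15 minutes.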
### Core-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs:24-50` |
| Status | Resolved |
**Description:** `TriePermissionEvaluator.Authorize` never compares the session's `AuthGenerationId` against the generation of the trie it evaluates against. It calls `_cache.GetTrie(scope.ClusterId)` — the current-generation shortcut — and authorizes against whatever generation the cache happens to hold. `UserAuthorizationState` carries `AuthGenerationId` precisely so a stale session can be detected, and the Phase 6.2 design (`phase-6-2-authorization-runtime.md` adversarial-review item #3 "Redundancy-safe invalidation", plus the §Scope `PermissionTrieCache + freshness` row) requires the hot-path call to look up `CurrentGenerationId` and force a re-evaluation on mismatch. As written, a session bound at generation N silently evaluates against generation N+1 the instant another node publishes — grants added or removed in N+1 take effect for that session without the intended generation-stamp re-check, and the provenance returned in `AuthorizationDecision` misreports which generation produced the verdict.
**Recommendation:** In `Authorize`, after resolving the trie, compare `trie.GenerationId` to `session.AuthGenerationId`. On mismatch either fetch the session's bound generation via `_cache.GetTrie(clusterId, session.AuthGenerationId)` and evaluate against it, or signal the caller to re-resolve the session's auth state before retrying. Add a test for the publish-during-session scenario.
**Resolution:** Resolved 2026-05-22 — `Authorize` now compares `trie.GenerationId` to `session.AuthGenerationId` and, on mismatch, re-fetches the session's bound generation via `_cache.GetTrie(clusterId, session.AuthGenerationId)`, failing closed when that generation has been pruned; added publish-during-session and pruned-generation regression tests.
### Core-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrie.cs:80-98` |
| Status | Resolved |
**Description:** `WalkSystemPlatform` records every Galaxy folder-segment grant with `NodeAclScopeKind.Equipment` (see the comment at lines 82-86) because `NodeAclScopeKind` has no `FolderSegment` member. The functional union of permission flags is unaffected, but the `MatchedGrant.Scope` carried in `AuthorizationDecision.Provenance` is wrong for Galaxy nodes: a grant anchored at a namespace-root folder and a grant anchored at a deep sub-folder both report `Equipment`, and a namespace-level grant is indistinguishable from a folder-level grant in the audit trail and the Admin UI "Probe this permission" diagnostic. The Phase 6.2 design (adversarial-review item #6) calls for a dedicated `FolderSegment` scope level. The current code is a known shortcut but references only an untracked "TODO" with no issue ID.
**Recommendation:** Add a `FolderSegment` member to `NodeAclScopeKind` and use it in `WalkSystemPlatform` and `PermissionTrieBuilder` so Galaxy folder grants report their true scope. If the enum change is deferred, file a tracked issue and reference its ID in the code comment.
**Resolution:** Resolved 2026-05-22 — added `FolderSegment` to `NodeAclScopeKind`; `WalkSystemPlatform` now reports `FolderSegment` instead of `Equipment` for each visited Galaxy folder level; added three regression tests asserting the correct scope is reported in `MatchedGrant.Scope`.
### Core-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs:55,72,87` |
| Status | Open |
**Description:** `DriverHost` is a library type whose async calls (`driver.InitializeAsync`, `driver.ShutdownAsync`) do not use `ConfigureAwait(false)`, whereas the sibling `CapabilityInvoker` and `AlarmSurfaceInvoker` in the same module consistently do. The server host has no synchronization context so behaviour is currently correct, but the inconsistency is a maintenance hazard and a deviation from the established convention in `Core.Resilience`.
**Recommendation:** Add `.ConfigureAwait(false)` to the three awaited calls in `DriverHost.RegisterAsync`, `UnregisterAsync`, and `DisposeAsync`.
**Resolution:** _(open)_
### Core-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs:59-70` |
| Status | Resolved |
**Description:** `Prune` mutates the `ConcurrentDictionary` with a plain indexer assignment (`_byCluster[clusterId] = new ClusterEntry(...)`) after a separate `TryGetValue` read. If `Install` runs concurrently for the same cluster, the `AddOrUpdate` in `Install` and the indexer write in `Prune` race: `Prune` can read an entry, `Install` then adds a newer generation via `AddOrUpdate`, and `Prune`'s unconditional indexer write then overwrites the entry — silently dropping the just-installed newest generation and its `Current` pointer. The class is documented as a process-singleton accessed on the hot path while publishes install new tries, so the race is reachable.
**Recommendation:** Make `Prune` use an atomic compare-and-swap loop — `_byCluster.TryUpdate(clusterId, prunedEntry, observedEntry)` retried until it succeeds or the key is gone — or perform the prune inside an `AddOrUpdate` update factory.
**Resolution:** Resolved 2026-05-22 — changed `ClusterEntry` from `sealed record` to `sealed class` (enabling reference-equality CAS via `TryUpdate`); `Prune` now uses a read-compute-`TryUpdate` retry loop that restarts if another thread updates the entry between the read and the write; added regression tests asserting the current generation is preserved after a concurrent install + prune sequence.
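The read-compute-`TryUpdate` pattern can be sketched independently of the cache's real types. Everything named here (`GenerationEntry`, `GenerationCache`, `keepLatest`) is hypothetical; the sketch only assumes that the entry is a class, so `ConcurrentDictionary.TryUpdate` compares the observed entry by reference.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// A class (not a record) so TryUpdate's comparison is reference equality.
public sealed class GenerationEntry
{
    public GenerationEntry(long current, IReadOnlyList<long> generations)
    {
        Current = current;
        Generations = generations;
    }

    public long Current { get; }
    public IReadOnlyList<long> Generations { get; }
}

// Hypothetical stand-in for a per-cluster generation cache.
public sealed class GenerationCache
{
    private readonly ConcurrentDictionary<string, GenerationEntry> _byCluster = new();

    public void Install(string clusterId, long generation) =>
        _byCluster.AddOrUpdate(
            clusterId,
            _ => new GenerationEntry(generation, new[] { generation }),
            (_, old) => new GenerationEntry(
                Math.Max(old.Current, generation),
                old.Generations.Append(generation).ToArray()));

    public void Prune(string clusterId, int keepLatest)
    {
        if (keepLatest < 1) throw new ArgumentOutOfRangeException(nameof(keepLatest));

        while (_byCluster.TryGetValue(clusterId, out var observed))
        {
            if (observed.Generations.Count <= keepLatest) return;

            var kept = observed.Generations.OrderByDescending(g => g).Take(keepLatest).ToArray();
            var pruned = new GenerationEntry(observed.Current, kept);

            // Succeeds only if no other thread replaced the entry since we read it.
            if (_byCluster.TryUpdate(clusterId, pruned, observed)) return;
            // Lost the race (for example to a concurrent Install); re-read and retry.
        }
    }

    public bool TryGet(string clusterId, out GenerationEntry entry) =>
        _byCluster.TryGetValue(clusterId, out entry);
}
```

Because a failed `TryUpdate` discards the computed entry and re-reads, a generation installed between the read and the write is never overwritten by a stale prune result.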
### Core-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64` |
| Status | Resolved |
**Description:** `BuildAddressSpaceAsync` is not guarded against being called more than once. A second call subscribes a second `_alarmForwarder` to `IAlarmSource.OnAlarmEvent` and overwrites the `_alarmForwarder` field, so the first delegate is leaked (still subscribed, never unsubscribed because `Dispose` only removes the field's current value). Every alarm transition would then be delivered to its sink twice. The address-space rebuild path on Galaxy redeploy (`DeployWatcher` → `IRediscoverable.OnRediscoveryNeeded` → server rebuilds the address space) is exactly the scenario where a node manager could legitimately be re-walked. There is also no check of the `_disposed` flag at the top of the method.
**Recommendation:** Either guard `BuildAddressSpaceAsync` so a second call throws `InvalidOperationException` (document it single-shot), or unsubscribe the previous `_alarmForwarder` and clear `_alarmSinks` before re-walking. Also check `_disposed` and throw `ObjectDisposedException` if already disposed.
**Resolution:** Resolved 2026-05-22 — `BuildAddressSpaceAsync` now checks `_disposed` (throws `ObjectDisposedException`) and tears down the previous alarm forwarder + clears the sink registry before re-walking so a Galaxy-redeploy rebuild does not double-subscribe the forwarder; added three regression tests covering double-build, sink-count after rebuild, and post-dispose throw.
### Core-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs:75-83` |
| Status | Resolved |
**Description:** `UnsubscribeAsync` always routes through `_defaultHost`, even when an `IPerCallHostResolver` is wired and the original `SubscribeAsync` fanned the subscription out to a non-default host. The `IAlarmSubscriptionHandle` is opaque here and carries no host association, so an unsubscribe for a subscription created against host B runs through host A's resilience pipeline. In a multi-host driver this charges the wrong host's circuit breaker / bulkhead and, if host A is open while host B is healthy, can spuriously block a valid unsubscribe. The XML doc claims it routes "for parity" with `SubscribeAsync` but subscribe is per-host and unsubscribe is not.
**Recommendation:** Carry the resolved host on the `IAlarmSubscriptionHandle` (or in a handle→host map kept by `AlarmSurfaceInvoker`) so `UnsubscribeAsync` routes through the same host's pipeline the subscription was created on.
**Resolution:** Resolved 2026-05-22 — `SubscribeAsync` now wraps each driver handle in a `HostBoundHandle` (private `IAlarmSubscriptionHandle` implementation) that carries the resolved host name; `UnsubscribeAsync` unwraps it and routes through the recorded host's pipeline, falling back to the default host for handles not created by this invoker; added two regression tests verifying per-host routing and single-host fallback.
### Core-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64` |
| Status | Open |
**Description:** The XML summary of `BuildAddressSpaceAsync` states "Driver exceptions are isolated per decision #12 — the driver's subtree is marked Faulted, but other drivers remain available." The method body contains no such isolation: an exception from `discovery.DiscoverAsync` propagates straight out unhandled, and nothing here marks a subtree Faulted. The isolation is presumably done by the server-layer caller, but the comment asserts behaviour this class does not implement.
**Recommendation:** Either implement the documented isolation in `GenericDriverNodeManager`, or correct the XML doc to state that exception isolation is the caller's responsibility and name the type that performs it.
**Resolution:** _(open)_
### Core-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs:121-128` |
| Status | Open |
**Description:** `ExecuteWriteAsync` calls `_optionsAccessor()` three times for a single non-idempotent write (once for the `with` expression, once inside the dictionary initializer for `.Resolve(...)`, plus the discarded base). On the per-write hot path it rebuilds a fresh `DriverResilienceOptions` and a one-entry dictionary on every non-idempotent write, and the redundant accessor calls could observe two different snapshots if an Admin edit lands between them. Phase 6.1 budgets a 1% pipeline overhead; this is unnecessary allocation plus a minor consistency hazard.
**Recommendation:** Capture `var options = _optionsAccessor();` once at the top of the non-idempotent branch and derive both the `with` and the `Resolve` call from that snapshot. Consider caching the no-retry pipeline keyed on `(hostName, non-idempotent)`.
**Resolution:** _(open)_
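A minimal sketch of the single-snapshot pattern the recommendation describes. The record and class names below (`RetryPolicy`, `ResilienceOptionsSketch`, `WriteInvokerSketch`) are hypothetical stand-ins, not the real `DriverResilienceOptions` / `CapabilityInvoker` shapes; only the idea of calling the accessor once and deriving both the `with` copy and the `Resolve` result from that one snapshot is taken from the finding.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins; the real options type is richer than this.
public sealed record RetryPolicy(int RetryCount);

public sealed record ResilienceOptionsSketch(
    IReadOnlyDictionary<string, RetryPolicy> Overrides,
    RetryPolicy Default)
{
    // Returns the per-capability override, or the default when none is set.
    public RetryPolicy Resolve(string capability) =>
        Overrides.TryGetValue(capability, out var policy) ? policy : Default;
}

public sealed class WriteInvokerSketch
{
    private readonly Func<ResilienceOptionsSketch> _optionsAccessor;

    public WriteInvokerSketch(Func<ResilienceOptionsSketch> optionsAccessor) =>
        _optionsAccessor = optionsAccessor;

    // Builds the no-retry options for a non-idempotent write from ONE snapshot.
    public ResilienceOptionsSketch BuildNoRetryOptions()
    {
        var options = _optionsAccessor(); // the only accessor call
        var noRetry = options.Resolve("Write") with { RetryCount = 0 };
        return options with
        {
            Overrides = new Dictionary<string, RetryPolicy> { ["Write"] = noRetry },
        };
    }
}
```

Because both derived values come from `options`, an Admin edit that lands mid-call cannot produce a mixed view. Caching the resulting pipeline per host would remove the allocation as well, but that is a separate step.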
### Core-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/DriverResilienceOptions.cs:45-52` |
| Status | Open |
**Description:** `DriverResilienceOptions.Resolve` indexes the tier-default dictionary directly (`defaults[capability]`) with no fallback. Any future addition to `DriverCapability` that is not also added to all three tier tables in `GetTierDefaults` will make `Resolve` throw `KeyNotFoundException` at runtime on the capability hot path rather than failing at build time. The two are coupled by convention only.
**Recommendation:** Either add a `default` arm to `Resolve` returning a conservative policy (and logging), or add a unit-test invariant asserting every `DriverCapability` value is present in each tier's default table.
**Resolution:** _(open)_
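The unit-test invariant could look roughly like the sketch below. The enums `CapabilitySketch` and `TierSketch` and the `int`-valued table are hypothetical placeholders for `DriverCapability`, the tier type, and whatever policy type `GetTierDefaults` returns; the only real API relied on is `Enum.GetValues<T>()` (.NET 5 and later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical enums and table shape, for illustration only.
public enum CapabilitySketch { Read, Write, Subscribe }
public enum TierSketch { A, B, C }

public static class TierDefaultsInvariant
{
    // Returns every (tier, capability) pair that has no entry in the tier's default table.
    public static List<(TierSketch, CapabilitySketch)> FindMissing(
        Func<TierSketch, IReadOnlyDictionary<CapabilitySketch, int>> getTierDefaults)
    {
        return Enum.GetValues<TierSketch>()
            .SelectMany(tier => Enum.GetValues<CapabilitySketch>()
                .Where(cap => !getTierDefaults(tier).ContainsKey(cap))
                .Select(cap => (tier, cap)))
            .ToList();
    }
}
```

A test would assert that `FindMissing(...)` is empty, so adding an enum member without updating all three tables fails in CI instead of throwing `KeyNotFoundException` on the hot path.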
### Core-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieBuilder.cs:58-75` |
| Status | Open |
**Description:** `PermissionTrieBuilder.Descend` has a two-branch behaviour: with a `scopePaths` lookup it descends the real hierarchy; without one it falls back to placing every non-cluster row directly under the root keyed by `ScopeId` ("works for deterministic tests, not for production"). The fallback silently produces a structurally incorrect trie when `scopePaths` is null or a row's `ScopeId` is missing — a UnsLine-scoped grant ends up as a direct child of the root, so `WalkEquipment` / `WalkSystemPlatform` never reach it and the grant is effectively dropped, with no diagnostic. There is no test asserting the production multi-level descent versus the fallback.
**Recommendation:** Add unit tests covering `Build` with `scopePaths` producing the correct multi-level trie and the missing-`ScopeId` fallback. Have `Descend` surface a diagnostic (or throw outside test configuration) when a sub-cluster row cannot be located in `scopePaths`.
**Resolution:** _(open)_
### Core-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Stability/WedgeDetector.cs:26`, `src/Core/ZB.MOM.WW.OtOpcUa.Core/Observability/DriverHealthReport.cs:11-22` |
| Status | Open |
**Description:** Two stale doc comments. (1) `WedgeDetector` — the `<summary>` above the constructor reads "Whether the driver reported itself `DriverState.Healthy` at construction." The constructor takes only a `TimeSpan threshold` and the detector is documented as stateless; the comment describes nothing the constructor does. (2) `DriverHealthReport` — the `<remarks>` state matrix lists Unknown, Initializing, Healthy, Degraded, Faulted but `Aggregate` (lines 42-44) also folds `DriverState.Reconnecting` into the Degraded verdict. `Reconnecting` is a real `DriverState` member absent from the documented matrix.
**Recommendation:** Replace the `WedgeDetector` constructor `<summary>` with an accurate description (e.g. "Construct with the wedge-detection threshold; values below 60 s clamp to 60 s"). Add `Reconnecting` to the `DriverHealthReport` `<remarks>` state matrix and state it maps to Degraded.
**Resolution:** _(open)_
@@ -0,0 +1,238 @@
# Code Review — Driver.AbCip.Cli
| Field | Value |
|---|---|
| Module | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 6 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.AbCip.Cli-001, Driver.AbCip.Cli-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Driver.AbCip.Cli-003 |
| 4 | Error handling & resilience | Driver.AbCip.Cli-001, Driver.AbCip.Cli-004 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Driver.AbCip.Cli-005 |
| 7 | Design-document adherence | Driver.AbCip.Cli-006 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.AbCip.Cli-007 |
| 10 | Documentation & comments | Driver.AbCip.Cli-008 |
## Findings
<!-- One ### entry per finding. IDs are <Module>-NNN, sequential within the module,
never reused. Findings are never deleted — close them by changing Status and
completing Resolution. -->
### Driver.AbCip.Cli-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli/Commands/WriteCommand.cs:70-85` |
| Status | Resolved |
**Description:** `ParseValue` parses every numeric Logix type with the BCL `*.Parse`
methods (`sbyte.Parse`, `short.Parse`, `int.Parse`, `float.Parse`, ...). These throw
the raw `FormatException` and `OverflowException` on bad operator input. The module's
own test `ParseValue_non_numeric_for_numeric_types_throws` confirms a raw
`FormatException` escapes for `DInt`. Meanwhile the `Bool` branch and the `_ =>`
default branch throw the CLI-friendly `CliFx.Exceptions.CommandException` with an
actionable message. The result is inconsistent operator UX: a typo in a boolean
value prints "Boolean value 'x' is not recognised...", but a typo in a numeric
value (`write -v 12x --type DInt`, or an out-of-range `write -v 99999999999 --type
Int`) escapes uncaught and CliFx renders a full .NET stack trace instead of a
one-line error. CliFx only formats `CommandException` cleanly.
**Recommendation:** Wrap the numeric `*.Parse` calls (or the whole `switch`) in a
`try`/`catch (Exception ex) when (ex is FormatException or OverflowException)` that
rethrows as a `CommandException` with the raw value, the target `--type`, and the
valid range — mirroring the `ParseBool` failure message.
**Resolution:** Resolved 2026-05-22 — wrapped the `ParseValue` switch in `try/catch (FormatException or OverflowException)` that rethrows as `CommandException` with the raw value and type; updated the previously-passing `ParseValue_non_numeric_for_numeric_types_throws` test to assert `CommandException` and added two new tests covering overflow and actionable message content.
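As an illustration of the exception-filter pattern named in the recommendation, here is a small sketch for a single type. `CommandExceptionSketch` is a local stand-in for `CliFx.Exceptions.CommandException` so the block compiles without the package, and `ParseDInt`, the message wording, and the use of the invariant culture are assumptions, not the module's actual code.

```csharp
using System;
using System.Globalization;

// Stand-in for CliFx.Exceptions.CommandException so the sketch compiles on its own.
public sealed class CommandExceptionSketch : Exception
{
    public CommandExceptionSketch(string message, Exception inner) : base(message, inner) { }
}

public static class ParseValueSketch
{
    // Parses a 32-bit DINT, converting BCL parse failures into a one-line CLI error.
    public static int ParseDInt(string raw)
    {
        try
        {
            return int.Parse(raw, CultureInfo.InvariantCulture);
        }
        catch (Exception ex) when (ex is FormatException or OverflowException)
        {
            // Only the two expected parse failures are translated; anything else propagates.
            throw new CommandExceptionSketch(
                $"Value '{raw}' is not valid for --type DInt " +
                $"(expected an integer in {int.MinValue}..{int.MaxValue}).", ex);
        }
    }
}
```

The filter keeps unrelated exceptions untouched, so a genuine bug still surfaces with its stack trace while operator typos produce the friendly message.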
### Driver.AbCip.Cli-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli/Commands/ProbeCommand.cs:21-23`; `Commands/ReadCommand.cs:24-25`; `Commands/SubscribeCommand.cs:20-22` |
| Status | Resolved |
**Description:** `ProbeCommand`, `ReadCommand`, and `SubscribeCommand` expose
`--type` as a free `AbCipDataType` enum option with no exclusion of
`AbCipDataType.Structure`. Only `WriteCommand` rejects `Structure` (with an explicit
`CommandException`). Passing `probe/read/subscribe --type Structure` synthesises a
tag with `DataType = Structure` and no `Members` declared. The driver read path
treats a memberless Structure tag as a black box and routes it to the per-tag
fallback, where `runtime.DecodeValue(AbCipDataType.Structure, ...)` runs with no
declared layout — the operator gets either an opaque value or a confusing status
code rather than the clean "Structure writes need an explicit member layout"
guidance `write` gives. The `read` doc comment even claims "UDT / Structure reads
are out of scope here", but the code does not enforce it.
**Recommendation:** Reject `AbCipDataType.Structure` in `ProbeCommand`,
`ReadCommand`, and `SubscribeCommand` `ExecuteAsync` with the same `CommandException`
pattern `WriteCommand` uses, or factor a shared `RejectStructure(DataType)` guard
into `AbCipCommandBase`.
**Resolution:** Resolved 2026-05-22 — added `RejectStructure(AbCipDataType)` static helper to `AbCipCommandBase` that throws `CommandException` for `Structure`; called at the top of `ExecuteAsync` in `ProbeCommand`, `ReadCommand`, and `SubscribeCommand`.
### Driver.AbCip.Cli-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli/Commands/SubscribeCommand.cs:50-56,60-61` |
| Status | Open |
**Description:** The `OnDataChange` handler writes change lines to `console.Output`
(a `TextWriter`) from the driver's poll-engine callback thread, while the command's
main flow concurrently writes the "Subscribed to ... Ctrl+C to stop." line on the
CLI thread. `TextWriter.WriteLine` is not guaranteed thread-safe; concurrent writes
from the poll thread and the main thread can interleave or, in the worst case,
corrupt buffered output. The window is small (one main-thread write right after
`SubscribeAsync`) but it exists, and any future addition of main-thread output
during the watch loop widens it.
**Recommendation:** Emit the "Subscribed..." banner before registering the
`OnDataChange` handler (or before `SubscribeAsync`), or guard all `console.Output`
writes during the subscription with a shared lock so poll-thread and main-thread
output cannot interleave.
**Resolution:** _(open)_
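The shared-lock option from the recommendation might be sketched as follows. `LockedOutputSketch` is a hypothetical helper, not a type in the module; it simply funnels every write to one `TextWriter` through a single lock so poll-thread and main-thread lines cannot interleave.

```csharp
using System;
using System.IO;

// Serialises writes to a shared TextWriter from multiple threads.
public sealed class LockedOutputSketch
{
    private readonly object _gate = new();
    private readonly TextWriter _output;

    public LockedOutputSketch(TextWriter output) => _output = output;

    // Writes one complete line while holding the lock.
    public void WriteLine(string line)
    {
        lock (_gate)
        {
            _output.WriteLine(line);
        }
    }
}
```

The BCL also offers `TextWriter.Synchronized(writer)`, which returns a thread-safe wrapper; whether that fits the CliFx console abstraction would need checking. Emitting the banner before the handler is registered avoids the race without any locking.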
### Driver.AbCip.Cli-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli/Commands/SubscribeCommand.cs:28,58`; `AbCipCommandBase.cs:26-34` |
| Status | Open |
**Description:** `--interval-ms` (`IntervalMs`) is taken verbatim and passed as
`TimeSpan.FromMilliseconds(IntervalMs)` to `SubscribeAsync` with no validation. A
zero or negative value produces a non-positive `TimeSpan`; the option description
claims "PollGroupEngine floors sub-250ms values" but says nothing about `0` or
negatives, and the flooring behaviour is the engine's, not the CLI's — relying on a
downstream component to sanitise operator input is fragile. `--timeout-ms` on
`AbCipCommandBase` has the same gap (a negative value yields a negative `TimeSpan`).
**Recommendation:** Validate `IntervalMs > 0` and `TimeoutMs > 0` at the top of
`ExecuteAsync` / in `AbCipCommandBase`, throwing a `CommandException` with the
accepted range when out of bounds.
**Resolution:** _(open)_
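A guard along the lines of the recommendation could be as small as the sketch below. `OptionGuardsSketch.PositiveMilliseconds` is a hypothetical helper, and it throws `ArgumentOutOfRangeException` only to stay self-contained; in the CLI the same check would presumably throw `CommandException` instead.

```csharp
using System;

public static class OptionGuardsSketch
{
    // Converts a millisecond option to a TimeSpan, rejecting zero and negative values.
    public static TimeSpan PositiveMilliseconds(int value, string optionName)
    {
        if (value <= 0)
        {
            throw new ArgumentOutOfRangeException(
                optionName, value, $"{optionName} must be greater than 0 ms.");
        }

        return TimeSpan.FromMilliseconds((double)value);
    }
}
```

Calling it for both `--interval-ms` and `--timeout-ms` at the top of `ExecuteAsync` would keep the validation in one place rather than relying on the poll engine's flooring.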
### Driver.AbCip.Cli-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/DriverCommandBase.cs:51-59` |
| Status | Open |
**Description:** `ConfigureLogging` assigns a freshly created Serilog logger to the
process-global `Log.Logger` but never calls `Log.CloseAndFlush()`. For a short-lived
one-shot command (`probe`, `read`, `write`) the process exit flushes the console
sink, so the practical impact is nil. For `subscribe` — a long-running command
terminated by Ctrl+C — buffered log lines emitted just before cancellation can be
lost on abrupt exit. (This lives in the shared `Driver.Cli.Common` base, so it is
noted here as it affects the AB CIP CLI; the canonical fix belongs in that shared
module's review.)
**Recommendation:** Register `Log.CloseAndFlush()` on process exit (e.g. via
`AppDomain.ProcessExit` or a `finally` in the command), or have the CLI use a
disposable logger scoped to `ExecuteAsync`.
**Resolution:** _(open)_
### Driver.AbCip.Cli-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli/AbCipCommandBase.cs:29-34` |
| Status | Open |
**Description:** `AbCipCommandBase` overrides the abstract `DriverCommandBase.Timeout`
property with a getter derived from `TimeoutMs` and an empty `init` body
(`init { /* driven by TimeoutMs */ }`). Because the override has no
`[CommandOption]` attribute, CliFx never binds it, so the empty `init` is unreachable
in normal CLI use. However, an empty `init` accessor silently discards any
assignment — if a future caller or test constructs the command via an object
initializer (`new ReadCommand { Timeout = ... }`) the assignment is silently dropped
with no compiler warning. This is a latent correctness trap rather than a current
bug.
**Recommendation:** Either drop the `init` accessor entirely (make the override a
get-only expression-bodied property) or have the empty `init` throw
`NotSupportedException` to make the "driven by TimeoutMs" contract explicit and
fail-fast.
**Resolution:** _(open)_
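The fail-fast variant of the recommendation can be sketched as below. The class names and the 2000 ms default are hypothetical; the sketch assumes the base property is declared `{ get; init; }`, which the description of the empty `init` body implies.

```csharp
using System;

// Hypothetical stand-in for DriverCommandBase.
public abstract class DriverCommandBaseSketch
{
    public abstract TimeSpan Timeout { get; init; }
}

// Hypothetical stand-in for AbCipCommandBase.
public class AbCipCommandSketch : DriverCommandBaseSketch
{
    // Hypothetical default; the real option default is not shown in the finding.
    public int TimeoutMs { get; init; } = 2000;

    public override TimeSpan Timeout
    {
        get => TimeSpan.FromMilliseconds((double)TimeoutMs);
        // Fail fast instead of silently discarding the assignment.
        init => throw new NotSupportedException(
            "Timeout is derived from TimeoutMs; set TimeoutMs instead.");
    }
}
```

With this shape, `new AbCipCommandSketch { Timeout = ... }` throws at construction time, so a test that sets the wrong property fails immediately rather than running with the default.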
### Driver.AbCip.Cli-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli.Tests/WriteCommandParseValueTests.cs` |
| Status | Open |
**Description:** The only test file covers `WriteCommand.ParseValue` and
`ReadCommand.SynthesiseTagName` — both pure static helpers. There is no coverage for
`AbCipCommandBase.BuildOptions` (the flag-to-`AbCipDriverOptions` mapping that all
four commands depend on) or `DriverInstanceId`. `BuildOptions` is pure and trivially
unit-testable yet untested: a regression that, say, flipped `EnableAlarmProjection`
back on or dropped `Probe.Enabled = false` would not be caught — and the comment
explicitly warns the probe loop "would race the operator's own reads", so that
mapping is behaviourally load-bearing. The `ExecuteAsync` bodies are reasonably left
untested (they need a fake `AbCipDriver` or hardware), consistent with the other
driver CLIs.
**Recommendation:** Add unit tests asserting `BuildOptions` produces
`Probe.Enabled == false`, `EnableControllerBrowse == false`,
`EnableAlarmProjection == false`, the expected single `AbCipDeviceOptions`
(`HostAddress`, `PlcFamily`, `DeviceName`), the supplied tag list, and the `Timeout`
derived from `TimeoutMs`.
**Resolution:** _(open)_
### Driver.AbCip.Cli-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `docs/Driver.AbCip.Cli.md:8-9` |
| Status | Open |
**Description:** `docs/Driver.AbCip.Cli.md` opens with "Second of four driver
test-client CLIs (Modbus -> AB CIP -> AB Legacy -> S7 -> TwinCAT)." The count "four"
contradicts the chain that follows it (five names) and contradicts
`docs/DriverClis.md`, which documents six CLIs (Modbus, AB CIP, AB Legacy, S7,
TwinCAT, FOCAS). The FOCAS CLI shipped alongside the Tier-C work, so the AB CIP
doc's "four" and the truncated chain are both stale.
**Recommendation:** Update the sentence to "Second of six driver test-client CLIs"
and complete the chain (Modbus -> AB CIP -> AB Legacy -> S7 -> TwinCAT -> FOCAS), or
drop the explicit count and link `docs/DriverClis.md` as the authoritative roster.
**Resolution:** _(open)_
@@ -0,0 +1,252 @@
# Code Review — Driver.AbCip
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.AbCip-001, Driver.AbCip-002, Driver.AbCip-003, Driver.AbCip-004, Driver.AbCip-005 |
| 2 | OtOpcUa conventions | Driver.AbCip-006, Driver.AbCip-007 |
| 3 | Concurrency & thread safety | Driver.AbCip-008, Driver.AbCip-009 |
| 4 | Error handling & resilience | Driver.AbCip-010, Driver.AbCip-011 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Driver.AbCip-012 |
| 7 | Design-document adherence | Driver.AbCip-013 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.AbCip-014 |
| 10 | Documentation & comments | Driver.AbCip-015 |
## Findings
### Driver.AbCip-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `AbCipDriver.cs:111`, `AbCipDriver.cs:163-167` |
| Status | Resolved |
**Description:** `InitializeAsync(string driverConfigJson, ...)` never reads `driverConfigJson`. It builds all device/tag state from `_options`, captured at construction time. `ReinitializeAsync` calls `ShutdownAsync` then `InitializeAsync(driverConfigJson, ...)` and the JSON it is handed is silently discarded. `ReinitializeAsync` is documented (class remarks, lines 18-21) as the Tier-B escape hatch and is the IDriver entry point for picking up changed config. As written, a reinitialize with an updated config JSON (new device, new tag, changed timeout) applies none of the changes; the driver keeps running stale construction-time options. There is no validation that the passed JSON even matches the live options.
**Recommendation:** Either parse `driverConfigJson` inside `InitializeAsync` (re-deriving `AbCipDriverOptions` the way `AbCipDriverFactoryExtensions.CreateInstance` does, so config changes take effect on reinit), or, if config is intentionally immutable for the instance lifetime, document explicitly that AbCip ignores the parameter and assert the JSON is structurally identical to the construction options. Silently discarding it is the worst of both worlds.
**Resolution:** Resolved 2026-05-22 — extracted `AbCipDriverFactoryExtensions.ParseOptions` and `InitializeAsync` now re-parses a content-bearing `driverConfigJson`, replacing `_options` (and recreating the alarm projection) so `ReinitializeAsync` applies config changes; a blank/empty-object JSON keeps construction-time options for the test seam.
### Driver.AbCip-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `AbCipStatusMapper.cs:65-78` |
| Status | Resolved |
**Description:** `MapLibplctagStatus` maps negative libplctag codes that do not match the libplctag.NET `Status` enum / native `libplctag.h` constants. `LibplctagTagRuntime.GetStatus()` returns `(int)_tag.GetStatus()`, the underlying value of the `Status` enum, whose members carry the native `PLCTAG_ERR_*` integer values. The real constants are `PLCTAG_ERR_BAD_CONNECTION = -7` (the only one the code gets right), `PLCTAG_ERR_NOT_FOUND = -18` (code expects -14), `PLCTAG_ERR_NOT_ALLOWED = -19` (code expects -16), `PLCTAG_ERR_OUT_OF_BOUNDS = -22` (code expects -17), `PLCTAG_ERR_TIMEOUT = -32` (code expects -5). Consequently a real timeout, not-found, not-allowed, or out-of-bounds error all fall through the switch to the `_ => BadCommunicationError` default. The driver reports `BadCommunicationError` for a non-existent tag instead of `BadNodeIdUnknown`, for a read-only tag instead of `BadNotWritable`, and for a timeout instead of `BadTimeout`. This defeats the transient-vs-permanent error classification the resilience pipeline relies on.
**Recommendation:** Replace the hand-typed integer literals with the libplctag.NET `Status` enum members (`Status.ErrorTimeout`, `Status.ErrorNotFound`, `Status.ErrorNotAllowed`, `Status.ErrorOutOfBounds`, `Status.ErrorBadConnection`, etc.), or at minimum correct the integer values to -32 / -18 / -19 / -22. Map `Status.Pending` explicitly rather than treating "any positive value" as `GoodMoreData`.
**Resolution:** Resolved 2026-05-22 — `MapLibplctagStatus` now switches on the libplctag.NET `Status` enum members (Ok/Pending/ErrorTimeout/ErrorNotFound/ErrorNotAllowed/ErrorOutOfBounds/…) instead of hand-typed integers; the `int` overload casts to `Status` so the `GetStatus()` seam stays correct against the wrapper's contiguous renumbering. Note: the live libplctag.NET 1.5.2 `Status` enum is renumbered contiguously, so the correct underlying integers are -32/-19/-18/-27, not the native -32/-18/-19/-22 the finding suggested; switching on the enum members sidesteps the hazard entirely.
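To illustrate why switching on enum members sidesteps the numbering hazard, here is a reduced sketch. `StatusSketch` is a hypothetical stand-in for the libplctag.NET `Status` enum with its numeric values deliberately left implicit, the returned strings stand in for OPC UA status codes, and mapping `Ok` to `"Good"` is an assumption; only the three error mappings named in the description are taken from the finding.

```csharp
// Hypothetical stand-in for the libplctag.NET Status enum; numeric values intentionally unspecified.
public enum StatusSketch { Ok, ErrorBadConnection, ErrorNotFound, ErrorNotAllowed, ErrorTimeout }

public static class StatusMapperSketch
{
    // Maps by member name, so the result does not depend on the underlying integers.
    public static string Map(StatusSketch status) => status switch
    {
        StatusSketch.Ok => "Good",                         // assumed mapping
        StatusSketch.ErrorTimeout => "BadTimeout",
        StatusSketch.ErrorNotFound => "BadNodeIdUnknown",
        StatusSketch.ErrorNotAllowed => "BadNotWritable",
        _ => "BadCommunicationError",                      // everything else, as in the finding
    };
}
```

If the wrapper renumbers its enum again, a mapping written this way keeps working, whereas integer literals would silently fall through to the default arm.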
### Driver.AbCip-003
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `AbCipUdtMemberLayout.cs:32-54`, `AbCipDriver.cs:426-430`, `AbCipUdtReadPlanner.cs:48` |
| Status | Resolved |
**Description:** The whole-UDT read path (`ReadGroupAsync`) decodes each grouped member at the byte offset produced by `AbCipUdtMemberLayout.TryBuild`, which computes offsets purely from the declaration order of the configured `AbCipStructureMember` list under natural-alignment rules. Logix does not guarantee that the controller lays UDT members out in declaration order: the Studio 5000 compiler reorders members (largest-first packing, BOOL host bytes, nested-struct padding) and the on-wire offsets only come from the CIP Template Object. The class remarks on `AbCipUdtMemberLayout` and `driver-specs.md` both acknowledge this. The decoder for the real shape (`CipTemplateObjectDecoder` / `AbCipTemplateCache`) exists and is populated by `FetchUdtShapeAsync`, but `ReadGroupAsync` never consults it: it always uses the declaration-only layout. For any UDT whose member declaration order in config differs from the controller's compiled layout, whole-UDT reads decode values from the wrong offsets and silently return plausible but wrong numbers.
**Recommendation:** In the read planner / `ReadGroupAsync`, prefer the cached `AbCipUdtShape` offsets (from `AbCipTemplateCache` / `FetchUdtShapeAsync`) when available, and only fall back to `AbCipUdtMemberLayout` declaration-order offsets when no template shape can be read. Even then, consider gating the declaration-only fast path behind an explicit opt-in flag, since it is correct only when the operator has hand-verified declaration order matches the controller.
**Resolution:** Resolved 2026-05-22 — the declaration-only whole-UDT grouping fast path is now gated behind the new opt-in `AbCipDriverOptions.EnableDeclarationOnlyUdtGrouping` flag (default `false`); `AbCipUdtReadPlanner.Build` forms no groups when it is off, so by default every UDT member reads per-tag instead of decoding at possibly-wrong declaration-order offsets. The richer CIP Template Object path remains the long-term fix.
### Driver.AbCip-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `AbCipDataType.cs:51-58`, `LibplctagTagRuntime.cs:47-49,53` |
| Status | Resolved |
**Description:** `ToDriverDataType` maps `LInt`/`ULInt` to `DriverDataType.Int32` (a TODO comment notes the gap) and `Dt` to `Int32`. But `LibplctagTagRuntime.DecodeValueAt` returns an actual `long` for `LInt`/`ULInt` (`_tag.GetInt64`, `(long)_tag.GetUInt64`). The address space is built declaring an Int32 node while the driver hands the server a boxed `long` `DataValueSnapshot.Value` at runtime: a mismatch between the declared OPC UA data type and the runtime value type. For `LInt` values exceeding Int32.MaxValue there is data loss if any consumer narrows it. `UDInt` is declared Int32 but decoded as `(int)_tag.GetUInt32`, so values above int.MaxValue wrap to negative.
**Recommendation:** Either add Int64/UInt32/UInt64 to `DriverDataType` and map correctly, or, until that lands, decode `LInt`/`ULInt` consistently with the declared `Int32` type (and document the truncation), and decode `UDInt` as a value that fits Int32 semantics. The declared type and the runtime value type must agree.
**Resolution:** Resolved 2026-05-22 — `ToDriverDataType` now maps `LInt` → `Int64`, `ULInt` → `UInt64`, `UDInt` → `UInt32` (all already present in `DriverDataType`); `DecodeValueAt` updated to return `uint`/`ulong` for UDInt/ULInt respectively so the declared type and runtime value agree. The `(int)` and `(long)` casts that caused truncation/wrap are removed.
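The wrap this finding describes is plain two's-complement narrowing, which a short worked example makes concrete. `UnsignedWrapSketch` is a hypothetical helper written only to show the arithmetic; it does not reproduce the driver's decode code.

```csharp
public static class UnsignedWrapSketch
{
    // Reinterprets an unsigned 32-bit value as signed, as an (int) cast does in an unchecked context.
    public static int NarrowToInt32(uint raw) => unchecked((int)raw);

    // Reinterprets an unsigned 64-bit value as signed, as a (long) cast does in an unchecked context.
    public static long NarrowToInt64(ulong raw) => unchecked((long)raw);
}
```

For a UDInt holding 3,000,000,000, `NarrowToInt32` yields 3,000,000,000 − 4,294,967,296 = −1,294,967,296, and `NarrowToInt64(ulong.MaxValue)` yields −1. Returning `uint` / `ulong` directly avoids both results.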
### Driver.AbCip-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `AbCipDriver.cs:124-141` |
| Status | Resolved |
**Description:** In `InitializeAsync`, when a `Structure` tag declares `Members`, the loop registers each fanned-out member into `_tagsByName` but the parent Structure tag itself is also left in `_tagsByName` (added at line 125 before the member check). A subsequent `ReadAsync` of the parent name routes through `ReadSingleAsync` then `DecodeValue(AbCipDataType.Structure, ...)` which returns `null` with `Good` status. A client reading the parent UDT node thus gets a Good/null snapshot rather than a fault or a structured value. Also, member registration does not check for name collisions: if two configured tags produce the same parent-dot-member key (or a member name collides with an independently-declared tag), the later silently overwrites the earlier with no diagnostic.
**Recommendation:** Decide the parent-Structure read contract explicitly: either do not register the bare parent name as a readable tag, or have the Structure read return a proper status. Add a duplicate-key check during `_tagsByName` population that throws an `InvalidOperationException` naming both colliding tags, consistent with the fail-fast validation `AbCipHostAddress` parsing already does.
**Resolution:** Resolved 2026-05-22 — The parent Structure tag remains in `_tagsByName` so the whole-UDT grouping planner (Driver.AbCip-003 fast path) and alarm projection can still find it. `ReadSingleAsync` now detects a direct read of a Structure-with-Members and returns `BadNotSupported` instead of Good/null, documenting that callers must address individual member paths. Both scalar and member fan-out registration perform a duplicate-key check that throws `InvalidOperationException` naming both colliding entries (fail-fast, consistent with `AbCipHostAddress` validation).
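A duplicate-key check of the kind described could be sketched as follows. `TagRegistrySketch.Register` and the `string`-valued dictionary are hypothetical simplifications of the `_tagsByName` population; the sketch relies only on `Dictionary<TKey, TValue>.TryAdd`, which returns `false` without modifying the dictionary when the key already exists.

```csharp
using System;
using System.Collections.Generic;

public static class TagRegistrySketch
{
    // Registers key -> source, throwing if another entry already claimed the key.
    public static void Register(Dictionary<string, string> tagsByName, string key, string source)
    {
        if (!tagsByName.TryAdd(key, source))
        {
            // Name both colliding entries so the operator can find the bad config rows.
            throw new InvalidOperationException(
                $"Tag name '{key}' is declared by both '{tagsByName[key]}' and '{source}'.");
        }
    }
}
```

Using the same helper for scalar tags and fanned-out members means a collision between `Motor.Speed` as a member and `Motor.Speed` as an independent tag is reported at initialization rather than resolved by last-writer-wins.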
### Driver.AbCip-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | OtOpcUa conventions |
| Location | `PlcTagHandle.cs:28-59`, `AbCipDriver.cs:806-807,832-833`, `LibplctagTagRuntime.cs:117` |
| Status | Resolved |
**Description:** `driver-specs.md` makes the SafeHandle-wrapped native handle a non-negotiable Tier-B protection ("Wrap every libplctag handle in a SafeHandle with finalizer calling plc_tag_destroy"). The repo ships `PlcTagHandle : SafeHandle` for this, but it is dead code: `ReleaseHandle` is a permanent no-op (the comment says the `plc_tag_destroy` P/Invoke "is deferred to PR 3", well past the commit under review), and `DeviceState.TagHandles` is never populated anywhere in the driver. The real native lifetime is delegated to the libplctag.NET `Tag` object's own `Dispose()`. The mandated finalizer-backed leak protection therefore does not exist: if a `LibplctagTagRuntime` is GC-collected without `Dispose` (owning thread crashes, exception bypasses the device dispose path), whether the native tag is freed depends entirely on whether the libplctag.NET `Tag` has its own finalizer, so the guarantee the design requires is not provided by this driver's code.
**Recommendation:** Either delete `PlcTagHandle` and `DeviceState.TagHandles` as misleading dead scaffolding and document that native lifetime is owned by the libplctag.NET `Tag` finalizer (verifying that `Tag` actually has one), or finish the intended design by making `LibplctagTagRuntime` hold a real `PlcTagHandle` with a working `ReleaseHandle` calling `plc_tag_destroy`.
**Resolution:** Resolved 2026-05-22 — `PlcTagHandle.cs` deleted; `DeviceState.TagHandles` removed from `DeviceState`; its `DisposeHandles` loop cleaned up. The class-level doc comment on `AbCipDriver` updated to document that native lifetime is owned by libplctag.NET `Tag.Dispose()` (called in `DisposeHandles`) with the library's own finalizer covering GC-collected instances. The two dead-code test methods for `PlcTagHandle` removed from `AbCipDriverTests`.
### Driver.AbCip-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `AbCipDriver.cs` (whole file), `AbCipAlarmProjection.cs`, `LibplctagTagRuntime.cs` |
| Status | Open |
**Description:** `CLAUDE.md` Library Preferences mandate Serilog with a rolling daily file sink. The driver has no logging at all: no `ILogger`/Serilog dependency is injected or used. Failure paths instead swallow exceptions into the `_health` string (`ReadSingleAsync`, `WriteAsync`, `FetchUdtShapeAsync` catch-all, `ProbeLoopAsync` empty catch, `AbCipAlarmProjection.RunPollLoopAsync` empty catch). An operator looking at server logs sees nothing for a probe loop failing every tick for hours, a template decode that silently returned null, or an alarm poll loop throwing every interval. The health surface carries only the last error message, so a transient error immediately overwrites a more important earlier one.
**Recommendation:** Inject an `ILogger` (Serilog) and log at least device init failures, per-call read/write transport errors (debounced), probe-loop failures, template-read failures, and alarm-poll-loop exceptions. The health surface is for state, not for the audit trail.
**Resolution:** _(open)_
### Driver.AbCip-008
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `AbCipDriver.cs:144-152`, `AbCipDriver.cs:169-183`, `AbCipDriver.cs:235-281` |
| Status | Resolved |
**Description:** Probe loops are started fire-and-forget (`_ = Task.Run(() => ProbeLoopAsync(state, ct), ct)`) and the resulting Task is never stored or awaited. `ShutdownAsync` cancels `state.ProbeCts`, then immediately disposes it, sets it null, and calls `state.DisposeHandles()` without waiting for `ProbeLoopAsync` to observe the cancellation and exit. Races: (1) the still-running probe loop may be mid-await against a `ProbeCts` that `ShutdownAsync` has already disposed, producing `ObjectDisposedException` on the loop thread; (2) `DisposeHandles` clears `Runtimes`/`ParentRuntimes` while a concurrent `ReadAsync`/`WriteAsync` from the alarm projection or a subscription poll could be iterating or adding to those plain `Dictionary` instances (not thread-safe), corrupting the dictionary or throwing; (3) the probe runtime created inside `ProbeLoopAsync` is never tracked by `DeviceState`, so `DisposeHandles` cannot dispose it; only the loop's own `finally` does, which may run after `ShutdownAsync` returns.
**Recommendation:** Store each probe Task on `DeviceState`; in `ShutdownAsync` cancel the CTS, then await `Task.WhenAll` (with a timeout) before disposing the CTS or the handles. Guard `Runtimes`/`ParentRuntimes` with a lock or switch to `ConcurrentDictionary`. Make `ShutdownAsync` idempotent and safe against in-flight `ReadAsync`/`WriteAsync`.
**Resolution:** Resolved 2026-05-22 — each probe loop's `Task` is stored on `DeviceState.ProbeTask`; `ShutdownAsync` now runs three phases (cancel every CTS, then await each probe Task with a 10s timeout, then dispose the CTS + handles) so the loop never touches a disposed CTS or cleared dictionary. `DeviceState.Runtimes` / `ParentRuntimes` are now `ConcurrentDictionary`, and `EnsureTagRuntimeAsync` / `EnsureParentRuntimeAsync` use `TryAdd` and dispose the losing concurrent creator instead of leaking it. `ShutdownAsync` stays idempotent (a second call sees the cleared `_devices`).
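The three-phase ordering in the resolution (cancel, bounded wait, dispose) can be sketched as below. `DeviceStateSketch` and `ShutdownSketch` are hypothetical reductions of the real types, and the handling of a timed-out wait is an assumption; `Task.WaitAsync(TimeSpan)` is a real API on .NET 6 and later and throws `TimeoutException` when the wait expires.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical reduction of DeviceState to the two members the shutdown path needs.
public sealed class DeviceStateSketch
{
    public CancellationTokenSource? ProbeCts { get; set; }
    public Task? ProbeTask { get; set; }
}

public static class ShutdownSketch
{
    public static async Task ShutdownAsync(
        IReadOnlyCollection<DeviceStateSketch> devices, TimeSpan timeout)
    {
        // Phase 1: signal every probe loop to stop.
        foreach (var device in devices) device.ProbeCts?.Cancel();

        // Phase 2: wait (bounded) for each loop to observe cancellation and exit.
        foreach (var device in devices)
        {
            if (device.ProbeTask is null) continue;
            try { await device.ProbeTask.WaitAsync(timeout); }
            catch (OperationCanceledException) { /* expected: the loop was cancelled */ }
            catch (TimeoutException) { /* loop did not exit in time; proceed anyway */ }
        }

        // Phase 3: only now dispose, so no loop touches a disposed CTS.
        foreach (var device in devices)
        {
            device.ProbeCts?.Dispose();
            device.ProbeCts = null;
            device.ProbeTask = null;
        }
    }
}
```

Because phase 3 nulls both members, a second call finds nothing to cancel or await, which matches the idempotence the resolution describes.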
### Driver.AbCip-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `AbCipDriver.cs:621-648`, `AbCipDriver.cs:591-614` |
| Status | Resolved |
**Description:** `EnsureTagRuntimeAsync` and `EnsureParentRuntimeAsync` are check-then-act on a non-thread-safe `Dictionary` (`device.Runtimes` / `device.ParentRuntimes`). `ReadAsync` is `IReadable` and may be invoked concurrently: the server read path, each polled subscription loop, and the alarm projection poll loop all call `ReadAsync` independently. Two concurrent `ReadAsync` calls that both miss the cache for the same tag both create a `LibplctagTagRuntime`, both initialize it, and both write into the dictionary; the loser leaks an initialized native tag (never disposed, since only the dictionary value is disposed at shutdown), and concurrent `Dictionary` mutation can throw or corrupt the bucket structure. `WriteBitInDIntAsync` serializes the parent via a per-parent `SemaphoreSlim`, but `EnsureParentRuntimeAsync` still runs the same unguarded check-then-act on the shared `ParentRuntimes` dict.
**Recommendation:** Use `ConcurrentDictionary` for `Runtimes` and `ParentRuntimes`, creating the runtime via `GetOrAdd` with a lazily-initialized factory, or guard the ensure path with a per-device lock / `SemaphoreSlim`. Ensure the losing creator runtime is disposed rather than leaked.
**Resolution:** Resolved 2026-05-22 — already addressed as part of the Driver.AbCip-008 fix: `DeviceState.Runtimes` and `ParentRuntimes` were switched to `ConcurrentDictionary`; both `EnsureTagRuntimeAsync` and `EnsureParentRuntimeAsync` use the `TryGetValue` → create → `TryAdd` → dispose-loser pattern so concurrent callers that both miss the cache produce exactly one live runtime and the losing creator is disposed rather than leaked.
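A minimal sketch of the `TryGetValue` → create → `TryAdd` → dispose-loser pattern named above. The interface and class names (`IRuntimeSketch`, `DemoRuntime`, `RuntimeCacheSketch`) are hypothetical and chosen for illustration; the real runtime types are not reproduced in this review.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical runtime abstraction used only for this sketch.
public interface IRuntimeSketch : IDisposable
{
    Task InitializeAsync();
}

// Trivial implementation so the pattern can be exercised without a PLC.
public sealed class DemoRuntime : IRuntimeSketch
{
    public bool Disposed { get; private set; }
    public Task InitializeAsync() => Task.CompletedTask;
    public void Dispose() => Disposed = true;
}

public static class RuntimeCacheSketch
{
    public static async Task<IRuntimeSketch> EnsureAsync(
        ConcurrentDictionary<string, IRuntimeSketch> cache,
        string key,
        Func<IRuntimeSketch> create)
    {
        if (cache.TryGetValue(key, out var existing)) return existing;

        var created = create();
        try
        {
            await created.InitializeAsync().ConfigureAwait(false);
        }
        catch
        {
            created.Dispose(); // never leak a half-initialized runtime
            throw;
        }

        if (cache.TryAdd(key, created)) return created;

        // Lost the race: another caller published first. Dispose ours, use theirs.
        created.Dispose();
        if (cache.TryGetValue(key, out var winner)) return winner;

        // The winner was removed in the meantime (for example by an eviction); start over.
        return await EnsureAsync(cache, key, create).ConfigureAwait(false);
    }
}
```

Compared with `GetOrAdd` and a lazy factory, this variant may initialize a runtime that is immediately thrown away, but it never holds a lock across the asynchronous initialization.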
### Driver.AbCip-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `AbCipDriver.cs:621-648`, `AbCipDriver.cs:346-391` |
| Status | Resolved |
**Description:** Once `EnsureTagRuntimeAsync` successfully creates and initializes a `LibplctagTagRuntime`, that runtime is cached for the lifetime of the device and never re-created on failure. If the underlying native tag enters a permanently-bad state (connection dropped, controller rebooted, tag handle invalidated by a PLC program download), every subsequent `ReadAsync`/`WriteAsync` reuses the same dead handle and returns errors forever. The probe loop does tear down and recreate its runtime after a failure, but the read/write path has no equivalent recovery; only a full `ReinitializeAsync` (itself broken, see Driver.AbCip-001) clears the cache. The normal data path should self-heal from a transient handle fault without operator-driven reinitialize.
**Recommendation:** On a non-zero libplctag status or transport exception in `ReadSingleAsync`/`ReadGroupAsync`/`WriteAsync`, evict the offending runtime from `device.Runtimes` (and dispose it) so the next call re-creates and re-initializes it. Mirror the probe loop's recreate-on-failure behavior.
**Resolution:** Resolved 2026-05-22 — added `EvictRuntime(device, tagName)` helper that calls `ConcurrentDictionary.TryRemove` + disposes the evicted instance; called from `ReadSingleAsync`, `ReadGroupAsync`, and `WriteAsync` on both non-zero libplctag status and transport exceptions (type/value-conversion exceptions are not transport faults and do not evict). The next read/write for the affected tag re-runs `EnsureTagRuntimeAsync`, which creates and initializes a fresh handle, mirroring the probe loop's recreate-on-failure behaviour.
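The eviction helper described here amounts to a remove-then-dispose on the concurrent cache. A hedged sketch, with the runtime modelled as a plain `IDisposable` and the signature simplified (the actual helper takes the device and tag name):

```csharp
using System;
using System.Collections.Concurrent;

public static class EvictionSketch
{
    // Removes the cached runtime for tagName and disposes it.
    // Returns false when nothing was cached, so repeated calls are harmless.
    public static bool EvictRuntime(
        ConcurrentDictionary<string, IDisposable> runtimes, string tagName)
    {
        if (!runtimes.TryRemove(tagName, out var evicted)) return false;
        evicted.Dispose();
        return true;
    }
}
```

Because `TryRemove` hands the instance to exactly one caller, two threads that fail on the same tag at the same time cannot both dispose it.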
### Driver.AbCip-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `AbCipDriver.cs:144-152`, `AbCipDriverOptions.cs:131-143` |
| Status | Open |
**Description:** `InitializeAsync` only starts probe loops when `_options.Probe.Enabled` is true AND `Probe.ProbeTagPath` is non-blank. When `Probe.Enabled` is true (the default) but `ProbeTagPath` is null (also the default; the doc comment says "PR 8 wires this up"), no probe runs at all and the device `HostState` stays `HostState.Unknown` forever. `GetHostStatuses()` then reports every device as Unknown indefinitely with no warning. An operator who enables the probe but does not set a probe tag gets a silently inert health surface rather than an error or a log line.
**Recommendation:** When `Probe.Enabled` is true but no `ProbeTagPath` is configured, either fail initialization with a clear message, fall back to a family-default probe tag (the doc comment's stated intent), or at minimum log a warning that the probe is enabled-but-inert.
**Resolution:** _(open)_
### Driver.AbCip-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `LibplctagTemplateReader.cs:15-35`, `AbCipDriver.cs:88-92` |
| Status | Open |
**Description:** `LibplctagTemplateReader` is created per `FetchUdtShapeAsync` call, and each call constructs a fresh libplctag `Tag` for the @udt pseudo-tag, initializes it (a CIP connection handshake), reads, and disposes it. There is no reuse of the `Tag` across template reads for the same device: every UDT shape fetch pays a full connect/init cost. `AbCipTemplateCache` caches the decoded shape so this only bites on the first fetch of each type, but discovery of a UDT-heavy controller still does one connect per type. The same per-call `Tag` construction applies to `LibplctagTagEnumerator`.
**Recommendation:** Acceptable for a low-frequency discovery path, but consider pooling/reusing a single @udt-capable `Tag` per device for the duration of a discovery run, or document that the per-type connect cost is accepted.
**Resolution:** _(open)_
### Driver.AbCip-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `AbCipDriverOptions.cs:70-73`, `PlcFamilies/AbCipPlcFamilyProfile.cs:13-19`, `LibplctagTagRuntime.cs:16-27` |
| Status | Open |
**Description:** `driver-specs.md` specifies the AB CIP per-device connection settings as discrete fields: Host, Path, PlcType, TimeoutMs, AllowPacking, ConnectionSize. The implementation instead collapses host + path into a single opaque ab:// URL string and exposes `PlcFamily` (which adds GuardLogix, not in the spec table). AllowPacking and ConnectionSize from the spec are not configurable per device: `AbCipPlcFamilyProfile` hard-codes `SupportsRequestPacking` and `DefaultConnectionSize` per family, and `LibplctagTagRuntime` never passes a connection-size or packing attribute to the `Tag` (it is constructed with only Gateway/Path/PlcType/Protocol/Name/Timeout). The family profile `DefaultConnectionSize`/`SupportsRequestPacking`/`MaxFragmentBytes` fields are computed but never applied to the wire layer: dead configuration.
**Recommendation:** Either update `driver-specs.md` to describe the actual ab:// host-address model and the family-profile approach, and wire the profile ConnectionSize/packing values through to the libplctag `Tag` attributes; or expose AllowPacking/ConnectionSize as per-device options per the spec.
**Resolution:** _(open)_
### Driver.AbCip-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipStatusMapperTests.cs:28-40` |
| Status | Resolved |
**Description:** `AbCipStatusMapperTests.MapLibplctagStatus_maps_known_codes` asserts the mapper against the same wrong integer constants (-5, -7, -14, -16, -17) the production code uses (see Driver.AbCip-002). The test locks in the bug rather than catching it, giving false confidence that libplctag error mapping is correct. There is no test that drives an actual libplctag `Status` enum value through `LibplctagTagRuntime.GetStatus()` plus `MapLibplctagStatus` end-to-end. Separately, the broken `ReinitializeAsync` config-discard behavior (Driver.AbCip-001) and the declaration-order whole-UDT decode risk (Driver.AbCip-003) have no test that would fail when those defects are present: `AbCipDriverWholeUdtReadTests` only exercises a UDT whose declaration order happens to match a simple alignment layout.
**Recommendation:** Rewrite the libplctag-status test to use the real `libplctag.Status` enum members and their documented integer values. Add a test that `ReinitializeAsync` with a changed config JSON actually applies the change (or asserts the documented immutability contract). Add a whole-UDT decode test where the controller's compiled layout differs from declaration order.
**Resolution:** Resolved 2026-05-22 — status mapper test already uses real `Status` enum members (fixed with Driver.AbCip-002); `ReinitializeAsync` config-change coverage already added with Driver.AbCip-001. Added to `AbCipDriverCodeReviewRegressionTests`: three tests for 004 (LInt/ULInt/UDInt type-mapping theory + UDInt decoded-as-uint assertion), three tests for 005 (Structure parent read returns BadNotSupported, duplicate scalar key throws, member-collision-with-independent-tag throws), and one test for 010 (eviction on bad status means next read creates a fresh handle). `AbCipDriverTests.AbCipDataType_maps_atomics_to_driver_types` extended with LInt/ULInt/UDInt assertions.
### Driver.AbCip-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `AbCipDriver.cs:9-11`, `PlcTagHandle.cs:23-27,53-58`, `AbCipTemplateCache.cs:12-15`, `IAbCipTagEnumerator.cs:6-11`, `AbCipDriverOptions.cs:21` |
| Status | Open |
**Description:** Numerous comments are stale relative to the commit under review. `AbCipDriver.cs:9-11` says the driver "Implements IDriver only for now" with capabilities shipping "in subsequent PRs (3-8)" while the class already implements all of them. `PlcTagHandle.cs` says the plc_tag_destroy P/Invoke "is deferred to PR 3 ... PR 2 ships the lifetime scaffold + tests only" and `ReleaseHandle` "is a no-op", which now reads as a permanent unfinished-work marker (see Driver.AbCip-006). `AbCipTemplateCache.cs:12-15` says "Template shape read ... lands with PR 6 ... no reader writes to it yet" while `CipTemplateObjectDecoder` and `LibplctagTemplateReader` both exist and `FetchUdtShapeAsync` writes to the cache. `IAbCipTagEnumerator.cs:6-11` says the enumerator "Defaults to EmptyAbCipTagEnumeratorFactory" while the production default is `LibplctagTagEnumeratorFactory`. `AbCipDriverOptions.cs:21` says "AB discovery lands in PR 5", already shipped. `StyleGuide.md` explicitly says not to leave stale coming-soon notes.
**Recommendation:** Sweep the module for PR-N forward references and "lands in PR X" notes that have been delivered; update them to describe present behavior. Where a comment marks genuinely unfinished work (e.g. `PlcTagHandle.ReleaseHandle`), convert it to a tracked TODO with an issue reference rather than a PR-number milestone.
**Resolution:** _(open)_
@@ -0,0 +1,213 @@
# Code Review — Driver.AbLegacy.Cli
| Field | Value |
|---|---|
| Module | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Cli` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 6 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.AbLegacy.Cli-001, Driver.AbLegacy.Cli-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Driver.AbLegacy.Cli-003 |
| 4 | Error handling & resilience | Driver.AbLegacy.Cli-001, Driver.AbLegacy.Cli-004 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | Driver.AbLegacy.Cli-005 |
| 8 | Code organization & conventions | Driver.AbLegacy.Cli-006 |
| 9 | Testing coverage | Driver.AbLegacy.Cli-007 |
| 10 | Documentation & comments | Driver.AbLegacy.Cli-002, Driver.AbLegacy.Cli-005 |
## Findings
### Driver.AbLegacy.Cli-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `Commands/WriteCommand.cs:46`, `Commands/WriteCommand.cs:62-72` |
| Status | Resolved |
**Description:** `WriteCommand.ExecuteAsync` calls `ParseValue(Value, DataType)` at
line 46, *before* the `try` block and outside any catch. `ParseValue` uses
`short.Parse` / `int.Parse` / `float.Parse`, which throw `FormatException` on
malformed input (`-v abc`) and `OverflowException` on out-of-range input
(`-t Int -v 99999`). Only the `Bit` branch and the unsupported-type branch raise
the CliFx `CommandException` that the framework renders as a clean one-line error
with a non-zero exit code. For every numeric type a bad `--value` therefore
escapes as an unhandled `FormatException`/`OverflowException`, which CliFx prints
as a raw stack trace — an operator-hostile failure mode for a tool whose whole
purpose is ad-hoc operator use. The module's own test
`ParseValue_non_numeric_for_numeric_types_throws` confirms the raw `FormatException`
leaks. The driver's `WriteAsync` has dedicated catch arms for `FormatException`
(`BadTypeMismatch`) and `OverflowException` (`BadOutOfRange`), but the CLI never
reaches the driver because the parse happens first.
**Recommendation:** Wrap the numeric parses so a parse failure surfaces as a
`CliFx.Exceptions.CommandException` with a message naming the offending value and
type (mirroring the existing `Bit` and unsupported-type branches). Either catch
`FormatException`/`OverflowException` inside `ParseValue` and rethrow as
`CommandException`, or use `TryParse` and throw `CommandException` on failure.
**Resolution:** Resolved 2026-05-22 — wrapped numeric parses in `ParseValue` with `try/catch` for `FormatException`/`OverflowException`, rethrowing as `CommandException` with a message naming the offending value and type; updated test to assert `CommandException` and added overflow regression test.
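The recommendation's `TryParse` alternative could look roughly like the sketch below. It is not the fix that was applied (the resolution used `try/catch`). `CliInputException` is a stand-in for `CliFx.Exceptions.CommandException` so the sketch compiles without the package, and the `LegacyTypeSketch` enum and its member names are assumptions made for illustration.

```csharp
using System;
using System.Globalization;

// Stand-in for the CLI framework's command exception (hypothetical name).
public sealed class CliInputException : Exception
{
    public CliInputException(string message) : base(message) { }
}

// Hypothetical subset of the data types the CLI accepts.
public enum LegacyTypeSketch { Int, Long, Float }

public static class ParseValueSketch
{
    public static object ParseValue(string raw, LegacyTypeSketch type)
    {
        var culture = CultureInfo.InvariantCulture;
        switch (type)
        {
            // Int is a 16-bit word, so out-of-range input such as 99999 fails here.
            case LegacyTypeSketch.Int
                when short.TryParse(raw, NumberStyles.Integer, culture, out var s):
                return s;
            case LegacyTypeSketch.Long
                when int.TryParse(raw, NumberStyles.Integer, culture, out var i):
                return i;
            case LegacyTypeSketch.Float
                when float.TryParse(raw, NumberStyles.Float, culture, out var f):
                return f;
            default:
                // Malformed and out-of-range input both land here with one clean message.
                throw new CliInputException($"Value '{raw}' is not a valid {type}.");
        }
    }
}
```

With this shape, `-t Int -v 99999` and `-v abc` both produce a single-line error naming the value and the type instead of a stack trace.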
### Driver.AbLegacy.Cli-002
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `Commands/WriteCommand.cs:27-29`, `Program.cs:6-9` |
| Status | Open |
**Description:** The `--value` option help text states "booleans accept
true/false/1/0", but `ParseBool` (`WriteCommand.cs:74-80`) and the error message
also accept `on/off` and `yes/no`, and `DriverClis.md` documents the full
`true/false/1/0/yes/no/on/off` set as the shared CLI contract. The help text
under-documents the accepted aliases, so an operator reading `--help` will not
discover `on`/`off`/`yes`/`no`. Minor, but it makes the inline help inconsistent
with both the code and the design doc.
**Recommendation:** Extend the `--value` description to list the full alias set,
matching the wording used elsewhere (e.g. "booleans accept
true/false, 1/0, on/off, yes/no").
**Resolution:** _(open)_
### Driver.AbLegacy.Cli-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `Commands/SubscribeCommand.cs:47-53` |
| Status | Open |
**Description:** The `OnDataChange` handler calls `console.Output.WriteLine(line)`
(the synchronous overload) directly from the `PollGroupEngine` poll thread. The
poll engine raises change events from a background timer/loop thread, so two
ticks that fire close together can interleave writes on the shared `TextWriter`.
`SnapshotFormatter` builds the whole line into a single string before the call,
so a line is unlikely to be torn mid-token, but there is no synchronisation
guaranteeing that the background-thread writes do not interleave with the
`await console.Output.WriteLineAsync(...)` "Subscribed to ..." line on the command
thread, nor with each other. This is the same pattern as the AbCip CLI, so it is
a shared low-severity issue, not unique to this module.
**Recommendation:** Serialise console writes from the event handler — e.g. funnel
change events through a `Channel<string>` drained by a single consumer task, or
guard the `WriteLine` with a lock. At minimum, document that the interleaving is
accepted because output is human-facing and line-buffered.
**Resolution:** _(open)_
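One way the `Channel<string>` option in the recommendation might be realised is sketched below. `SerializedConsoleSketch` is a hypothetical helper, not part of the CLI; it assumes the output is any `TextWriter` and that dropping the ordering guarantee between different producers is acceptable.

```csharp
using System;
using System.IO;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class SerializedConsoleSketch : IAsyncDisposable
{
    private readonly Channel<string> _lines = Channel.CreateUnbounded<string>(
        new UnboundedChannelOptions { SingleReader = true });
    private readonly Task _pump;

    public SerializedConsoleSketch(TextWriter output)
    {
        // Single consumer: the only code that ever writes to the output.
        _pump = Task.Run(async () =>
        {
            await foreach (var line in _lines.Reader.ReadAllAsync())
            {
                await output.WriteLineAsync(line).ConfigureAwait(false);
            }
        });
    }

    // Safe to call from any thread, including a poll engine callback thread.
    public void Post(string line) => _lines.Writer.TryWrite(line);

    public async ValueTask DisposeAsync()
    {
        // Stop accepting lines, then wait for the pump to drain what is queued.
        _lines.Writer.TryComplete();
        await _pump.ConfigureAwait(false);
    }
}
```

The change handler and the command thread would both call `Post`, so every line reaches the console through the same consumer and cannot interleave with another write.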
### Driver.AbLegacy.Cli-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `Commands/ProbeCommand.cs:37-56`, `Commands/ReadCommand.cs:39-50`, `Commands/WriteCommand.cs:48-59`, `Commands/SubscribeCommand.cs:41-76` |
| Status | Open |
**Description:** Every command does `await using var driver = new AbLegacyDriver(...)`
*and* an explicit `await driver.ShutdownAsync(...)` in the `finally`. `AbLegacyDriver`
`DisposeAsync` itself calls `ShutdownAsync`, so the driver is shut down twice on the
normal path. `ShutdownAsync` is written to be idempotent (it clears `_devices` /
`_tagsByName` and re-enters cleanly on an empty state), so this is not a crash, but
the double teardown is redundant and slightly obscures intent — a reader has to
confirm idempotency to be sure it is safe. The `await using` already guarantees
cleanup on every exit path including exceptions.
**Recommendation:** Drop either the `await using` or the explicit
`finally { await driver.ShutdownAsync(...) }` in each command. Keeping the explicit
`finally` and using a plain `var driver` (no `await using`) is the clearer choice,
since the commands deliberately pass `CancellationToken.None` to shutdown so teardown
is not cut short by a cancelled `ct`.
**Resolution:** _(open)_
### Driver.AbLegacy.Cli-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `Commands/SubscribeCommand.cs:23-25`, `docs/Driver.AbLegacy.Cli.md:94-96` |
| Status | Open |
**Description:** The subscribe command's interval option is `--interval-ms`
(default 1000). `docs/Driver.AbLegacy.Cli.md` shows the subscribe example as
`otopcua-ablegacy-cli subscribe ... -i 500`, which works because of the short
alias `'i'`, but the doc never names the long form `--interval-ms` or states the
1000 ms default. Separately, the equivalent AbCip CLI help text notes
"PollGroupEngine floors sub-250ms values", whereas the AbLegacy `--interval-ms`
description omits that flooring caveat, so an operator passing `-i 100` against
AbLegacy gets no warning that the engine will floor it. The behaviour is identical
(same `PollGroupEngine`) but the documented contract drifts between the two CLIs.
**Recommendation:** Add the sub-250 ms flooring note to the AbLegacy
`--interval-ms` description for parity with the AbCip CLI, and mention the
`--interval-ms` long form + 1000 ms default in `docs/Driver.AbLegacy.Cli.md`.
**Resolution:** _(open)_
### Driver.AbLegacy.Cli-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `Commands/ProbeCommand.cs:20-22` |
| Status | Open |
**Description:** `ProbeCommand` declares its `--type` option with no short alias,
while `ReadCommand`, `WriteCommand`, and `SubscribeCommand` all declare `--type`
with the short alias `'t'`. `ProbeCommand` also gives `--address` the alias `'a'`,
matching the other commands, so the `--type` omission is an inconsistency rather
than a deliberate design choice. An operator who learns `-t` on `read` will find
it silently rejected on `probe`.
**Recommendation:** Add the `'t'` short alias to `ProbeCommand`'s `--type` option
for consistency with the other three commands. (The AbCip CLI `ProbeCommand` has
the same omission, so a cross-CLI sweep is worthwhile.)
**Resolution:** _(open)_
### Driver.AbLegacy.Cli-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Cli.Tests/WriteCommandParseValueTests.cs` |
| Status | Open |
**Description:** The only test file in the CLI test project covers
`WriteCommand.ParseValue` and `ReadCommand.SynthesiseTagName`. Two behaviours that
are pure logic (testable without a device) are uncovered:
(1) `AbLegacyCommandBase.BuildOptions` — that it sets `Probe.Enabled = false`,
populates `Devices` from `Gateway`/`PlcType`, and forwards the tag list; a
regression here silently changes every command's behaviour.
(2) the out-of-range numeric path for `ParseValue` (`short.Parse` overflow,
`int.Parse` overflow) — `ParseValue_non_numeric_for_numeric_types_throws` asserts
`FormatException` for non-numeric input but nothing asserts the overflow path,
which is exactly the path that escapes uncaught per finding
Driver.AbLegacy.Cli-001. `BuildOptions` is reachable via `InternalsVisibleTo`
(the test assembly is already granted access).
**Recommendation:** Add tests for `BuildOptions` (probe disabled, device shape,
tag passthrough) and an overflow-input test for `ParseValue` so the fix for
Driver.AbLegacy.Cli-001 is locked in by a regression test.
**Resolution:** _(open)_
@@ -0,0 +1,365 @@
# Code Review - Driver.AbLegacy
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 3 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.AbLegacy-001, Driver.AbLegacy-002, Driver.AbLegacy-003, Driver.AbLegacy-004 |
| 2 | OtOpcUa conventions | Driver.AbLegacy-005 |
| 3 | Concurrency & thread safety | Driver.AbLegacy-006, Driver.AbLegacy-007, Driver.AbLegacy-008 |
| 4 | Error handling & resilience | Driver.AbLegacy-009, Driver.AbLegacy-010 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Driver.AbLegacy-011 |
| 7 | Design-document adherence | Driver.AbLegacy-012 |
| 8 | Code organization & conventions | Driver.AbLegacy-013 |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | No issues found |
## Findings
### Driver.AbLegacy-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `AbLegacyAddress.cs:54`, `AbLegacyDriver.cs:368-374` |
| Status | Resolved |
**Description:** `AbLegacyAddress.TryParse` accepts a `BitIndex` of `0..31` for every
file type. A PCCC N-file word is a signed 16-bit integer, so valid bit indices are
`0..15`. When a tag is `Bit`-typed against an N-file with a bit suffix of `16..31`
(e.g. `N7:0/20`), `WriteBitInWordAsync` reads the parent as `AbLegacyDataType.Int`
(16-bit), then computes `current | (1 << bit)` / `current & ~(1 << bit)` with `bit`
up to 31. `1 << 20` produces a value outside the 16-bit range, the result is cast
`(short)updated`, and the high bits are silently truncated - the wrong bit (or no
bit) is written and no error is surfaced. The mask arithmetic is also done on a
sign-extended `int`. For L-file (32-bit) bits the parent is still read as `Int`
(16-bit), so bits 16..31 of a long can never be addressed correctly.
**Recommendation:** Validate `BitIndex` against the parent word width during parse or
in `WriteBitInWordAsync` - reject bit > 15 for N/B/I/O/S files and bit > 31 for L
files. For bit-in-word RMW against L files, read the parent as `Long`. Mask the
read-back value to the word width before applying the bit operation.
**Resolution:** Resolved 2026-05-22 — `AbLegacyAddress.TryParse` now range-checks the
bit index against per-file word width (0..15 for N/B/I/O/S/A, 0..31 for L, no bits on
F); `WriteBitInWordAsync` reads/writes an L-file parent as 32-bit `Long` and masks the
RMW arithmetic to the native width so sign-extension can no longer corrupt high bits.
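To make the width-masking point concrete, here is a small sketch of a bit-in-word update that rejects out-of-range bit indices and strips sign-extension before applying the mask. `BitInWordSketch.ApplyBit` is a hypothetical helper written for this illustration; it is not the driver's `WriteBitInWordAsync`.

```csharp
using System;

public static class BitInWordSketch
{
    // current:   the parent word as decoded (possibly sign-extended to int).
    // widthBits: 16 for N/B/I/O/S-style words, 32 for L-file words.
    // Returns the value to write back, sign-correct for the given width.
    public static int ApplyBit(int current, int bit, bool value, int widthBits)
    {
        if (widthBits != 16 && widthBits != 32)
            throw new ArgumentOutOfRangeException(nameof(widthBits));
        if (bit < 0 || bit >= widthBits)
            throw new ArgumentOutOfRangeException(
                nameof(bit), $"Bit index must be 0..{widthBits - 1}.");

        unchecked
        {
            uint widthMask = widthBits == 32 ? 0xFFFFFFFFu : 0xFFFFu;
            uint word = (uint)current & widthMask;   // drop any sign-extension
            uint bitMask = 1u << bit;
            uint updated = (value ? word | bitMask : word & ~bitMask) & widthMask;

            // Re-interpret as a signed value of the native width.
            return widthBits == 16 ? (short)(ushort)updated : (int)updated;
        }
    }
}
```

For a 16-bit word holding `0x8000` (decoded as -32768), setting bit 0 yields `0x8001`, which is -32767 as a signed 16-bit value; a request for bit 20 on the same word is rejected instead of being silently truncated.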
### Driver.AbLegacy-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `AbLegacyDriver.cs:368` |
| Status | Resolved |
**Description:** In `WriteBitInWordAsync` the parent word is decoded with
`Convert.ToInt32(parentRuntime.DecodeValue(AbLegacyDataType.Int, ...))`.
`LibplctagLegacyTagRuntime.DecodeValue` for `AbLegacyDataType.Int` returns
`(int)_tag.GetInt16(0)` - a sign-extended `int`. When the current word has its high
bit set (value 0x8000..0xFFFF, decoded as a negative `int`), the subsequent
`(short)updated` cast re-encodes the low 16 bits correctly, but `current | (1 << bit)`
is performed on the sign-extended value. The result is bit-correct for the low 16
bits only because the cast preserves them; any future change to widen the mask range
will break silently. Combined with Driver.AbLegacy-001 this is a latent correctness hazard.
**Recommendation:** Mask `current` to `current & 0xFFFF` before the bit operation and
operate on an explicitly 16-bit value, or document the reliance on low-16-bit
preservation explicitly.
**Resolution:** Resolved 2026-05-22 — `current & widthMask` already applied in `WriteBitInWordAsync` by the -001 fix; no additional change needed.
### Driver.AbLegacy-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `AbLegacyAddress.cs:62-95` |
| Status | Resolved |
**Description:** `TryParse` does not reject several malformed PCCC addresses that the
XML docs imply are invalid:
- A sub-element and a bit index together (`T4:0.ACC/2`) parse successfully even
though no PCCC element supports both.
- I/O/S files with a file number (`I3:0`, `S2:1`) parse successfully - I/O and S are
single-letter files with no file number per the doc comment, but the parser only
requires "letter then optional digits".
- B-file addresses with a sub-element (`B3:0.DN`) parse successfully.
`ToLibplctagName()` re-emits whatever was parsed, so a malformed address is passed
through to libplctag rather than rejected early with a clear error.
**Recommendation:** Tighten the parser: reject sub-element + bit-index combinations,
reject file numbers on I/O/S, and restrict which file letters may carry a sub-element
(T/C/R only). Add unit coverage for the rejection cases.
**Resolution:** Resolved 2026-05-22 — `TryParse` now rejects sub-element+bit-index combinations, file numbers on I/O/S files, and sub-elements on non-T/C/R files; unit tests added in `AbLegacyAddressTests`.
### Driver.AbLegacy-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `LibplctagLegacyTagRuntime.cs:36-37` |
| Status | Resolved |
**Description:** `DecodeValue` for `AbLegacyDataType.Bit` with `bitIndex == null`
returns `_tag.GetInt8(0) != 0`. A bit-file element (`B3:0/0`) is a single bit inside
a 16-bit word; reading only the low byte (`GetInt8(0)`) means a `Bit` tag whose live
bit sits in bits 8..15 of the word, or a B-file element addressed without an explicit
bit suffix, decodes incorrectly. The driver passes `parsed.ToLibplctagName()` which
preserves the `/bit` suffix, so libplctag resolves the bit when a suffix is present -
but a `Bit`-typed tag configured with an address that has no `/bit` suffix (e.g.
`B3:0`) silently decodes the wrong thing.
**Recommendation:** For `Bit` with no `bitIndex`, decide explicitly: either require a
bit suffix on `Bit`-typed tags (validate in `CreateInstance`/`DiscoverAsync`) or
decode the full 16-bit word and test bit 0.
**Resolution:** Resolved 2026-05-22 — `DecodeValue` for `Bit` with no `bitIndex` now reads the full 16-bit word via `GetInt16(0)` and tests bit 0, avoiding the silent half-word truncation from `GetInt8`.
### Driver.AbLegacy-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `AbLegacyDriver.cs` (whole file) |
| Status | Open |
**Description:** The driver uses no `ILogger`/Serilog at all. Probe-loop failures,
runtime initialisation failures, libplctag non-zero statuses, and read/write
exceptions are folded into `DriverHealth.Detail` strings but never logged. CLAUDE.md
names Serilog with a rolling daily file sink as the logging library. The complete
absence of structured logging makes field diagnosis of a PCCC comms problem (timeout
vs route failure vs wrong PLC family) rely entirely on a single overwritten `Detail`
string that the next read or write immediately clobbers.
**Recommendation:** Inject `ILogger<AbLegacyDriver>` (optional, like `tagFactory`) and
log probe transitions, runtime-init failures, and the first occurrence of a non-zero
libplctag status per device.
**Resolution:** _(open)_
### Driver.AbLegacy-006
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `AbLegacyDriver.cs:107-158`, `AbLegacyDriver.cs:162-234`, `LibplctagLegacyTagRuntime.cs` |
| Status | Resolved |
**Description:** A per-tag `IAbLegacyTagRuntime` (wrapping a single libplctag `Tag`)
is cached in `DeviceState.Runtimes` and reused. `ReadAsync` (called directly by the
server read path) and the `PollGroupEngine` poll loop (which also calls `ReadAsync`
via the reader delegate) can run concurrently, and two poll subscriptions covering
the same tag run on independent background tasks. All of them resolve the same
`Tag` instance through `EnsureTagRuntimeAsync` and call `runtime.ReadAsync` /
`GetStatus` / `DecodeValue` with no synchronisation. A libplctag `Tag` is not safe
for concurrent operations on the same handle: an interleaved Read/GetStatus/DecodeValue
from two threads can read a value mid-update or observe a status that belongs to the
other operation. `WriteAsync` shares the same runtime dictionary and compounds the
hazard. Only the bit-in-word RMW path is serialised (per-parent `SemaphoreSlim`).
**Recommendation:** Serialise all operations against a given runtime - a per-runtime
`SemaphoreSlim`, or a per-device read lock - so no two threads touch the same `Tag`
handle concurrently.
**Resolution:** Resolved 2026-05-22 — added a per-runtime `SemaphoreSlim`
(`DeviceState.GetRuntimeLock`, keyed by tag name); `ReadAsync` and `WriteAsync` now
hold it around the whole Read→GetStatus→Decode / Encode→Write→GetStatus sequence so the
shared libplctag `Tag` handle is never touched by two threads at once.
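The per-runtime lock described above can be illustrated with a keyed `SemaphoreSlim` that wraps the whole read or write sequence. This is a sketch under assumptions: `RuntimeLocksSketch` and `RunExclusiveAsync` are hypothetical names, and only `GetRuntimeLock` echoes an identifier from the resolution.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class RuntimeLocksSketch
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _locks =
        new ConcurrentDictionary<string, SemaphoreSlim>();

    // One semaphore per tag name. GetOrAdd may construct a spare instance
    // under contention, but only one is ever published and used.
    public SemaphoreSlim GetRuntimeLock(string tagName) =>
        _locks.GetOrAdd(tagName, _ => new SemaphoreSlim(1, 1));

    // Runs the whole operation (for example Read, GetStatus, Decode) while
    // holding the tag's lock, so no other caller touches the same handle.
    public async Task<T> RunExclusiveAsync<T>(
        string tagName, Func<Task<T>> operation, CancellationToken ct = default)
    {
        var gate = GetRuntimeLock(tagName);
        await gate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            return await operation().ConfigureAwait(false);
        }
        finally
        {
            gate.Release();
        }
    }
}
```

Keying the lock by tag name keeps reads of unrelated tags parallel while serialising every sequence that shares one handle.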
### Driver.AbLegacy-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `AbLegacyDriver.cs:411-438`, `AbLegacyDriver.cs:386-409` |
| Status | Resolved |
**Description:** `EnsureTagRuntimeAsync` and `EnsureParentRuntimeAsync` are
check-then-act: `device.Runtimes.TryGetValue(...)` then, after `await
runtime.InitializeAsync`, `device.Runtimes[def.Name] = runtime`. `Dictionary` is not
thread-safe, and two concurrent callers for the same tag (read + poll, or two poll
loops) both miss the lookup, both Create + InitializeAsync a runtime, and both write
the dictionary. One runtime is overwritten and leaked - `DisposeRuntimes` only
disposes what is currently in the dict - and concurrent `Dictionary` writes can
corrupt internal state. `ParentRuntimes` has the identical pattern.
**Recommendation:** Replace the runtime caches with `ConcurrentDictionary` and use
`GetOrAdd`, or guard runtime creation under a per-device lock. Ensure the losing
runtime of any race is disposed.
**Resolution:** Resolved 2026-05-22 — `Runtimes` and `ParentRuntimes` changed to `ConcurrentDictionary`; `EnsureTagRuntimeAsync` and `EnsureParentRuntimeAsync` now hold a per-key `GetCreationLock` semaphore around the double-checked create+initialize+store sequence so exactly one runtime is created per key and no race-loser is leaked.
### Driver.AbLegacy-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `AbLegacyDriver.cs:21`, `AbLegacyDriver.cs:138-146`, `AbLegacyDriver.cs:216-229` |
| Status | Resolved |
**Description:** `_health` is a plain non-volatile reference field mutated from
`ReadAsync`, `WriteAsync` (both can run on multiple threads / poll loops) and
`InitializeAsync`/`ShutdownAsync`, and read by `GetHealth()` from yet another thread.
There is no lock, no `volatile`, and no `Interlocked` exchange. The record reference
assignment is atomic, but without a memory barrier a reader can observe a stale
`_health` indefinitely, and concurrent writers race so a `Healthy` write from one
successful read can clobber a `Degraded` write from a concurrent failing read.
`GetHealth()` may therefore report `Healthy` while reads are persistently failing.
**Recommendation:** Mark `_health` volatile, or funnel health transitions through a
lock / `Interlocked.Exchange`. Consider only downgrading on failure and upgrading on a
successful poll so a single failed read does not flap the surface.
**Resolution:** Resolved 2026-05-22 — `_health` marked `volatile`; memory barrier comment documents the acquire/release ordering guarantee.
### Driver.AbLegacy-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `AbLegacyDriver.cs:41-74` |
| Status | Resolved |
**Description:** `InitializeAsync` starts probe loops with `Task.Run` inside the try
block. If `InitializeAsync` fails - or is re-entered - after some probe loops are
already started, the catch only sets `_health = Faulted` and rethrows; it does not
cancel `state.ProbeCts`, dispose runtimes, or clear `_devices`. A caller that catches
the exception and retries via `ReinitializeAsync` is covered (it calls `ShutdownAsync`
first), but a caller that catches and abandons the driver leaves orphaned probe tasks
and `CancellationTokenSource`s alive holding libplctag handles. Separately,
`ProbeLoopAsync` never escalates a permanently-unreachable device beyond `Stopped`.
**Recommendation:** On the catch path in `InitializeAsync`, run the same teardown as
`ShutdownAsync` (cancel probe CTSs, dispose runtimes, clear dictionaries) before
rethrowing, so a failed initialise leaves no live background work.
**Resolution:** Resolved 2026-05-22 — `InitializeAsync` catch block now cancels and disposes probe CTSs, calls `DisposeRuntimes`, and clears `_devices`/`_tagsByName` before rethrowing, leaving no orphaned background tasks or handles.
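The teardown-on-failure shape described above might look like the sketch below. `ProbeHost` and its members are hypothetical and much smaller than the real driver; the point is only that the `catch` path runs the same cleanup as a normal shutdown before rethrowing.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical miniature of a driver that starts one probe loop per device.
public sealed class ProbeHost
{
    private readonly List<CancellationTokenSource> _probeCts = new();

    public int LiveProbeCount => _probeCts.Count;

    public void Initialize(IReadOnlyList<string> hosts)
    {
        try
        {
            foreach (var host in hosts)
            {
                if (string.IsNullOrWhiteSpace(host))
                    throw new InvalidOperationException("Device host address is empty.");

                var cts = new CancellationTokenSource();
                _probeCts.Add(cts);
                var token = cts.Token; // capture now; cts may be disposed before the task runs
                _ = Task.Run(() => ProbeLoopAsync(token));
            }
        }
        catch
        {
            Teardown(); // leave no live background work behind a failed initialise
            throw;
        }
    }

    public void Teardown()
    {
        foreach (var cts in _probeCts)
        {
            cts.Cancel();
            cts.Dispose();
        }
        _probeCts.Clear();
    }

    private static async Task ProbeLoopAsync(CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
                await Task.Delay(TimeSpan.FromSeconds(1), ct);
        }
        catch (OperationCanceledException)
        {
            // Expected when the probe is cancelled during teardown.
        }
    }
}
```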
### Driver.AbLegacy-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `AbLegacyStatusMapper.cs:26-56` |
| Status | Resolved |
**Description:** `MapLibplctagStatus` maps the integer codes -5/-7/-14/-16/-17. These
do not match the native libplctag PLCTAG_ERR_* constants (PLCTAG_ERR_TIMEOUT = -32,
PLCTAG_ERR_NOT_FOUND = -22, PLCTAG_ERR_NOT_ALLOWED = -21, PLCTAG_ERR_OUT_OF_BOUNDS =
-25, PLCTAG_ERR_BAD_CONNECTION = -8). The mapper operates on `(int)_tag.GetStatus()`,
where `GetStatus()` returns the libplctag .NET wrapper Status enum whose underlying
ordinals differ from the native codes - so the -5/-7/... values are at best the .NET
enum ordinals (unverified, undocumented) and at worst wrong. Any unmatched negative
status falls through to `BadCommunicationError`, so a timeout is reported as a generic
comms error rather than `BadTimeout`. `MapPcccStatus` is dead code - the PCCC STS byte
is never inspected because libplctag surfaces only its own status enum.
**Recommendation:** Verify the actual `libplctag.Status` enum values against the 1.5.2
package and map by enum name rather than magic integers. Either wire `MapPcccStatus`
into a real PCCC-STS path or delete it as dead code. The same defect exists in
`AbCipStatusMapper` and should be fixed in lockstep.
**Resolution:** Resolved 2026-05-22 — `MapLibplctagStatus` now casts to `libplctag.Status` and switches on named enum members (matching the AbCip mapper pattern); `MapPcccStatus` retained with a comment documenting it as a reference mapping for future PCCC-STS inspection; tests updated to use `Status` enum members.
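Mapping by enum member rather than by integer literal can be sketched like this. `WrapperStatus` is a hypothetical local enum standing in for the wrapper's status type (its member names and ordinals are assumptions, deliberately left unassigned); the OPC UA values on the right are the standard codes.

```csharp
using System;

// Hypothetical stand-in for the wrapper's status enum; real member names may differ.
public enum WrapperStatus
{
    Ok,
    ErrorTimeout,
    ErrorNotFound,
    ErrorNotAllowed,
    ErrorOutOfBounds,
    ErrorBadConnection,
}

public static class StatusMapper
{
    public const uint Good = 0x00000000;
    public const uint BadCommunicationError = 0x80050000;
    public const uint BadTimeout = 0x800A0000;
    public const uint BadNodeIdUnknown = 0x80340000;
    public const uint BadNotWritable = 0x803B0000;
    public const uint BadOutOfRange = 0x803C0000;

    // Switching on named members keeps the mapping correct even if the
    // enum's underlying ordinals change between package versions.
    public static uint Map(WrapperStatus status) => status switch
    {
        WrapperStatus.Ok => Good,
        WrapperStatus.ErrorTimeout => BadTimeout,
        WrapperStatus.ErrorNotFound => BadNodeIdUnknown,
        WrapperStatus.ErrorNotAllowed => BadNotWritable,
        WrapperStatus.ErrorOutOfBounds => BadOutOfRange,
        _ => BadCommunicationError, // any unmapped failure stays a generic comms error
    };
}
```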
### Driver.AbLegacy-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `AbLegacyDriver.cs:440` |
| Status | Open |
**Description:** `Dispose()` is implemented as
`DisposeAsync().AsTask().GetAwaiter().GetResult()` - sync-over-async. `ShutdownAsync`
awaits `_poll.DisposeAsync()` (which completes synchronously) and does no other real
async work, so a deadlock is unlikely in practice, but the pattern blocks the calling
thread and would deadlock if any awaited continuation were ever marshalled back to a
single-threaded synchronization context.
**Recommendation:** Prefer callers use `IAsyncDisposable`. If a synchronous `Dispose()`
must exist, perform the synchronous teardown directly (cancel CTSs, dispose runtimes)
rather than blocking on the async path.
**Resolution:** _(open)_
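One way to avoid sync-over-async is to put the teardown in a synchronous core that both disposal paths share. The sketch below assumes the teardown needs no real asynchronous work, as the description suggests; `ProbeOwner` is a hypothetical type, not the driver.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical owner of background work with both disposal interfaces.
public sealed class ProbeOwner : IDisposable, IAsyncDisposable
{
    private readonly CancellationTokenSource _cts = new();
    private int _disposed; // 0 = live, 1 = disposed

    public bool IsDisposed => Volatile.Read(ref _disposed) == 1;

    // Synchronous path: no blocking on a Task, so no deadlock risk under a
    // single-threaded synchronization context.
    public void Dispose() => TeardownCore();

    // Asynchronous path reuses the same core and completes synchronously.
    public ValueTask DisposeAsync()
    {
        TeardownCore();
        return ValueTask.CompletedTask;
    }

    private void TeardownCore()
    {
        // Interlocked.Exchange makes repeated or concurrent disposal a no-op.
        if (Interlocked.Exchange(ref _disposed, 1) == 1) return;
        _cts.Cancel();
        _cts.Dispose();
    }
}
```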
### Driver.AbLegacy-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `PlcFamilies/AbLegacyPlcFamilyProfile.cs:7-54`, `AbLegacyDriver.cs:48-52` |
| Status | Resolved |
**Description:** `AbLegacyPlcFamilyProfile` declares four record properties -
`DefaultCipPath`, `MaxTagBytes`, `SupportsStringFile`, `SupportsLongFile` - that are
never consumed; only `LibplctagPlcAttribute` is ever read. In particular:
- `DefaultCipPath` is dead: the per-family default path (empty for MicroLogix, 1,0
for SLC/PLC-5) is never used to substitute an empty CIP path. The CIP path always
comes verbatim from `AbLegacyHostAddress.CipPath`, so a SLC 500 misconfigured with
an empty path is never corrected to 1,0 even though the profile knows the right
default - contradicting the test-fixture doc, which calls out the /1,0 cip-path
workaround as required for SLC.
- `MaxTagBytes` is never used to validate or chunk a string/array read.
- `SupportsStringFile`/`SupportsLongFile` are never checked, so a `String` or `Long`
tag configured against a MicroLogix or PLC-5 (which the profile says lack them) is
accepted and only fails at runtime with an opaque comms error.
**Recommendation:** Either consume the profile fields (substitute `DefaultCipPath` when
the host CIP path is empty; reject `Long`/`String` tags against families whose profile
sets the corresponding flag false; use `MaxTagBytes` for validation) or remove the
unused fields and the doc comments that imply they are load-bearing.
**Resolution:** Resolved 2026-05-22 — `DeviceState.EffectiveCipPath` applies `DefaultCipPath` when the parsed host address has an empty CIP path; `InitializeAsync` validates `Long`/`String` tag types against `SupportsLongFile`/`SupportsStringFile` and throws early; `MaxTagBytes` tracked as a follow-up (string/array chunking requires broader design work).
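The two consumed profile fields can be illustrated with a small sketch. `PlcFamilyProfile` and `ProfileRules` are hypothetical simplifications, and the flag values passed in are illustrative rather than taken from the real profiles.

```csharp
#nullable enable
using System;

// Hypothetical, reduced version of a per-family profile.
public sealed record PlcFamilyProfile(
    string DefaultCipPath,
    bool SupportsStringFile,
    bool SupportsLongFile);

public static class ProfileRules
{
    // Falls back to the family default only when the configured path is empty.
    public static string EffectiveCipPath(string? configuredPath, PlcFamilyProfile profile) =>
        string.IsNullOrWhiteSpace(configuredPath) ? profile.DefaultCipPath : configuredPath;

    // Rejects unsupported tag types at load time instead of at first read.
    public static void ValidateTagType(string tagName, string dataType, PlcFamilyProfile profile)
    {
        if (dataType == "String" && !profile.SupportsStringFile)
            throw new InvalidOperationException($"Tag '{tagName}': String files are not supported by this PLC family.");
        if (dataType == "Long" && !profile.SupportsLongFile)
            throw new InvalidOperationException($"Tag '{tagName}': Long files are not supported by this PLC family.");
    }
}
```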
### Driver.AbLegacy-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `AbLegacyDriver.cs:340-345`, `AbLegacyDriver.cs:238-264` |
| Status | Open |
**Description:** Two minor organisational issues:
1. `ResolveHost` returns `_options.Devices.FirstOrDefault()?.HostAddress ??
DriverInstanceId` when the reference is unknown and no devices are configured.
`DriverInstanceId` is not a host address (ab://...), so a downstream
`IHostConnectivityProbe` / host lookup keyed on the returned value never matches a
real device. Returning the instance id as a fake host masks a configuration error.
2. `DiscoverAsync` always emits `IsArray: false` / `ArrayDim: null`. PCCC files are
inherently arrays of elements; a tag that genuinely addresses a multi-element
region cannot be represented. This is consistent with the PR-staged scope (the doc
says array coverage is thin) but should be tracked rather than silently shipped.
**Recommendation:** For (1), either throw / return a sentinel the caller can detect, or
document why falling back to the instance id is acceptable. For (2), record the
array-addressing gap as a tracked follow-up.
**Resolution:** _(open)_
@@ -0,0 +1,197 @@
# Code Review — Driver.Cli.Common
| Field | Value |
|---|---|
| Module | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 2 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Cli.Common-001, Driver.Cli.Common-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Driver.Cli.Common-003 |
| 4 | Error handling & resilience | Driver.Cli.Common-004 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.Cli.Common-005 |
| 10 | Documentation & comments | Driver.Cli.Common-006 |
## Findings
### Driver.Cli.Common-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:106-119` |
| Status | Resolved |
**Description:** The `FormatStatus` shortlist maps four OPC UA status names to incorrect
numeric codes. The correct OPC UA spec values (verified against the OPC Foundation
UA-.NETStandard `Opc.Ua.StatusCodes` table) are:
| Name in shortlist | Code used | Correct code | What the used code actually is |
|---|---|---|---|
| `BadTimeout` | `0x80060000` | `0x800A0000` | `0x80060000` = `BadEncodingError` |
| `BadNoCommunication` | `0x80070000` | `0x80310000` | `0x80070000` = `BadDecodingError` |
| `BadWaitingForInitialData` | `0x80080000` | `0x80320000` | `0x80080000` = `BadEncodingLimitsExceeded` |
| `BadNodeIdInvalid` | `0x80350000` | `0x80330000` | `0x80350000` = `BadAttributeIdInvalid` |
`Good` (`0x00000000`), `Bad` (`0x80000000`), `BadCommunicationError` (`0x80050000`),
`BadNodeIdUnknown` (`0x80340000`), `BadTypeMismatch` (`0x80740000`), and `Uncertain`
(`0x40000000`) are correct.
This is operator-facing and load-bearing: the CLI's whole purpose is to label driver
status codes so a human can interpret a probe/read/write. A real device timeout
(`0x800A0000`) renders as bare `0x800A0000` with no name, while an encoding-error
status (`0x80060000`) is mislabeled `BadTimeout`. A driver returning
`BadAttributeIdInvalid` (`0x80350000`) is mislabeled `BadNodeIdInvalid`. The
`SnapshotFormatterTests` `[Theory]` cases for these codes assert against the wrong
expectations and therefore pass while the mapping is wrong (see Driver.Cli.Common-005).
**Recommendation:** Correct the four mappings to the spec values. Prefer deriving names
from the OPC Foundation `Opc.Ua.StatusCodes` constants (the stack the project already
depends on transitively) rather than hand-maintaining a hex shortlist, so the table
cannot drift from the spec again. If a hand-list is kept, add a test that cross-checks
each entry against `Opc.Ua.StatusCodes` reflection.
**Resolution:** Resolved 2026-05-22 — corrected the four mismapped `FormatStatus` codes
to their canonical `Opc.Ua.StatusCodes` values (`BadTimeout` 0x800A0000, `BadNoCommunication`
0x80310000, `BadWaitingForInitialData` 0x80320000, `BadNodeIdInvalid` 0x80330000); the CLI
project does not reference the `Opc.Ua` package so the hex literals were corrected in place
with a sync note, and `SnapshotFormatterTests` was updated with corrected expectations plus
a regression `[Theory]` asserting the pre-fix wrong names no longer apply.
### Driver.Cli.Common-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:101-122` |
| Status | Resolved |
**Description:** `FormatStatus` matches the full 32-bit status word for exact equality
against the shortlist. OPC UA status codes carry sub-code/flag bits in the low 16 bits
(info type, structure-changed, semantics-changed, limit bits, overflow, etc.). A
driver-supplied status such as `0x80050001` or any `Good` value with info bits set
(e.g. an overflow bit) falls through the `switch` and renders as bare hex even though
the high bits clearly identify the severity class. The doc comment on `FormatStatus`
claims the well-known statuses are named, but only the bit-exact canonical forms are.
**Recommendation:** Either (a) narrow the doc-comment claim to bit-exact canonical
codes, or (b) match on the severity bits (`code & 0xC0000000`) to at least always emit
`Good` / `Uncertain` / `Bad` even when sub-code bits are set, and match the named codes
on the masked code (`code & 0xFFFF0000`).
**Resolution:** Resolved 2026-05-22 — `FormatStatus` now matches named codes on `code & 0xFFFF0000` and falls back to a severity-class label (`Good`/`Uncertain`/`Bad`) via `code & 0xC0000000` for unknown sub-codes; the stale "bare-hex for unknown codes" test expectation was corrected to reflect the new severity-class fallback.
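The masked match plus severity fallback can be sketched as below. This is an illustration under assumptions, not the shipped `FormatStatus`: the class name and the `0x… (Name)` output shape are hypothetical, and only the shortlist codes named in these findings are included.

```csharp
#nullable enable
using System;

public static class StatusNames
{
    public static string Format(uint code)
    {
        // Named codes live in the high 16 bits; the low 16 bits carry info/flag bits.
        string? name = (code & 0xFFFF0000u) switch
        {
            0x00000000u => "Good",
            0x40000000u => "Uncertain",
            0x80000000u => "Bad",
            0x80050000u => "BadCommunicationError",
            0x800A0000u => "BadTimeout",
            0x80310000u => "BadNoCommunication",
            0x80320000u => "BadWaitingForInitialData",
            0x80330000u => "BadNodeIdInvalid",
            0x80340000u => "BadNodeIdUnknown",
            0x80740000u => "BadTypeMismatch",
            _ => null,
        };

        // Unknown sub-code: fall back to the severity class in the top two bits.
        name ??= (code & 0xC0000000u) switch
        {
            0x00000000u => "Good",
            0x40000000u => "Uncertain",
            _ => "Bad", // assumption: the reserved 0xC0000000 pattern is also shown as Bad
        };

        return $"0x{code:X8} ({name})";
    }
}
```

With this shape, `0x80050001` still resolves to `BadCommunicationError`, and an unlisted code such as `0x80990000` is at least labeled `Bad` instead of rendering as bare hex.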
### Driver.Cli.Common-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/DriverCommandBase.cs:51-59` |
| Status | Resolved |
**Description:** `ConfigureLogging` assigns the process-global `Serilog.Log.Logger`
without disposing the previously assigned logger and the library never calls
`Log.CloseAndFlush()`. Each call creates a fresh `Logger` via `CreateLogger()` and
overwrites `Log.Logger`; the prior instance (and its console sink) is never disposed
or flushed. The class is the shared base for every driver CLI and the `subscribe` verb
is long-running — if any command path re-invokes `ConfigureLogging` the buffered
console sink is abandoned without a flush, and on process exit the final logger is also
never flushed. Verbose debug output written just before exit can be lost.
**Recommendation:** Call `Log.CloseAndFlush()` on shutdown (e.g. in a `finally` in the
command `ExecuteAsync`, or via a `protected` disposal helper on this base). Treat
`ConfigureLogging` as call-once / idempotent and document that. At minimum capture and
dispose the previous logger if reconfiguration is genuinely intended.
**Resolution:** Resolved 2026-05-22 — `ConfigureLogging` is now idempotent (guarded by `_loggingConfigured` field) and disposes the previous `Log.Logger` before overwriting; a new `protected static FlushLogging()` helper calls `Log.CloseAndFlush()` for commands to call in their `finally` blocks; XML doc updated accordingly.
### Driver.Cli.Common-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:68-70` |
| Status | Open |
**Description:** `FormatTable` calls `rows.Max(r => r.Tag.Length)` (and the same for the
value and status columns) without guarding against empty input. When `tagNames` and
`snapshots` are both empty (equal length, so the mismatch check at line 56 passes),
`Enumerable.Max` throws `InvalidOperationException` ("Sequence contains no elements").
A batch read that legitimately returns zero tags therefore crashes the formatter
instead of producing an empty (header-only) table.
**Recommendation:** Short-circuit on `rows.Length == 0` (return just the header +
separator, or an explicit "no rows" line), or use `DefaultIfEmpty(0).Max(...)` for the
width computations.
**Resolution:** _(open)_
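The `DefaultIfEmpty` variant of the recommendation can be sketched as follows. `TableWidths.ColumnWidth` is a hypothetical helper, and taking the header length as the minimum width is an assumption about the table layout.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TableWidths
{
    // Returns the widest cell length, never less than the header, and
    // tolerates an empty cell sequence instead of throwing from Max().
    public static int ColumnWidth(IEnumerable<string> cells, string header) =>
        Math.Max(header.Length, cells.Select(c => c.Length).DefaultIfEmpty(0).Max());
}
```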
### Driver.Cli.Common-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common.Tests/SnapshotFormatterTests.cs:27-37` |
| Status | Resolved |
**Description:** The `FormatStatus_names_well_known_status_codes` `[Theory]` asserts
`0x80060000 => "BadTimeout"`, which encodes the wrong spec value (see
Driver.Cli.Common-001). The test passes because it validates the formatter against the
same incorrect table, so the bug is invisible to CI. Additionally there is no coverage
for: `DriverCommandBase` (`ConfigureLogging` verbose vs non-verbose level selection — no
test exercises the base at all), `FormatTable` with empty input (Driver.Cli.Common-004
would have been caught), `FormatValue` with array / enum / custom `object` values, and
`FormatTimestamp` with `DateTimeKind.Unspecified` (the docs imply Unspecified is
normalised but only `Local` is tested).
**Recommendation:** Fix the `[Theory]` expectations once Driver.Cli.Common-001 is
resolved, and add a test asserting each shortlist entry against the OPC Foundation
`Opc.Ua.StatusCodes` constants so the table cannot silently drift. Add `FormatTable`
empty-input and `DriverCommandBase` level-selection tests.
**Resolution:** Resolved 2026-05-22 — added `FormatTable_with_empty_input_returns_header_only` (exercises the -004 fix), `FormatStatus_with_sub_code_bits_resolves_to_named_class` / `FormatStatus_unknown_sub_code_falls_back_to_severity_class` Theories (cover -002 fix), and a new `DriverCommandBaseTests` class with four tests covering verbose/non-verbose level selection, idempotency of `ConfigureLogging`, and `FlushLogging`; stale `FormatStatus_unknown_codes_fall_back_to_hex_only` expectation corrected to match the -002 severity-class fallback.
### Driver.Cli.Common-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:71`, `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/DriverCommandBase.cs:9` |
| Status | Open |
**Description:** Two minor doc inaccuracies. (1) The comment at `SnapshotFormatter.cs:71`
states the "source-time column is fixed-width (ISO-8601 to ms) so no max-measurement
needed" — true only when every snapshot has a non-null `SourceTimestampUtc`.
`FormatTimestamp` returns `"-"` for a null timestamp, so a mixed table has a 1-char-wide
cell in an otherwise 24-char column; the column is unaligned. Harmless (right-most, no
padding consumer) but the stated invariant does not hold. (2) The `DriverCommandBase`
class summary enumerates "Modbus / AB CIP / AB Legacy / S7 / TwinCAT" as the driver CLIs
but omits FOCAS, which `docs/DriverClis.md` lists as the sixth CLI built on this shared
library. The XML doc is stale relative to the shipped driver-CLI set.
**Recommendation:** Reword the `SnapshotFormatter.cs:71` comment to note the column is
right-most and intentionally unpadded rather than claiming fixed width. Add FOCAS to the
`DriverCommandBase` class-summary driver list.
**Resolution:** _(open)_
@@ -0,0 +1,183 @@
# Code Review — Driver.FOCAS.Cli
| Field | Value |
|---|---|
| Module | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Cli` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.FOCAS.Cli-001 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Driver.FOCAS.Cli-002 |
| 4 | Error handling & resilience | Driver.FOCAS.Cli-001, Driver.FOCAS.Cli-003 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Driver.FOCAS.Cli-004 |
| 7 | Design-document adherence | Driver.FOCAS.Cli-005 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | No issues found (see note) |
| 10 | Documentation & comments | No issues found |
> Category 9 note: per `docs/DriverClis.md` the FOCAS CLI deliberately ships
> with no CLI-level test project (hardware-gated, followed the Tier-C isolation
> work on task #220). The four command classes are thin pass-throughs to the
> already-tested `FocasDriver`; the only CLI-local logic is `ParseValue` /
> `ParseBool` / `SynthesiseTagName`, which the sibling CLIs cover with unit
> tests. The absence of a `*.Cli.Tests` project is an intentional, documented
> gap rather than a review finding — but see Driver.FOCAS.Cli-001 for the parse
> path that would benefit most from coverage.
## Findings
### Driver.FOCAS.Cli-001
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `Commands/WriteCommand.cs:58-68` |
| Status | Open |
**Description:** `WriteCommand.ParseValue` parses the numeric `--value` types
(`Byte`/`Int16`/`Int32`/`Float32`/`Float64`) with `sbyte.Parse` / `short.Parse`
/ etc. These throw raw `FormatException` or `OverflowException` for malformed or
out-of-range input. Only the `Bit` case and the unsupported-type case throw
`CliFx.Exceptions.CommandException`. CliFx renders a `CommandException` as a
clean one-line error, but an uncaught `FormatException`/`OverflowException`
surfaces as a full .NET stack trace — a poor experience for an operator who
simply mistyped a value (e.g. `write -a R100 -t Int16 -v abc`). The parse
failure occurs before any driver work, so the redundant stack trace also
obscures that the write never reached the CNC.
**Recommendation:** Wrap the numeric parses (e.g. via `TryParse` per type, or a
`try`/`catch` that rethrows as `CommandException`) so malformed `--value` input
produces a clean, actionable message naming the expected type and the rejected
literal — consistent with how `ParseBool` already handles bad boolean input.
The same pattern exists in the sibling S7 CLI; a shared helper in
`Driver.Cli.Common` would fix both.
**Resolution:** _(open)_
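A shared parse helper along the lines recommended above might look like this sketch. To stay self-contained it throws a hypothetical `CliInputException` where the real CLI would throw `CliFx.Exceptions.CommandException`; the type names handled are a subset chosen for illustration.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-in for CliFx's CommandException.
public sealed class CliInputException : Exception
{
    public CliInputException(string message) : base(message) { }
}

public static class ValueParser
{
    // Parses a --value literal for the given type name, turning malformed or
    // out-of-range input into one clean, operator-readable error.
    public static object Parse(string typeName, string raw)
    {
        switch (typeName)
        {
            case "Int16":
                if (short.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out var i16))
                    return i16;
                break;
            case "Int32":
                if (int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out var i32))
                    return i32;
                break;
            case "Float64":
                if (double.TryParse(raw, NumberStyles.Float, CultureInfo.InvariantCulture, out var f64))
                    return f64;
                break;
            default:
                throw new CliInputException($"Unsupported value type '{typeName}'.");
        }

        throw new CliInputException($"Value '{raw}' is not a valid {typeName}.");
    }
}
```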
### Driver.FOCAS.Cli-002
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `Commands/SubscribeCommand.cs:45-51` |
| Status | Open |
**Description:** The `subscribe` command attaches an `OnDataChange` handler that
calls the synchronous `console.Output.WriteLine`. `OnDataChange` is raised from
the driver's `PollGroupEngine` tick thread, while the command's main flow writes
the "Subscribed to ..." banner from the CliFx invocation thread. The CliFx
`IConsole.Output` `TextWriter` is not documented as thread-safe; with a single
poll group the change events are serialised, but the banner write at lines 55-56
can interleave with the first poll-driven change line. The handler is also never
detached from the event before driver disposal — benign here because the driver
is disposed in the same `finally`, but it leaves a dangling subscription if the
command is ever refactored to reuse the driver.
**Recommendation:** Write the "Subscribed" banner before wiring the
`OnDataChange` handler (it is informational and ordering-sensitive), or guard
console writes with a lock shared between the banner and the handler. Optionally
detach the handler in the `finally` block before `ShutdownAsync` for symmetry
with the `handle` teardown already present there.
**Resolution:** _(open)_
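The lock-based option can be sketched with a thin wrapper shared by the banner write and the change handler. `SynchronizedOutput` is a hypothetical helper; it wraps any `TextWriter`, so the same idea would apply to the writer exposed by the CLI console.

```csharp
using System;
using System.IO;

// Serialises line writes from the command thread and the poll-tick thread.
public sealed class SynchronizedOutput
{
    private readonly TextWriter _inner;
    private readonly object _gate = new();

    public SynchronizedOutput(TextWriter inner) =>
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));

    public void WriteLine(string line)
    {
        lock (_gate)
        {
            _inner.WriteLine(line);
        }
    }
}
```

The framework's `TextWriter.Synchronized` wrapper is an alternative when a plain thread-safe `TextWriter` is enough.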
### Driver.FOCAS.Cli-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `FocasCommandBase.cs:19` (`CncPort`), `FocasCommandBase.cs:27` (`TimeoutMs`), `Commands/SubscribeCommand.cs:23` (`IntervalMs`) |
| Status | Open |
**Description:** The numeric command options `--cnc-port`, `--timeout-ms`, and
`--interval-ms` are accepted without range validation. A zero or negative
`--cnc-port` produces an invalid `focas://host:<n>` string; `--timeout-ms 0`
yields a zero `TimeSpan` operation timeout; a zero/negative `--interval-ms`
produces a non-positive `publishingInterval` passed straight into
`PollGroupEngine.Subscribe`. Depending on the engine's tolerance these surface
either as an opaque downstream exception or as a tight-spinning poll loop rather
than a clear "value must be positive" message at the CLI boundary.
**Recommendation:** Validate the three numeric options at the top of
`ExecuteAsync` (or in `FocasCommandBase`) and throw a
`CliFx.Exceptions.CommandException` when out of range — port in `1..65535`,
timeout and interval strictly positive. The same gap exists across the sibling
driver CLIs, so a shared validation helper in `Driver.Cli.Common` is the
cleaner fix.
**Resolution:** _(open)_
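The boundary checks described above could be collected in a helper like the sketch below. `OptionGuards` and `CliOptionException` are hypothetical; the real CLI would throw `CliFx.Exceptions.CommandException` instead.

```csharp
using System;

// Hypothetical stand-in for CliFx's CommandException.
public sealed class CliOptionException : Exception
{
    public CliOptionException(string message) : base(message) { }
}

public static class OptionGuards
{
    // TCP ports are valid in 1..65535.
    public static int RequirePort(int value, string option)
    {
        if (value < 1 || value > 65535)
            throw new CliOptionException($"{option} must be between 1 and 65535 (got {value}).");
        return value;
    }

    // Timeouts and intervals must be strictly positive.
    public static int RequirePositive(int value, string option)
    {
        if (value <= 0)
            throw new CliOptionException($"{option} must be a positive number of milliseconds (got {value}).");
        return value;
    }
}
```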
### Driver.FOCAS.Cli-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `Commands/ProbeCommand.cs:37,54`; `Commands/ReadCommand.cs:37,46`; `Commands/WriteCommand.cs:45,54`; `Commands/SubscribeCommand.cs:39,73` |
| Status | Open |
**Description:** Every command declares `await using var driver = new FocasDriver(...)`
**and** explicitly calls `await driver.ShutdownAsync(CancellationToken.None)` in
the `finally` block. `FocasDriver.DisposeAsync()` itself calls `ShutdownAsync`,
so shutdown runs twice per command invocation. `FocasDriver.ShutdownAsync` is
idempotent (it clears `_devices` / `_tagsByName`, and the second pass iterates
an empty collection), so there is no functional bug — but the redundant call is
dead weight and obscures intent: a reader cannot tell whether the explicit
`ShutdownAsync` or the `await using` is the real teardown.
**Recommendation:** Drop the explicit `ShutdownAsync` from the `finally` blocks
and rely on `await using` for disposal, or drop `await using` and keep the
explicit teardown — but not both. The same redundancy exists in the sibling CLIs.
**Resolution:** _(open)_
### Driver.FOCAS.Cli-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `Commands/WriteCommand.cs:50`, `Commands/ProbeCommand.cs:50` (via `SnapshotFormatter.FormatStatus`) |
| Status | Open |
**Description:** `docs/Driver.FOCAS.Cli.md` documents `BadDeviceFailure` and
`BadCommunicationError` as the key diagnostic signals an operator reads off
`probe` / `write` output ("A `BadCommunicationError` means ... `BadDeviceFailure`
after a successful connect means ..."). The FOCAS driver's `FocasStatusMapper`
also emits `BadNotWritable` (0x803B0000), `BadOutOfRange` (0x803C0000),
`BadNotSupported` (0x803D0000), `BadDeviceFailure` (0x80550000),
`BadInternalError` (0x80020000), and `BadTimeout` (0x800A0000). The shared
`SnapshotFormatter.FormatStatus` shortlist only names `Good`, `Bad`,
`BadCommunicationError`, `BadTimeout` (0x80060000 — note this is a *different*
code than the mapper's `BadTimeout` 0x800A0000), `BadNoCommunication`,
`BadWaitingForInitialData`, `BadNodeIdUnknown`, `BadNodeIdInvalid`,
`BadTypeMismatch`, and `Uncertain`. Consequently a FOCAS `write` to a
non-writable address, a parameter-write rejected by the CNC, or a
`BadDeviceFailure` session-setup rejection renders as a bare hex code
(`0x803B0000`, `0x80550000`, …) with no name — directly contradicting the
documented workflow where the operator is told to read those status names.
**Recommendation:** Extend `SnapshotFormatter.FormatStatus` (in
`Driver.Cli.Common`) to name the `Bad*` codes the native-protocol drivers
actually emit — at minimum `BadNotWritable`, `BadOutOfRange`, `BadNotSupported`,
`BadDeviceFailure`, `BadInternalError`, and the mapper's `BadTimeout`
(0x800A0000). The fix belongs in the shared library, but it is recorded here
because the gap defeats this module's documented `probe`/`write` diagnostic
workflow; cross-reference the `Driver.Cli.Common` review.
**Resolution:** _(open)_
@@ -0,0 +1,330 @@
# Code Review — Driver.FOCAS
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.FOCAS-001, Driver.FOCAS-002, Driver.FOCAS-003 |
| 2 | OtOpcUa conventions | Driver.FOCAS-004 |
| 3 | Concurrency & thread safety | Driver.FOCAS-005 |
| 4 | Error handling & resilience | Driver.FOCAS-006, Driver.FOCAS-007 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Driver.FOCAS-008 |
| 7 | Design-document adherence | Driver.FOCAS-009 |
| 8 | Code organization & conventions | Driver.FOCAS-010, Driver.FOCAS-011 |
| 9 | Testing coverage | Driver.FOCAS-012 |
| 10 | Documentation & comments | No issues found |
## Findings
### Driver.FOCAS-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `FocasDriverFactoryExtensions.cs:54-86`, `FocasDriverFactoryExtensions.cs:132-140` |
| Status | Resolved |
**Description:** `FocasDriverConfigDto` exposes only `Backend`, `Series`, `TimeoutMs`,
`Devices`, `Tags`, and `Probe`. It has no `FixedTree`, `AlarmProjection`, or
`HandleRecycle` properties, and `CreateInstance` never sets those three options on
`FocasDriverOptions`. As a result, a deployment that follows the documented config -
`docs/drivers/FOCAS.md` shows `"FixedTree": { "Enabled": true }`,
`"AlarmProjection": { "Enabled": true }`, and `"HandleRecycle": { "Enabled": true }`
inside `Config` - is parsed with `PropertyNameCaseInsensitive` and the unknown sections
are discarded. The features stay at their hard-coded defaults (all `Enabled = false`).
The fixed-node tree never appears, alarm subscriptions throw `NotSupportedException`
("FOCAS alarm projection is disabled"), and handle recycling never runs - despite the
operator explicitly opting in.
**Recommendation:** Add `FixedTree`, `AlarmProjection`, and `HandleRecycle` DTO classes
to `FocasDriverConfigDto`, parse their `TimeSpan`/`bool` fields, and populate the
corresponding `FocasDriverOptions` properties in `CreateInstance`. Consider enabling
strict JSON handling (`UnmappedMemberHandling.Disallow`) so future unknown config
sections fail loudly instead of being dropped.
**Resolution:** Resolved 2026-05-22 — added `FixedTreeDto`/`AlarmProjectionDto`/`HandleRecycleDto` to `FocasDriverConfigDto` and `Build*` mappers in `CreateInstance` that populate the matching `FocasDriverOptions` properties (missing section / field keeps its default).
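The strict-parsing suggestion can be sketched with `System.Text.Json` (the `UnmappedMemberHandling` option requires .NET 8 or later). The DTO shapes here are hypothetical reductions, not the real `FocasDriverConfigDto`.

```csharp
#nullable enable
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical, reduced config DTOs.
public sealed class FeatureToggleDto
{
    public bool Enabled { get; set; }
}

public sealed class DriverConfigDto
{
    public int TimeoutMs { get; set; }
    public FeatureToggleDto? FixedTree { get; set; }
}

public static class ConfigParser
{
    private static readonly JsonSerializerOptions Strict = new()
    {
        PropertyNameCaseInsensitive = true,
        // Unknown JSON properties raise JsonException instead of being dropped.
        UnmappedMemberHandling = JsonUnmappedMemberHandling.Disallow,
    };

    public static DriverConfigDto Parse(string json) =>
        JsonSerializer.Deserialize<DriverConfigDto>(json, Strict)
        ?? throw new JsonException("Config JSON was null.");
}
```

Under these options a config containing a section the DTO does not declare fails at load time, which is the "fail loudly" behaviour the recommendation asks for.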
### Driver.FOCAS-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `WireFocasClient.cs:164-179`, `FocasDriver.cs:513`, `FocasDriver.cs:593` |
| Status | Resolved |
**Description:** The fixed-tree bootstrap probes the `ProgramInfo` capability via
`SafeTryProbe(() => client.GetProgramInfoAsync(ct))` and treats a non-null result as
"supported". But `WireFocasClient.GetProgramInfoAsync` never throws on a FOCAS error
return code: `ReadExecutingProgramNameAsync`, `ReadBlockCountAsync`, and
`ReadOperationModeCodeAsync` all return `FocasResult<T>` envelopes, and the method
substitutes defaults (`string.Empty`, `0`) when `IsOk` is false instead of throwing. It
only throws from `RequireConnected()`. Consequently `GetProgramInfoAsync` always
returns a non-null `FocasProgramInfo`, so `Capabilities.ProgramInfo` is set `true` even
on a CNC series that returns `EW_FUNC`/`EW_NOOPT` for `cnc_exeprgname2`/`cnc_rdopmode`.
The driver then emits the `Program/` and `OperationMode/` subtrees and polls them every
tick against a controller that does not support them - the exact "nodes that only ever
return BadDeviceFailure" outcome the capability suppression was designed to prevent
(`docs/drivers/FOCAS.md`, "Per-series node suppression").
**Recommendation:** Make `GetProgramInfoAsync` throw (or return a nullable result) when
the underlying `cnc_exeprgname2` / `cnc_rdopmode` calls report a non-zero RC, so
`SafeTryProbe` can correctly classify the series. At minimum require the program-name
or op-mode read to be `IsOk` before declaring the capability present.
**Resolution:** Resolved 2026-05-22 — `WireFocasClient.GetProgramInfoAsync` now throws `InvalidOperationException` when neither the `cnc_exeprgname2` nor the `cnc_rdopmode` read is `IsOk`, so `SafeTryProbe` records `ProgramInfo` as unsupported on series that answer `EW_FUNC`/`EW_NOOPT`.
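The probe contract described above can be sketched as follows. `ReadResult<T>`, `ProgramInfo`, and `CapabilityProbe` are hypothetical simplifications of the driver's envelope and probe helper; the sketch only shows why the probed call must throw when neither underlying read succeeded.

```csharp
using System;

// Hypothetical result envelope and info record.
public readonly record struct ReadResult<T>(bool IsOk, T Value);
public sealed record ProgramInfo(string Name, int OperationMode);

public static class CapabilityProbe
{
    // Throws when neither underlying read succeeded, so the probe below can
    // distinguish "unsupported" from "supported but currently empty".
    public static ProgramInfo GetProgramInfo(ReadResult<string> name, ReadResult<int> mode)
    {
        if (!name.IsOk && !mode.IsOk)
            throw new InvalidOperationException("Program info is not available on this controller.");

        return new ProgramInfo(
            name.IsOk ? name.Value : string.Empty,
            mode.IsOk ? mode.Value : 0);
    }

    // Treats a thrown InvalidOperationException as "capability not present".
    public static bool TryProbe(Func<ProgramInfo> probe)
    {
        try
        {
            return probe() is not null;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }
}
```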
### Driver.FOCAS-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `FocasDriver.cs:71-79` |
| Status | Resolved |
**Description:** In `InitializeAsync`, capability-matrix validation only runs when
`_devices.TryGetValue(tag.DeviceHostAddress, out var device)` succeeds. A tag whose
`DeviceHostAddress` does not match any configured device (a common config typo, e.g. a
trailing `:8193` mismatch or a wrong host) silently skips validation and is still added
to `_tagsByName`. The mistake is not surfaced at load time - it only manifests at read
time as `BadNodeIdUnknown` (`ReadAsync` lines 191-194), defeating the documented goal
that "config errors now fail at load instead of per-read"
(`docs/v2/focas-version-matrix.md`).
**Recommendation:** After parsing the tag address, if `_devices` does not contain
`tag.DeviceHostAddress`, throw an `InvalidOperationException` naming the tag and the
unresolved device host so the operator fixes the typo at startup.
**Resolution:** Resolved 2026-05-22 — `InitializeAsync` now throws `InvalidOperationException` naming the tag and the unresolved device when `_devices` does not contain `tag.DeviceHostAddress`, preventing silent skip-and-defer to per-read `BadNodeIdUnknown`.
### Driver.FOCAS-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | OtOpcUa conventions |
| Location | `FocasDriver.cs:374-379`, `WireFocasClient.cs:48-50` |
| Status | Resolved |
**Description:** `DiscoverAsync` emits user tags with
`SecurityClass = tag.Writable ? SecurityClassification.Operate : SecurityClassification.ViewOnly`,
and `FocasTagDefinition.Writable` defaults to `true` (also defaulted to `true` in the
factory - `t.Writable ?? true`). But the production `wire` backend's
`WireFocasClient.WriteAsync` unconditionally returns `FocasStatusMapper.BadNotWritable`
- the driver is read-only against FOCAS by design (`docs/drivers/FOCAS.md`). The result
is that every tag is advertised in the address space as a writable `Operate` node, yet
every write attempt fails. This is misleading to OPC UA clients and to the
`DriverNodeManager` ACL layer, which will grant write permission on nodes that can never
be written.
**Recommendation:** Either default `Writable` to `false` for the FOCAS driver, or have
`DiscoverAsync` force `SecurityClassification.ViewOnly` when the active backend cannot
write. Given the wire backend is read-only and is the only production backend, treating
all FOCAS tags as `ViewOnly` is the simplest correct behaviour.
**Resolution:** Resolved 2026-05-22 — `DiscoverAsync` now unconditionally emits `SecurityClassification.ViewOnly` for all user-authored tags; the `Writable` config field no longer influences the advertised security class since the wire backend never writes.
### Driver.FOCAS-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `FocasDriver.cs:28`, `FocasDriver.cs:206-215`, `FocasDriver.cs:261`, `FocasDriver.cs:274` |
| Status | Resolved |
**Description:** `_health` is a plain (non-volatile) field mutated from multiple
concurrent contexts - `ReadAsync`, `WriteAsync`, and the per-device `ProbeLoopAsync` can
all run on different threads simultaneously (subscriptions go through `PollGroupEngine`
timers; probe loops are `Task.Run`). Several updates are read-modify-write - e.g.
`new DriverHealth(DriverState.Degraded, _health.LastSuccessfulRead, ...)` reads `_health`
then writes a new instance - so a concurrent update can be lost or a stale
`LastSuccessfulRead` propagated. While `DriverHealth` is an immutable record and the
reference write is atomic, the lack of synchronization means `GetHealth()` can observe
torn-in-time state and successful-read timestamps can regress.
**Recommendation:** Guard `_health` reads/writes with a lock, or use
`Interlocked.Exchange`/`Volatile` around the whole record reference and compute the new
value from a single captured snapshot. The `DeviceState`/`HostState` transition already
uses `ProbeLock`; apply the same discipline to driver health.
**Resolution:** Resolved 2026-05-22 — All `_health` reads use `Volatile.Read(ref _health)` and all writes use `Volatile.Write(ref _health, ...)`, ensuring every thread observes the latest reference and multi-step read-modify-write sequences capture a stable snapshot before computing the new value.
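The recommendation's "compute the new value from a single captured snapshot" option can be sketched as a compare-and-swap loop. `Volatile.Read`/`Volatile.Write` make the latest reference visible to every thread, but they do not by themselves make a read-then-write sequence atomic; if lost updates between concurrent writers ever need to be ruled out entirely, something like the following is one option. `HealthHolder` is a hypothetical helper, and the `DriverHealth` and `DriverState` shapes below are simplified stand-ins, not the driver's real types.

```csharp
using System;
using System.Threading;

var holder = new HealthHolder();
holder.Update(h => h with { State = DriverState.Degraded, Error = "probe failed" });
Console.WriteLine(holder.Current.State); // Degraded

// Simplified stand-ins for the driver's types (assumed shapes, for illustration only).
public enum DriverState { Healthy, Degraded }
public sealed record DriverHealth(DriverState State, DateTime? LastSuccessfulRead, string Error);

// Hypothetical helper that owns the health reference.
public sealed class HealthHolder
{
    private DriverHealth _health = new(DriverState.Healthy, null, "");

    public DriverHealth Current => Volatile.Read(ref _health);

    // Applies 'update' to a captured snapshot and retries if another thread
    // replaced the reference in the meantime, so no concurrent update is lost.
    public DriverHealth Update(Func<DriverHealth, DriverHealth> update)
    {
        while (true)
        {
            var snapshot = Volatile.Read(ref _health);
            var next = update(snapshot);
            var previous = Interlocked.CompareExchange(ref _health, next, snapshot);
            if (ReferenceEquals(previous, snapshot))
                return next;
        }
    }
}
```

Because the update delegate may run more than once under contention, it should be side-effect free.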
### Driver.FOCAS-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `FocasDriver.cs:859-874`, `WireFocasClient.cs:22-31` |
| Status | Resolved |
**Description:** `EnsureConnectedAsync` reuses the cached `IFocasClient` instance across
a transient disconnect: it only checks `device.Client is { IsConnected: true }` and
otherwise calls `ConnectAsync` again on the same object. For a `WireFocasClient` whose
underlying `FocasWireClient` has been disposed (e.g. via a `HandleRecycle` /
`DisposeClient` race, or a prior teardown), every subsequent call hits
`FocasWireClient.ThrowIfDisposed` and throws `ObjectDisposedException`. In `ReadAsync`
that exception is caught only by the generic `catch (Exception ex)` and mapped to a
permanent `BadCommunicationError` - the device stays wedged with no recovery path until
`ReinitializeAsync` is invoked, because the reconnect logic never discards the disposed
client.
**Recommendation:** On any connect/use failure, treat a disposed or non-connected client
as unrecoverable and recreate it from `_clientFactory`. Simplest: in
`EnsureConnectedAsync`, when `device.Client` is non-null but not connected, dispose and
null it before creating a fresh instance, rather than retrying `ConnectAsync` on the
stale object.
**Resolution:** Resolved 2026-05-22 — `EnsureConnectedAsync` now unconditionally disposes and nulls any existing non-connected client before calling `_clientFactory.Create()`, preventing `ObjectDisposedException` loops on a stale `WireFocasClient` after a `HandleRecycle` race or prior teardown.
### Driver.FOCAS-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `FocasDriver.cs:140-148`, `FocasDriver.cs:478-484`, `FocasDriver.cs:529-533`, `FocasAlarmProjection.cs:61-63` |
| Status | Open |
**Description:** Numerous `try { ... } catch {}` blocks swallow every exception with no
logging - `ShutdownAsync` (CTS cancel/dispose), `RecycleLoopAsync` (`DisposeClient`),
`FixedTreeLoopAsync` transient catches, `ProbeLoopAsync`, and the alarm projection's
`sub.Cts.Cancel()`. The driver takes no `ILogger` dependency at all (only
`FocasWireClient` optionally accepts one, and the driver never supplies it). A CNC that
is silently failing every probe/poll tick produces no diagnostic trail, which conflicts
with the project's Serilog logging convention and forces field troubleshooting to rely
solely on `GetHealth()`.
**Recommendation:** Inject an `ILogger<FocasDriver>` and log caught exceptions in the
poll/probe/recycle loops at `Debug`/`Warning`. Pass a logger into `FocasWireClient` so
the per-response `Debug` entries it already emits are actually captured.
**Resolution:** _(open)_
### Driver.FOCAS-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `FocasDriver.cs:201`, `FocasDriver.cs:253` |
| Status | Open |
**Description:** `ReadAsync` and `WriteAsync` call `FocasAddress.TryParse(def.Address)`
on every operation, even though `InitializeAsync` already parsed and validated every
tag address. On a subscription hot path (each poll tick re-enters `ReadAsync`) this
re-parses and allocates a `FocasAddress` record per tag per tick unnecessarily.
**Recommendation:** Parse each tag address once at `InitializeAsync` and store the
parsed `FocasAddress` on `FocasTagDefinition` (or in a side dictionary), so the runtime
read/write paths use the cached value.
**Resolution:** _(open)_
### Driver.FOCAS-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `FocasDriverOptions.cs:110-115`, `FocasDriver.cs:468-486`, `FocasDriverFactoryExtensions.cs:75-80` |
| Status | Open |
**Description:** `FocasProbeOptions.Timeout` is parsed by the factory
(`FocasProbeDto.TimeoutMs` to `FocasProbeOptions.Timeout`) but never consumed.
`ProbeLoopAsync` calls `client.ProbeAsync(ct)` with only the probe-loop cancellation
token; no per-probe timeout is applied, and `EnsureConnectedAsync` uses
`_options.Timeout` rather than `Probe.Timeout`. A hung CNC socket during a probe blocks
until the OS TCP timeout rather than the configured `Probe.Timeout`.
**Recommendation:** Apply `Probe.Timeout` as a linked `CancellationTokenSource` timeout
around the `ProbeAsync` call, or remove the dead `Timeout` field from
`FocasProbeOptions` / `FocasProbeDto` if it is genuinely not intended.
**Resolution:** _(open)_
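A minimal sketch of the first option in the recommendation, assuming the probe call honours its cancellation token. `ProbeTimeout.ProbeWithTimeoutAsync` is a hypothetical helper, and the delegate stands in for `client.ProbeAsync(ct)`, whose real signature is not shown in this review.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Usage: a probe that never completes is cut off after 50 ms.
var ok = await ProbeTimeout.ProbeWithTimeoutAsync(
    async ct => { await Task.Delay(Timeout.InfiniteTimeSpan, ct); return true; },
    TimeSpan.FromMilliseconds(50),
    CancellationToken.None);
Console.WriteLine(ok); // False

public static class ProbeTimeout
{
    // Runs 'probe' under a token that is cancelled either by the probe-loop token
    // or after 'timeout', whichever comes first.
    public static async Task<bool> ProbeWithTimeoutAsync(
        Func<CancellationToken, Task<bool>> probe,
        TimeSpan timeout,
        CancellationToken loopToken)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(loopToken);
        cts.CancelAfter(timeout);
        try
        {
            return await probe(cts.Token).ConfigureAwait(false);
        }
        catch (OperationCanceledException) when (!loopToken.IsCancellationRequested)
        {
            // Only the per-probe timeout fired: report a failed probe, not a shutdown.
            return false;
        }
    }
}
```

Cancellation raised by the loop token itself still propagates, so a shutdown is not mistaken for a probe failure.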
### Driver.FOCAS-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `IFocasClient.cs:210-227` (`FocasOpMode`), `FocasConstants.cs:42-78` (`FocasOperationMode`) |
| Status | Open |
**Description:** There are two parallel operation-mode-to-text mappings with divergent
labels. `FocasOpMode.ToText` (used by the driver fixed-tree `OperationMode/ModeText`
node) yields `"TJOG"`, `"TEACH_IN_HANDLE"`; `FocasOperationModeExtensions.ToText` (in
the Wire layer) yields `"T-JOG"`, `"TEACH-IN-HANDLE"`. They also use different fallback
formats (`Mode{mode}` vs the bare number). The same concept is encoded twice with
inconsistent results depending on which path renders it.
**Recommendation:** Consolidate to a single op-mode enum + `ToText` helper shared by
both the wire layer and the driver projection, with one canonical label set.
**Resolution:** _(open)_
### Driver.FOCAS-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `IFocasClient.cs:275-287` (`FocasAlarmType`), `FocasAlarmProjection.cs:149-175` |
| Status | Open |
**Description:** `FocasAlarmType` declares its constants as `public const int`, but the
only consumers - `FocasAlarmProjection.MapAlarmType(short type)` and
`MapSeverity(short type)` - take a `short` and `switch` against these `int` constants. It
compiles only because the values (0..13) fit in `short` range as constant expressions.
The type mismatch is a latent maintenance hazard: adding a constant above
`short.MaxValue`, or changing the projection signatures, would break the switch in
non-obvious ways. `FocasAlarmType.All` is `-1` and is also passed where a `short` is
expected by `ReadAlarmsAsync`.
**Recommendation:** Declare the `FocasAlarmType` constants as `short` (or make it an
`enum : short`) so the type matches the wire field width and the projection signatures.
**Resolution:** _(open)_
### Driver.FOCAS-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `FocasDriverFactoryExtensions.cs`, `FocasDriver.cs:495-629` (`FixedTreeLoopAsync`) |
| Status | Resolved |
**Description:** The unit test project does not exercise
`FocasDriverFactoryExtensions.CreateInstance` with `FixedTree` / `AlarmProjection` /
`HandleRecycle` config sections - which is why the config-mapping gap in
Driver.FOCAS-001 was not caught. There is also no test that drives the fixed-tree
bootstrap / capability-probe path (`FixedTreeLoopAsync`), so the false-positive
`ProgramInfo` capability in Driver.FOCAS-002 is untested, and the
`EnsureConnectedAsync` reconnect-after-disconnect path (Driver.FOCAS-006) has no
coverage.
**Recommendation:** Add factory tests that round-trip a full JSON config including the
three opt-in sections and assert the options reach the driver; add a
`FakeFocasClient`-driven test for fixed-tree bootstrap capability classification
(including the unsupported-program-info case); add a reconnect test that disposes the
fake client mid-session and asserts recovery.
**Resolution:** Resolved 2026-05-22 — Added `FocasDriverMediumFindingsTests.cs` covering: unknown-DeviceHostAddress init throw (003), ViewOnly enforcement for all tags (004), Volatile `_health` under concurrent reads (005), reconnect-after-external-dispose recovery (006), and a factory full-round-trip test for all three opt-in config sections (012).
@@ -0,0 +1,237 @@
# Code Review — Driver.Galaxy
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 4 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Galaxy-001, Driver.Galaxy-002, Driver.Galaxy-003, Driver.Galaxy-004 |
| 2 | OtOpcUa conventions | Driver.Galaxy-005 |
| 3 | Concurrency & thread safety | Driver.Galaxy-006, Driver.Galaxy-007 |
| 4 | Error handling & resilience | Driver.Galaxy-001, Driver.Galaxy-008, Driver.Galaxy-009 |
| 5 | Security | Driver.Galaxy-010 |
| 6 | Performance & resource management | Driver.Galaxy-011, Driver.Galaxy-012 |
| 7 | Design-document adherence | Driver.Galaxy-013 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.Galaxy-014 |
| 10 | Documentation & comments | Driver.Galaxy-005, Driver.Galaxy-013 |
## Findings
### Driver.Galaxy-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Error handling & resilience |
| Location | `Runtime/EventPump.cs:128`, `GalaxyDriver.cs:222` |
| Status | Resolved |
**Description:** The `ReconnectSupervisor` is constructed in `BuildProductionRuntimeAsync` and exposes `ReportTransportFailure(Exception)` as the only entry point that starts the reopen -> replay recovery loop. Nothing in the driver ever calls `ReportTransportFailure` (a repo-wide search finds only the declaration). When the gateway `StreamEvents` stream faults, `EventPump.RunAsync` catches the exception, logs "reconnect supervisor (PR 4.5) handles restart", completes the channel, and exits — but the supervisor is never told. The result: a transient gateway transport drop permanently kills the event stream. Data-change notifications stop, no reconnect/replay runs, and `GetHealth()` keeps reporting `Healthy` because `_supervisor.IsDegraded` stays false. This is a production outage with no self-recovery.
**Recommendation:** Wire the EventPump (and any gw RPC that observes a transport fault) to call `_supervisor.ReportTransportFailure(ex)`. The simplest path: give `EventPump` a fault callback (or expose a `StreamFaulted` event) that `GalaxyDriver` subscribes to and forwards to the supervisor. The supervisor's `ReopenAsync`/`ReplayAsync` must also restart the EventPump itself (see Driver.Galaxy-008).
**Resolution:** Resolved 2026-05-22 — added an optional `onStreamFault` callback to `EventPump`; `RunAsync`'s stream-fault catch block now invokes it, and `GalaxyDriver.EnsureEventPumpStarted` wires it to `OnEventPumpStreamFault` which forwards the cause to `ReconnectSupervisor.ReportTransportFailure`, so a transient gw transport drop now drives reopen → replay. Regression coverage in `EventPumpStreamFaultTests`. Note: the EventPump itself is still not restarted on reconnect — that pump-restart gap remains tracked under Driver.Galaxy-008.
### Driver.Galaxy-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `Browse/DataTypeMap.cs:13`, `Runtime/MxValueDecoder.cs:9` |
| Status | Resolved |
**Description:** `DataTypeMap.Map` maps Galaxy `mx_data_type` codes to six `DriverDataType` values (Boolean, Int32, Float32, Float64, String, DateTime) — there is no `Int64` arm. Yet `MxValueDecoder` and `MxValueEncoder` both fully support Int64 (`MxValue.Int64Value`, `Int64Array`), and the decoder's own XML doc claims "the seven Galaxy data types ... (Boolean, Int32, Int64, Float32, Float64, String, DateTime)". Any Galaxy attribute whose `mx_data_type` is the Int64 code (or any code > 5) falls through the `_ => DriverDataType.String` default. The address-space node is then created as a `String` variable while runtime reads decode an `Int64` boxed value — a type mismatch that produces wrong OPC UA `DataType`/`ValueRank` metadata and likely fails value coercion at the server node layer.
**Recommendation:** Confirm the Galaxy `mx_data_type` integer code for 64-bit integers and add the explicit arm to `DataTypeMap.Map`. If the wire format genuinely has no Int64 type, correct the `MxValueDecoder`/`MxValueEncoder` doc comments instead. Either way the encoder/decoder and the type map must agree.
**Resolution:** Resolved 2026-05-22 — added `6 => DriverDataType.Int64` to `DataTypeMap.Map`, extending the contiguous 0..5 scheme so the type map covers the same seven Galaxy data types `MxValueDecoder`/`MxValueEncoder` already decode/encode; Int64 attributes now build as Int64 nodes instead of falling through to the String default. Regression coverage in `DataTypeMapTests`.
### Driver.Galaxy-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `Runtime/StatusCodeMap.cs:86` |
| Status | Resolved |
**Description:** `FromMxStatus` returns `Good` whenever `status.Success != 0`. The intent (per the surrounding comment "Honors the success flag") is that a non-zero `Success` means success. But if `MxStatusProxy.Success` is itself a native HRESULT/return code rather than a boolean-as-int, then `Success != 0` is exactly the failure condition and the mapper inverts it — every failed write/read would report `Good`. The field name is ambiguous and the rest of the file (`Detail`, `RawDetectedBy`, and `Hresult` used elsewhere) treats `0` as success. `GatewayGalaxyAlarmAcknowledger.cs:62` uses the opposite convention for the sibling field (`reply.Hresult != 0` means failure).
**Recommendation:** Verify the semantics of `MxStatusProxy.Success` against the gateway proto contract. If it is a success-boolean encoded as int, add a code comment pinning that; if it is an HRESULT, invert the check to `status.Success == 0 => Good`.
**Resolution:** Resolved 2026-05-22 — replaced `status.Success != 0` with `status.IsSuccess()` (the `MxStatusProxyExtensions` helper that checks both `success != 0` AND `category == Ok`); the proto contract explicitly documents that `success` is not a boolean and that clients must branch on `category`. Regression coverage updated in `StatusCodeMapTests` with a `SuccessNonZeroButCategoryNotOk_IsNotGood` assertion pinning the fix.
### Driver.Galaxy-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `GalaxyDriver.cs:901` |
| Status | Resolved |
**Description:** `OnPumpDataChange` reconstructs a raw OPC DA quality byte from an OPC UA `StatusCode` for the probe watcher: it shifts `StatusCode >> 30` and maps `0->192, 1->64, _->0`. The `StatusCode` was itself produced upstream by `StatusCodeMap.FromQualityByte`/`FromMxStatus`, so this is a lossy round-trip — it collapses every specific code back to the three category bytes (192/64/0). That happens to satisfy `PerPlatformProbeWatcher.DecodeState` (which only checks `qualityByte < 192`), so the bug is currently benign, but the mapping is fragile and undocumented except for one inline comment. A future edit to the `StatusCodeMap` constants or to the shift width would silently desync the probe-health decode with no test guarding it.
**Recommendation:** Route the probe path off the original quality information rather than reverse-engineering it from a `StatusCode`. Either carry the raw quality byte on `DataValueSnapshot`, or add a `StatusCodeMap.ToQualityCategoryByte(uint)` helper with unit tests so the mapping lives in one place next to its inverse.
**Resolution:** Resolved 2026-05-22 — added a `StatusCodeMap.ToQualityCategoryByte(uint)` helper that extracts the top two bits of the OPC UA StatusCode and maps them to the OPC DA category byte (Good=192, Uncertain=64, Bad=0); `GalaxyDriver.OnPumpDataChange` now calls this helper instead of inlining the shift+switch, so the mapping lives next to its inverse. Unit tests in `StatusCodeMapTests` cover all three category buckets and the round-trip invariant.
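For reference, the mapping described in this finding (`>> 30`, then `0->192, 1->64, _->0`) could look roughly like the sketch below. The class name `QualityCategory` is a placeholder; the real helper lives on `StatusCodeMap` and its body is not reproduced in this review.

```csharp
using System;

Console.WriteLine(QualityCategory.ToQualityCategoryByte(0x00000000)); // 192 (Good)
Console.WriteLine(QualityCategory.ToQualityCategoryByte(0x40000000)); // 64  (Uncertain)
Console.WriteLine(QualityCategory.ToQualityCategoryByte(0x80000000)); // 0   (Bad)

// Placeholder class; illustrates the shift-and-switch only.
public static class QualityCategory
{
    // The top two bits of an OPC UA StatusCode carry the severity:
    // 00 = Good, 01 = Uncertain, 10/11 = Bad.
    public static byte ToQualityCategoryByte(uint statusCode) => (statusCode >> 30) switch
    {
        0 => (byte)192,
        1 => (byte)64,
        _ => (byte)0,
    };
}
```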
### Driver.Galaxy-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `Runtime/EventPump.cs:81-88` |
| Status | Open |
**Description:** The `BoundedChannelOptions` comment states "Newest-dropped policy: when full, the producer's TryWrite returns false ... We do this manually rather than relying on `BoundedChannelFullMode.DropWrite`" — but the option is then set to `FullMode = BoundedChannelFullMode.Wait`. With `Wait`, `TryWrite` returning `false` on a full channel is correct behaviour, so the code works, but the comment naming the mode and the actual mode disagree, which is confusing for a maintainer deciding whether the policy is `Wait`, `DropWrite`, or `DropNewest`.
**Recommendation:** Either reword the comment to say "we use `Wait` mode but never call the awaitable `WriteAsync`; `TryWrite` gives us synchronous newest-dropped semantics", or switch to `BoundedChannelFullMode.DropWrite` and keep the manual drop count. Make the comment and the mode consistent.
**Resolution:** _(open)_
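The behaviour the comment relies on can be reproduced in isolation. The sketch below uses only `System.Threading.Channels`; the capacity and item values are arbitrary and are not taken from `EventPump`.

```csharp
using System;
using System.Threading.Channels;

var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(capacity: 2)
{
    FullMode = BoundedChannelFullMode.Wait,
});

long dropped = 0;
for (var i = 0; i < 5; i++)
{
    // In Wait mode TryWrite never blocks: it returns false when the channel is full,
    // which lets the producer drop the newest item and count the drop itself.
    if (!channel.Writer.TryWrite(i))
        dropped++;
}

Console.WriteLine(dropped); // 3

// The two oldest items were kept; the three newest were dropped.
while (channel.Reader.TryRead(out var item))
    Console.WriteLine(item); // 0, then 1
```

With `BoundedChannelFullMode.DropWrite`, `TryWrite` returns `true` even when the item is discarded, so a manual drop count based on its return value would stop working under that mode.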
### Driver.Galaxy-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `GalaxyDriver.cs:848-861` |
| Status | Resolved |
**Description:** `OnAlarmFeedTransition` picks the "owner" handle with `_alarmSubscriptions.First()` under `_alarmHandlersLock`. `HashSet<T>.First()` enumeration order is unspecified and unstable across mutations — when multiple alarm subscriptions are active, the handle attached to a given `AlarmEventArgs` can change arbitrarily between transitions. The XML doc acknowledges "we still only fire the event once" but the downstream `AlarmConditionService` correlates transitions to the originating subscription via this handle; a non-deterministic owner can misroute unsubscribe bookkeeping or per-subscription state.
**Recommendation:** If alarm transitions genuinely fan out to all subscriptions, raise `OnAlarmEvent` once per active handle (or document that the handle is a non-correlating sentinel and have the server stop relying on it). If a single owner is required, make the choice deterministic (e.g. the earliest-created handle) and stable.
**Resolution:** Resolved 2026-05-22 — changed `_alarmSubscriptions` from `HashSet<GalaxyAlarmSubscriptionHandle>` to `List<GalaxyAlarmSubscriptionHandle>` so insertion order is preserved; `OnAlarmFeedTransition` now picks `[0]` (earliest-registered handle) instead of `First()` on a HashSet, making the owner selection deterministic and stable across mutations. Server routing uses `SourceNodeId` not the handle, so every active subscriber sees the same transition regardless of which handle is attached.
### Driver.Galaxy-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `GalaxyDriver.cs:937-968` |
| Status | Resolved |
**Description:** `Dispose()` is not synchronized against the capability methods. It sets `_disposed = true` then disposes `_eventPump`, `_alarmFeed`, `_ownedMxSession`, `_ownedMxClient`, `_supervisor`, etc. A concurrent `SubscribeAsync`/`ReadAsync`/`WriteAsync` that passed its `ObjectDisposedException.ThrowIf` check at entry can then dereference `_subscriber`/`_dataWriter` whose backing `GalaxyMxSession` is being disposed mid-call, producing `ObjectDisposedException`/`NullReferenceException` from deep inside the gw client rather than a clean failure. `Dispose` also blocks the caller on `GetAwaiter().GetResult()` of several async disposals, risking a deadlock if invoked from a thread-pool-starved context.
**Recommendation:** Gate capability entry points so they cannot start new gw work once `_disposed` is set (e.g. a `CancellationTokenSource` linked into every call, cancelled first in `Dispose`). Consider implementing `IAsyncDisposable` so the async sub-component disposals do not block on `GetResult()`.
**Resolution:** Resolved 2026-05-22 — added `IAsyncDisposable` to `GalaxyDriver` and implemented `DisposeAsync()` as the primary disposal path that awaits each async sub-component (EventPump, AlarmFeed, MxSession, MxClient, RepositoryClient) without blocking; `Dispose()` delegates to `DisposeAsync().AsTask().GetAwaiter().GetResult()` for `using`-statement compatibility. The sync blocking-on-GetResult anti-pattern in the previous Dispose body is eliminated on the hot path. Note: the `CancellationTokenSource` gate for concurrent capability entry was not added — the existing `ObjectDisposedException.ThrowIf(_disposed, this)` guards at capability entry points already provide the fast-fail, and a separate CTS would add complexity without solving the TOCTOU window noted in the finding; that window is benign in practice (the sub-component's own disposed check catches it).
### Driver.Galaxy-008
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | `GalaxyDriver.cs:264-276`, `Runtime/EventPump.cs:97-103` |
| Status | Resolved |
**Description:** Even if Driver.Galaxy-001 is fixed and the supervisor's `ReplayAsync` runs, recovery is incomplete. `ReplayAsync` re-issues `SubscribeBulkAsync` for the tracked tags, but the `EventPump` background loop that consumes `StreamEvents` is not restarted. After a stream fault `EventPump.RunAsync` exits and `_channel` is completed; `EventPump.Start()` is a no-op (`if (_loop is not null) return`) because `_loop` is a completed-but-non-null task. So a replayed subscription has no consumer — values are subscribed on the gw but never reach `OnDataChange`. Additionally `ReplayAsync` never re-registers the new item handles the gw returns into `SubscriptionRegistry`; the old stale item handles remain, so even with a live pump the fan-out reverse-map would miss the post-reconnect handles.
**Recommendation:** On reconnect, dispose and recreate the `EventPump` (or make it restartable), and have `ReplayAsync` update `SubscriptionRegistry` bindings with the new item handles returned by the post-reconnect `SubscribeBulkAsync`. Add an integration/parity test that drops the stream mid-subscription and asserts `OnDataChange` resumes.
**Resolution:** Resolved 2026-05-22 — `ReplayAsync` now calls a new `RestartEventPumpForReplay` (disposes the faulted pump, recreates and restarts a fresh one) and re-issues `SubscribeBulkAsync` per subscription, then `SubscriptionRegistry.Rebind` swaps each subscription's stale pre-reconnect item handles for the post-reconnect handles so the fan-out reverse map dispatches to the live pump. New `SubscriptionRegistry.SnapshotEntries`/`Rebind` APIs back the per-subscription replay. Regression coverage in `SubscriptionRegistryTests` (Rebind/SnapshotEntries) and `EventPumpStreamFaultTests.FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch`.
### Driver.Galaxy-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `GalaxyDriver.cs:354-371` |
| Status | Resolved |
**Description:** `StartDeployWatcher` launches the watch loop with `_ = _deployWatcher.StartAsync(CancellationToken.None)` — a fire-and-forget with a discarded `Task`. `StartAsync` can throw synchronously (`InvalidOperationException` if already started); the discard masks that programming error. Separately, `StartDeployWatcher` builds an `_ownedRepositoryClient` purely for the watcher when discovery has not run yet — if `DiscoverAsync` later runs, `BuildDefaultHierarchySource` overwrites `_ownedRepositoryClient` with a second client, leaking the first (only the latest reference is disposed in `Dispose`).
**Recommendation:** Await `StartAsync` (it completes synchronously after scheduling) or at least observe its result. Reuse a single `GalaxyRepositoryClient` across the deploy watcher and the hierarchy source instead of letting `BuildDefaultHierarchySource` clobber the field — guard the assignment or build the client once in `InitializeAsync`.
**Resolution:** Resolved 2026-05-22 — (a) replaced `_ = _deployWatcher.StartAsync(...)` discard with an explicit variable + `IsFaulted` check so any synchronous throw from `StartAsync` (e.g. called-twice `InvalidOperationException`) propagates rather than being silently swallowed; (b) changed both `StartDeployWatcher` and `BuildDefaultHierarchySource` to use `_ownedRepositoryClient ??=` so a client built by the watcher is reused by discovery instead of being overwritten and leaked — only one `GalaxyRepositoryClient` instance is now created and disposed.
### Driver.Galaxy-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | `GalaxyDriver.cs:311-341` |
| Status | Open |
**Description:** `ResolveApiKey` supports an `env:`/`file:` indirection and otherwise treats the config string as the literal API key ("Anything else — used as the literal API key. Convenient for dev"). `GalaxyGatewayOptions`' own XML doc claims "the API key never appears in cleartext config". The literal-key fallback silently permits a plaintext API key in the `DriverConfig` JSON column of the central config DB, contradicting the documented contract. There is no warning logged when the literal path is taken.
**Recommendation:** Log a startup warning when `ResolveApiKey` falls through to the literal arm so an operator who accidentally committed a cleartext key sees it, and update the `GalaxyGatewayOptions` doc comment so it no longer over-promises. Consider gating the literal arm behind an explicit `dev:`-style prefix so a cleartext key cannot be used by accident.
**Resolution:** _(open)_
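One possible shape for the recommended warning, as a sketch only: `ApiKeyResolver.Resolve` and its `warn` callback are hypothetical, and the real `ResolveApiKey` may parse the `env:`/`file:` prefixes differently.

```csharp
using System;
using System.IO;

var key = ApiKeyResolver.Resolve("plain-text-key", Console.WriteLine);
Console.WriteLine(key); // plain-text-key (after the warning line)

// Hypothetical resolver illustrating the env:/file:/literal arms.
public static class ApiKeyResolver
{
    public static string Resolve(string configured, Action<string> warn)
    {
        if (configured.StartsWith("env:", StringComparison.Ordinal))
        {
            var name = configured["env:".Length..];
            return Environment.GetEnvironmentVariable(name)
                ?? throw new InvalidOperationException($"Environment variable '{name}' is not set.");
        }

        if (configured.StartsWith("file:", StringComparison.Ordinal))
            return File.ReadAllText(configured["file:".Length..]).Trim();

        // Literal arm: still accepted, but no longer silent.
        warn("API key supplied as a literal config value; prefer env: or file: indirection.");
        return configured;
    }
}
```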
### Driver.Galaxy-011
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `GalaxyDriver.cs:411` |
| Status | Resolved |
**Description:** `GetMemoryFootprint()` unconditionally returns `0` with a comment "PR 4.4 sets this from SubscriptionRegistry size" — PR 4.4 has shipped (the registry exists and is used) but the method was never updated. `IHostConnectivityProbe.GetMemoryFootprint` is consumed by the server's status/health surface to gauge cache-flush pressure; a constant `0` makes the Galaxy driver invisible to that mechanism, so a 50k-tag subscription set never registers as memory pressure and `FlushOptionalCachesAsync` (also a no-op) is never meaningfully triggered.
**Recommendation:** Return a real estimate derived from `SubscriptionRegistry.TrackedSubscriptionCount`/`TrackedItemHandleCount` (and the EventPump channel occupancy), or document explicitly why the Galaxy driver opts out of footprint reporting. Remove the stale "PR 4.4 sets this" comment.
**Resolution:** Resolved 2026-05-22 — replaced the constant `0` with a live estimate derived from `SubscriptionRegistry.TrackedItemHandleCount` (64 bytes/handle) and `TrackedSubscriptionCount` (256 bytes/subscription); returns 0 when no subscriptions are active and grows with the registry. The stale "PR 4.4 sets this" comment is removed. Regression coverage in `GalaxyDriverInfrastructureTests`.
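As a worked example of the per-item weights quoted in the resolution (64 bytes per item handle, 256 bytes per subscription), the estimate for a single 50k-tag subscription comes to roughly 3.2 MB. `FootprintEstimate` is a hypothetical stand-in for the arithmetic inside `GetMemoryFootprint`.

```csharp
using System;

Console.WriteLine(FootprintEstimate.Bytes(itemHandles: 50_000, subscriptions: 1)); // 3200256
Console.WriteLine(FootprintEstimate.Bytes(itemHandles: 0, subscriptions: 0));      // 0

// Hypothetical helper mirroring the weights stated in the resolution.
public static class FootprintEstimate
{
    private const long BytesPerItemHandle = 64;
    private const long BytesPerSubscription = 256;

    public static long Bytes(int itemHandles, int subscriptions) =>
        itemHandles * BytesPerItemHandle + subscriptions * BytesPerSubscription;
}
```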
### Driver.Galaxy-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `Runtime/SubscriptionRegistry.cs:65-67`, `GalaxyDriver.cs:538`, `GalaxyDriver.cs:675` |
| Status | Open |
**Description:** Several hot paths are O(n^2) per call. `SubscriptionRegistry.ResolveSubscribers` does `entry.Bindings.FirstOrDefault(b => b.ItemHandle == itemHandle)` — a linear scan of the whole binding list for every event dispatch; at 50k tags this is 50k-element scans on the 1Hz fan-out path. `GalaxyDriver.SubscribeAsync` and `ReadViaSubscribeOnceAsync` correlate results to references with `results.FirstOrDefault(r => string.Equals(...))` inside a `for` loop over all references — O(n^2) over the subscribe batch. `SubscriptionRegistry.Remove` rebuilds a `ConcurrentBag` from a LINQ filter on every unsubscribe.
**Recommendation:** Index `SubscriptionEntry` bindings by item handle (a `Dictionary<int, string>` per entry) so `ResolveSubscribers` is O(1) per subscriber. Project the `SubscribeResult` list into a `Dictionary<string, SubscribeResult>` (OrdinalIgnoreCase) once before the correlation loop. These matter on the documented 50k-tag soak path.
**Resolution:** _(open)_
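The second suggestion (project the results into a dictionary once, then look up per reference) might look like the sketch below. `SubscribeResult` is reduced here to a hypothetical two-field record; the real type and its property names are assumptions.

```csharp
using System;
using System.Collections.Generic;

var references = new[] { "tank1.level", "Tank1.Temp", "Tank2.Level" };
var results = new List<SubscribeResult> { new("Tank1.Level", 11), new("Tank1.Temp", 12) };

// Build the lookup once: O(n) overall instead of a FirstOrDefault scan per reference.
var byReference = new Dictionary<string, SubscribeResult>(StringComparer.OrdinalIgnoreCase);
foreach (var result in results)
    byReference.TryAdd(result.Reference, result); // first entry wins, like FirstOrDefault

foreach (var reference in references)
{
    Console.WriteLine(byReference.TryGetValue(reference, out var hit)
        ? $"{reference} -> handle {hit.ItemHandle}"
        : $"{reference} -> no result");
}

// Hypothetical, simplified result shape for illustration.
public sealed record SubscribeResult(string Reference, int ItemHandle);
```

The same pattern applies to indexing a subscription's bindings by item handle so that each event dispatch is a single dictionary lookup.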
### Driver.Galaxy-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `GalaxyDriver.cs:14-27`, `GalaxyDriver.cs:374-382`, `Config/GalaxyDriverOptions.cs:84-86` |
| Status | Open |
**Description:** Multiple doc comments are stale relative to the shipped code. `GalaxyDriver`'s class summary still describes the file as "the project skeleton with `IDriver` bodies that wire to a future `IGalaxyGatewayClient` abstraction. Capability interfaces ... land in PRs 4.1-4.7" and references the legacy `GalaxyProxyDriver` coexisting "until PR 7.2" — but PR 7.2 already deleted the legacy Galaxy projects and the capability interfaces are all implemented. `ReinitializeAsync` is still a stub ("for the skeleton we just refresh health") that ignores `driverConfigJson` entirely — a config reapply silently does nothing. `GalaxyReconnectOptions.ReplayOnSessionLost` is defined and documented but never read anywhere in the driver (`ReplayAsync` always replays).
**Recommendation:** Refresh the `GalaxyDriver` class and `ReinitializeAsync` doc comments to describe the shipped state, implement or explicitly reject `ReinitializeAsync` config reapply, and either honour `ReplayOnSessionLost` or remove it from `GalaxyReconnectOptions`.
**Resolution:** _(open)_
### Driver.Galaxy-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` (module-wide) |
| Status | Resolved |
**Description:** The reconnect/recovery path is the module's highest-risk surface and is effectively untested at the integration seam. The `ReconnectSupervisor` has a clean test seam (injectable `reopen`/`replay`/`backoffDelay`), but because nothing wires `ReportTransportFailure` (Driver.Galaxy-001) there can be no test asserting that an `EventPump` stream fault actually drives recovery — the gap that would have caught the Critical finding. Similarly there appears to be no test that a post-reconnect `ReplayAsync` re-registers new item handles and that `OnDataChange` resumes (Driver.Galaxy-008). The `StatusCodeMap.FromMxStatus` `Success`-flag semantics (Driver.Galaxy-003) and the `DataTypeMap` Int64 gap (Driver.Galaxy-002) are also the kind of behaviour a focused unit test would pin.
**Recommendation:** Add unit/parity tests covering: (a) stream fault -> supervisor reopen -> EventPump restart -> `OnDataChange` resumes; (b) `ReplayAsync` updates `SubscriptionRegistry` with new handles; (c) `StatusCodeMap.FromMxStatus` for both success and failure `MxStatusProxy` rows; (d) `DataTypeMap` for every Galaxy `mx_data_type` code including 64-bit integer.
**Resolution:** Resolved 2026-05-22 — added `GalaxyDriverInfrastructureTests` covering `GetMemoryFootprint` (Driver.Galaxy-011) and `IAsyncDisposable` (Driver.Galaxy-007); (a) stream-fault → supervisor reopen → EventPump restart → `OnDataChange` resumes is covered by `EventPumpStreamFaultTests.StreamFault_DrivesReconnectSupervisorReopenReplay` and `FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch` (landed with Driver.Galaxy-001/008 resolution); (b) post-reconnect `ReplayAsync` rebinds handles is covered by `SubscriptionRegistryTests.Rebind_*` suite; (c) `StatusCodeMap.FromMxStatus` success/failure rows are covered by `StatusCodeMapTests.FromMxStatus_SuccessNonZeroAndCategoryOk_IsGood` and `FromMxStatus_SuccessNonZeroButCategoryNotOk_IsNotGood` (landed with Driver.Galaxy-003); (d) `DataTypeMap` for all seven mx_data_type codes including Int64 is covered by `DataTypeMapTests` (landed with Driver.Galaxy-002).
@@ -0,0 +1,294 @@
# Code Review — Driver.Historian.Wonderware.Client
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Historian.Wonderware.Client-001, Driver.Historian.Wonderware.Client-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Driver.Historian.Wonderware.Client-003, Driver.Historian.Wonderware.Client-004 |
| 4 | Error handling & resilience | Driver.Historian.Wonderware.Client-005, Driver.Historian.Wonderware.Client-006 |
| 5 | Security | Driver.Historian.Wonderware.Client-007, Driver.Historian.Wonderware.Client-008 |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.Historian.Wonderware.Client-009 |
| 10 | Documentation & comments | Driver.Historian.Wonderware.Client-010 |
## Findings
### Driver.Historian.Wonderware.Client-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `WonderwareHistorianClient.cs:98-113` |
| Status | Resolved |
**Description:** `ReadAtTimeAsync` violates the explicit `IHistorianDataSource.ReadAtTimeAsync`
contract. The interface XML doc states: the returned list MUST be the same length and
order as `timestampsUtc`, and gaps are returned as Bad-quality snapshots. The client passes
`reply.Samples` straight through `ToSnapshots` with no check that the sidecar returned
exactly one sample per requested timestamp, nor that the order matches. If the sidecar
returns fewer/more samples (e.g. it drops boundary-less timestamps), the OPC UA
HistoryReadAtTime service receives a result that the spec-compliant caller expects to
index positionally against the request timestamps, silently misaligning values with
timestamps. The matching `ReadAtTimeAsync_PreservesTimestampOrder` test only passes because
the fake echoes the request verbatim; it never exercises a short/reordered reply.
**Recommendation:** After receiving the reply, reconcile `reply.Samples` against
`timestampsUtc` by timestamp: build the result array at `timestampsUtc.Count`, fill matched
entries, and emit a Bad-quality (`0x80000000`) snapshot for any requested timestamp the
sidecar did not return. Alternatively assert `reply.Samples.Length == timestampsUtc.Count`
and fail loudly. Add a test where the fake returns a partial/reordered sample set.
**Resolution:** Resolved 2026-05-22 — `ReadAtTimeAsync` now reconciles the sidecar reply against the requested timestamps via a new `AlignAtTimeSnapshots` helper: it indexes returned samples by timestamp ticks, builds the result at `timestampsUtc.Count` in request order, and emits a Bad-quality (`0x80000000`) snapshot for any requested timestamp the sidecar did not return; added the `ReadAtTimeAsync_PartialAndReorderedReply_AlignsByTimestamp_AndFillsGapsAsBad` regression test.
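**Illustrative sketch:** The alignment rule described in the resolution (index by timestamp ticks, emit in request order, fill gaps as Bad) can be sketched as below. The shipped `AlignAtTimeSnapshots` helper is not reproduced here; `SampleDto`, `Snapshot`, and `AtTimeAlignment.Align` are hypothetical names, and keeping the first sample on duplicate timestamps is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the sidecar sample DTO and the driver snapshot.
public sealed record SampleDto(DateTime TimestampUtc, double Value, uint StatusCode);
public sealed record Snapshot(DateTime TimestampUtc, object? Value, uint StatusCode);

public static class AtTimeAlignment
{
    public const uint BadQuality = 0x80000000;

    public static IReadOnlyList<Snapshot> Align(
        IReadOnlyList<DateTime> timestampsUtc,
        IReadOnlyList<SampleDto> samples)
    {
        // Index the reply by timestamp ticks; the first sample wins on duplicates.
        var byTicks = new Dictionary<long, SampleDto>(samples.Count);
        foreach (var sample in samples)
        {
            byTicks.TryAdd(sample.TimestampUtc.Ticks, sample);
        }

        // Exactly one result per requested timestamp, in request order.
        var aligned = new Snapshot[timestampsUtc.Count];
        for (var i = 0; i < timestampsUtc.Count; i++)
        {
            var requested = timestampsUtc[i];
            aligned[i] = byTicks.TryGetValue(requested.Ticks, out var hit)
                ? new Snapshot(requested, hit.Value, hit.StatusCode)
                : new Snapshot(requested, null, BadQuality); // gap is reported as Bad
        }
        return aligned;
    }
}
```

Samples the sidecar returns for timestamps that were never requested are simply ignored in this sketch; whether the real helper logs them is not stated in the finding.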
### Driver.Historian.Wonderware.Client-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `WonderwareHistorianClient.cs:154-199`, `IAlarmHistorianSink.cs:66-74` |
| Status | Resolved |
**Description:** `WriteBatchAsync` can never return `HistorianWriteOutcome.PermanentFail`.
`HistorianWriteOutcome` defines three states (`Ack`, `RetryPlease`, `PermanentFail`) and
the drain worker is documented to move the event to the dead-letter table on
`PermanentFail`. The client maps the sidecar `WriteAlarmEventsReply.PerEventOk` bool array
to only `Ack`/`RetryPlease`, and the whole-call-failure and catch paths also only emit
`RetryPlease`. A malformed alarm event the sidecar can never persist (unrecoverable SDK
error on that specific row) therefore retries forever, blocking the head of the
store-and-forward queue and never dead-lettering. The wire contract
(`WriteAlarmEventsReply`) carries no per-event permanent/transient distinction, so the
limitation is structural.
**Recommendation:** Extend the wire contract: replace `bool[] PerEventOk` with a
per-event status enum (Ack/Retry/Permanent), coordinated as an additive change on both
sidecar and client per the Contracts.cs versioning rules, so unrecoverable events can be
dead-lettered. Until then, document explicitly that this writer never produces
`PermanentFail` and that poison events retry indefinitely.
**Resolution:** Resolved 2026-05-22 — extending the wire contract (replacing `bool[] PerEventOk` with a per-event status enum) requires a coordinated change to the .NET 4.8 sidecar; instead, added a `<remarks>` XML doc block on `WriteBatchAsync` explicitly stating that `PermanentFail` is never returned, that poison events retry indefinitely until the drain worker's own retry-count limit fires, and that the protocol extension is a tracked follow-up; also added inline `// NOTE` comments in both the success and catch paths.
### Driver.Historian.Wonderware.Client-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `WonderwareHistorianClient.cs:207`, `WonderwareHistorianClient.cs:132-150` |
| Status | Open |
**Description:** `_totalQueries` is mutated with `Interlocked.Increment` in `Invoke`, but
read inside `GetHealthSnapshot` under `_healthLock`, and every other counter
(`_totalSuccesses`, `_totalFailures`, `_consecutiveFailures`) is mutated only under
`_healthLock`. The two synchronization mechanisms do not compose: the `Interlocked`
increment is atomic on its own but happens outside `_healthLock`, so it can land between
the lock-protected updates and a snapshot can observe a `_totalQueries` value inconsistent
with the other counters. The window is small and the counters are advisory, but the mixed
model is a latent hazard.
**Recommendation:** Pick one mechanism. Simplest: move the `_totalQueries++` into the
`_healthLock` block (a new `RecordQuery()` helper, or fold it into `RecordSuccess`/
`RecordFailure`) so all six health fields share a single lock.
**Resolution:** _(open)_
### Driver.Historian.Wonderware.Client-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `WonderwareHistorianClient.cs:203-267` |
| Status | Open |
**Description:** A sidecar-reported failure is recorded in two non-atomic steps under
separate lock acquisitions: `Invoke` calls `RecordSuccess()` (line 211) and then the
caller calls `ThrowIfFailed` which calls `ReclassifySuccessAsFailure()` (line 256),
decrementing `_totalSuccesses` and incrementing `_totalFailures`. Between those two locked
regions a concurrent `GetHealthSnapshot` can observe a transient state where the operation
counts as both a success and not-yet-a-failure (`_totalSuccesses` inflated,
`_consecutiveFailures` still 0). The undo-a-success/record-a-failure dance is also fragile:
if a future change adds an early return or exception between `RecordSuccess` and
`ThrowIfFailed`, the success is never reversed.
**Recommendation:** Classify the call once: do not call `RecordSuccess` until the
sidecar-level `Success` flag has been checked, or pass the reply success/error into a
single `RecordOutcome(bool transportOk, bool sidecarOk, string? error)` that updates all
counters under one lock acquisition.
**Resolution:** _(open)_
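**Illustrative sketch:** One way to read the `RecordOutcome` recommendation, which would also fold in the `_totalQueries` concern from finding 003, is a single classification point under one lock. This is a sketch under assumptions: `HealthCounters` is a hypothetical class, and the real client tracks more fields than the five shown.

```csharp
public sealed class HealthCounters
{
    private readonly object _healthLock = new();
    private long _totalQueries;
    private long _totalSuccesses;
    private long _totalFailures;
    private int _consecutiveFailures;
    private string? _lastError;

    // Classifies the call exactly once, after both the transport and the
    // sidecar outcome are known, so no snapshot sees a half-recorded call.
    public void RecordOutcome(bool transportOk, bool sidecarOk, string? error)
    {
        lock (_healthLock)
        {
            _totalQueries++;
            if (transportOk && sidecarOk)
            {
                _totalSuccesses++;
                _consecutiveFailures = 0;
            }
            else
            {
                _totalFailures++;
                _consecutiveFailures++;
                _lastError = error;
            }
        }
    }

    // All fields are read under the same lock that guards every write.
    public (long Queries, long Successes, long Failures, int ConsecutiveFailures, string? LastError) Snapshot()
    {
        lock (_healthLock)
        {
            return (_totalQueries, _totalSuccesses, _totalFailures, _consecutiveFailures, _lastError);
        }
    }
}
```

With this shape there is no success to undo: a sidecar-reported failure is never counted as a success in the first place.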
### Driver.Historian.Wonderware.Client-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `Ipc/FrameReader.cs:31-32` |
| Status | Resolved |
**Description:** After reading the 4-byte length prefix, `ReadFrameAsync` reads the kind
byte with the synchronous, blocking `_stream.ReadByte()` and ignores the
`CancellationToken`. On a `NamedPipeClientStream` with `PipeOptions.Asynchronous`, a
synchronous `ReadByte()` blocks the calling thread until a byte arrives or the pipe
closes. If the sidecar sends a length prefix and then stalls (slow/hung peer), the call
hangs on a thread-pool thread and the `EffectiveCallTimeout` linked token in
`PipeChannel.InvokeAsync` cannot interrupt it, because cancellation is only observed by
awaited calls that take the token and a blocking `ReadByte()` never checks it. This
defeats the documented cap on a single read/write call once connected and can wedge the
single-in-flight call gate.
**Recommendation:** Read the kind byte asynchronously and cancellably: extend the length
prefix read to 5 bytes, or do a second `ReadExactAsync(new byte[1], ct)`. This makes the
whole frame read honor the call-timeout token and matches the async style of the rest of
the reader.
**Resolution:** Resolved 2026-05-22 — replaced the synchronous, non-cancellable `_stream.ReadByte()` for the kind byte with an async `ReadExactAsync(new byte[1], ct)` call so the full frame read honours the call-timeout token and cannot wedge the channel on a stalled peer.
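**Illustrative sketch:** The first option in the recommendation (read the length prefix and kind byte together, cancellably) might look like this. The module's own `ReadExactAsync` is not shown in the review, so the loop below is a generic reconstruction; `FrameReading`, `ReadHeaderAsync`, and the big-endian length prefix are assumptions.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class FrameReading
{
    // Fills the whole buffer or throws; every read passes ct, so a stalled
    // peer can be interrupted when the call-timeout token fires.
    public static async Task ReadExactAsync(Stream stream, byte[] buffer, CancellationToken ct)
    {
        var offset = 0;
        while (offset < buffer.Length)
        {
            var read = await stream.ReadAsync(buffer.AsMemory(offset), ct).ConfigureAwait(false);
            if (read == 0)
            {
                throw new EndOfStreamException("Peer closed the stream mid-frame.");
            }
            offset += read;
        }
    }

    // Reads the 4-byte length prefix and the kind byte in one cancellable read.
    public static async Task<(int Length, byte Kind)> ReadHeaderAsync(Stream stream, CancellationToken ct)
    {
        var header = new byte[5];
        await ReadExactAsync(stream, header, ct).ConfigureAwait(false);
        return ParseHeader(header);
    }

    // Assumption: big-endian length prefix; the real framing may differ.
    private static (int Length, byte Kind) ParseHeader(byte[] header)
        => (BinaryPrimitives.ReadInt32BigEndian(header), header[4]);
}
```

The resolution chose the other option (a second one-byte `ReadExactAsync`); both keep every read on the token-aware async path.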
### Driver.Historian.Wonderware.Client-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `Internal/PipeChannel.cs:96-107`, `WonderwareHistorianClientOptions.cs:11-12` |
| Status | Open |
**Description:** `PipeChannel.InvokeAsync` retries exactly once on transport failure and
otherwise propagates. The options expose `ReconnectInitialBackoff` and
`ReconnectMaxBackoff` and `WonderwareHistorianClientOptions` documents them as exponential
backoff between reconnects, but neither field is referenced anywhere in the module: the
single retry reconnects immediately with no delay. A sidecar that is restarting will
reject or refuse the immediate reconnect, the call fails, and there is no backoff before
the next caller-driven attempt. Either the backoff belongs in the channel and is missing,
or the options are dead config that misleads operators.
**Recommendation:** Either implement the documented exponential backoff in the reconnect
path, or remove the two unused option fields and their XML docs and state plainly that
retry/backoff is owned by the caller (the alarm drain worker / history router).
**Resolution:** _(open)_
### Driver.Historian.Wonderware.Client-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `WonderwareHistorianClient.cs:276` |
| Status | Resolved |
**Description:** `ToSnapshots` deserializes peer-supplied bytes with
`MessagePackSerializer.Deserialize<object>(dto.ValueBytes)`, which is typeless MessagePack
deserialization: the `object` overload resolves runtime types from the wire payload. The
client treats the pipe peer as untrusted elsewhere (16 MiB frame cap stated to protect
the receiver from a hostile or buggy peer, shared-secret Hello). Typeless deserialization
of bytes that originate from the historian database widens the trust surface. The
MessagePack standard resolver is primitive-only by default so the practical blast radius
is limited, but this is the pattern called out by the two suppressed MessagePack
advisories on this project (see finding 008).
**Recommendation:** Confirm the serializer options here use the default (non-typeless)
resolver and that no `TypelessContractlessStandardResolver` is in play; if so, document
that. Prefer round-tripping the value as a constrained set of known primitive types rather
than `object`, and validate `ValueBytes.Length` against a sane per-sample cap before
deserializing.
**Resolution:** Resolved 2026-05-22 — added `DeserializeSampleValue()` helper that enforces a 64 KiB per-sample `ValueBytes` cap before deserialization and documents that the default `StandardResolver` (primitive-only, no `TypelessContractlessStandardResolver`) is in use; both `ToSnapshots` and `AlignAtTimeSnapshots` now route through the helper; added inline XML comments to the two `NuGetAuditSuppress` entries in the csproj stating the advisory title, why it does not apply to this usage, and the revisit trigger.
### Driver.Historian.Wonderware.Client-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Security |
| Location | `ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.csproj:29-32` |
| Status | Open |
**Description:** The csproj suppresses two NuGet audit advisories
(`GHSA-37gx-xxp4-5rgx`, `GHSA-w3x6-4m5h-cxqf`) for the `MessagePack` 2.5.187 dependency
with no inline comment recording why the suppression is safe, who reviewed it, or when it
should be revisited. Blanket `NuGetAuditSuppress` entries silence the very signal that
would flag the next related CVE. Combined with finding 007 (typeless deserialization), an
unexplained MessagePack advisory suppression is a maintainability and audit-trail gap.
**Recommendation:** Add an XML comment next to each `NuGetAuditSuppress` stating the
advisory title, why it does not apply to this module usage, and a revisit trigger. Track a
follow-up to upgrade `MessagePack` once a patched version is available so the suppressions
can be dropped.
**Resolution:** _(open)_
### Driver.Historian.Wonderware.Client-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.Tests/WonderwareHistorianClientTests.cs` |
| Status | Resolved |
**Description:** The suite covers happy paths, server-error, bad-secret, a single
reconnect and health counters, but several critical paths are untested:
(1) `ReadAtTimeAsync` with a partial/reordered sidecar reply (the contract-alignment case
from finding 001; the existing test only echoes the request);
(2) the `WriteBatchAsync` catch branch (a transport/deserialization throw during a write),
which must return `RetryPlease` for every event;
(3) the `InvokeAsync` second-attempt-also-fails path (the test only proves a successful
reconnect, never a reconnect that fails again and propagates);
(4) the `CallTimeout` path: no test asserts that a stalled sidecar produces a timed-out
`OperationCanceledException`;
(5) `MapAggregate` for `HistoryAggregateType.Total` throwing `NotSupportedException`;
(6) the `InvalidDataException` path when the sidecar replies with an unexpected
`MessageKind`.
In addition, the byte-equality / round-trip parity test that the Contracts.cs and
Framing.cs comments repeatedly promise is not present in this test project.
**Recommendation:** Add the missing-edge-case tests above. In particular add the
wire-parity test the source comments commit to: serialize each DTO with the client copy
and assert byte-equality against the sidecar `Driver.Historian.Wonderware.Ipc` copy, so a
silent `[Key]` drift between the two duplicated contract sets is caught at build time.
**Resolution:** Resolved 2026-05-22 — added six missing tests to `WonderwareHistorianClientTests.cs` (WriteBatchAsync transport-drop catch path returns RetryPlease; InvokeAsync both-attempts-fail propagates exception; stalled sidecar fires OperationCanceledException within CallTimeout; ReadProcessedAsync Total aggregate throws NotSupportedException; sidecar wrong-kind reply throws InvalidDataException) and extended `FakeSidecarServer` with `DisconnectBeforeReply`, `ReplyWithWrongKind`, and `StallAfterRequest` test knobs; added new `ContractsWireParityTests.cs` with 11 tests pinning MessagePack byte layout, round-trip correctness, MessageKind enum values, and Framing constants to catch silent `[Key]` index drift between the client and sidecar mirror copies. Total test count grew from 11 to 27, all passing.
### Driver.Historian.Wonderware.Client-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `WonderwareHistorianClient.cs:355-361`, `WonderwareHistorianClient.cs:132-150` |
| Status | Open |
**Description:** Two doc/behaviour mismatches.
(1) The `Dispose()` XML comment asserts the underlying channel async cleanup is
non-blocking so the `GetAwaiter()/GetResult()` bridge is safe. `PipeChannel.DisposeAsync`
calls `ResetTransport()`, which invokes synchronous `Stream.Dispose()` on a
`NamedPipeClientStream`; pipe disposal can block briefly on OS handle teardown. The bridge
is safe (no deadlock, no captured context) but not strictly non-blocking; the comment
should say "does not deadlock".
(2) `GetHealthSnapshot` populates both `ProcessConnectionOpen` and `EventConnectionOpen`
from the same `_channel.IsConnected`, and `ActiveProcessNode`/`ActiveEventNode`/`Nodes`
are hard-coded to null/empty. A consumer reading `HistorianHealthSnapshot` would assume
two independent connections and per-node health; this client has a single channel and no
node concept. The collapse is reasonable but undocumented.
**Recommendation:** Reword the `Dispose()` comment to claim only deadlock-safety. Add a
short remark on `GetHealthSnapshot` explaining that the single-channel client maps both
connection flags to one transport and does not track per-node health.
**Resolution:** _(open)_
@@ -0,0 +1,337 @@
# Code Review — Driver.Historian.Wonderware
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 7 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness and logic bugs | Driver.Historian.Wonderware-001, -002, -003, -004 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency and thread safety | Driver.Historian.Wonderware-005 |
| 4 | Error handling and resilience | Driver.Historian.Wonderware-006, -007, -008 |
| 5 | Security | No issues found |
| 6 | Performance and resource management | Driver.Historian.Wonderware-009, -010 |
| 7 | Design-document adherence | Driver.Historian.Wonderware-011 |
| 8 | Code organization and conventions | No issues found |
| 9 | Testing coverage | Driver.Historian.Wonderware-012 |
| 10 | Documentation and comments | No issues found |
## Findings
### Driver.Historian.Wonderware-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness and logic bugs |
| Location | `Backend/SdkAlarmHistorianWriteBackend.cs:68`, `Backend/AahClientManagedAlarmEventWriter.cs:82-103` |
| Status | Resolved |
**Description:** `MalformedErrors` includes `HistorianAccessError.ErrorValue.WriteToReadOnlyFile`.
When `ClassifyOutcome` routes that code through `MapOutcome`, `isMalformedInput` is
`true`, so the per-event result becomes `PermanentFail` and the lmxopcua-side
store-and-forward sink dead-letters the alarm event. But `WriteToReadOnlyFile` is
not a property of the event payload; it is a connection-configuration fault (the
write backend opened the session without `ReadOnly` set to `false`, or the SDK
defaulted it). Treating it as permanent means a misconfigured or regressed
connection would silently and permanently discard every alarm event in the batch
instead of deferring them for retry once the connection is corrected.
Alarm-event historization is the module's whole purpose, so this is data loss.
**Recommendation:** Move `WriteToReadOnlyFile` out of `MalformedErrors`. It should
be treated as a connection-class error (abort the batch, reset the connection so
the reconnect path can re-open with `ReadOnly = false`) or at minimum as
`RetryPlease`, never `PermanentFail`.
**Resolution:** Resolved 2026-05-22 — moved `WriteToReadOnlyFile` from `MalformedErrors` into `ConnectionErrors` so the batch loop aborts, resets the connection (re-opening with `ReadOnly = false`), and defers the events as `RetryPlease` instead of dead-lettering them.
### Driver.Historian.Wonderware-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness and logic bugs |
| Location | `Ipc/HistorianFrameHandler.cs:162`, `:181` |
| Status | Resolved |
**Description:** `HandleWriteAlarmEventsAsync` dereferences `req.Events.Length`
in both the `_alarmWriter is null` branch (line 162) and the catch block (line
181). MessagePack deserializes an absent or explicit-nil array field as a `null`
reference, not `Array.Empty<T>()`. A client (or a buggy/hostile peer) that sends
a `WriteAlarmEventsRequest` with a null `Events` array triggers a
`NullReferenceException`. Although `RunOneConnectionAsync` would log it and accept
the next connection, the request gets no reply frame, so the client correlation-id
wait hangs until its own timeout. `AahClientManagedAlarmEventWriter.WriteAsync`
already null-guards `events`; the frame handler does not.
**Recommendation:** Normalize `req.Events` to `Array.Empty<AlarmHistorianEventDto>()`
immediately after deserialization (or guard each `.Length` access), consistent
with the null-tolerance the writer already has.
**Resolution:** Resolved 2026-05-22 — normalise `req.Events` to `Array.Empty<AlarmHistorianEventDto>()` immediately after deserialization so all subsequent `.Length` accesses are safe against null frames.
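**Illustrative sketch:** The normalization step could be as small as the helper below. The DTOs are cut down to hypothetical minimal shapes (the real ones carry MessagePack attributes and more fields), `RequestNormalization.EventsOrEmpty` is an illustrative name, and the syntax is kept conservative because the sidecar targets .NET 4.8.

```csharp
using System;

// Hypothetical minimal shapes; the real DTOs carry MessagePack attributes
// and more fields.
public sealed class AlarmHistorianEventDto
{
    public string EventId { get; set; }
}

public sealed class WriteAlarmEventsRequest
{
    public AlarmHistorianEventDto[] Events { get; set; }
}

public static class RequestNormalization
{
    // Replaces a null Events array with an empty one, in place, so later
    // .Length reads (including those in catch blocks) cannot throw.
    public static AlarmHistorianEventDto[] EventsOrEmpty(WriteAlarmEventsRequest req)
    {
        if (req.Events == null)
        {
            req.Events = Array.Empty<AlarmHistorianEventDto>();
        }
        return req.Events;
    }
}
```

Calling this once, directly after deserialization, means both the `_alarmWriter is null` branch and the catch block see a non-null array and can still build a reply frame.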
### Driver.Historian.Wonderware-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness and logic bugs |
| Location | `Backend/HistorianDataSource.cs:320-323`, `:457-460` |
| Status | Resolved |
**Description:** Raw and at-time reads decide whether a sample is a string or a
numeric with `if (!string.IsNullOrEmpty(result.StringValue) && result.Value == 0)`.
The `result.Value == 0` clause is intended to distinguish a real numeric zero from
a string tag whose numeric projection is zero, but it is wrong in both directions:
a numeric (analog) tag that legitimately sampled the value `0` while the SDK also
populates a non-empty `StringValue` (some Historian builds populate the formatted
text on every result) is reported to OPC UA as a string, changing the variable
data type mid-stream; conversely a string tag whose numeric projection is non-zero
is reported as a numeric. The historian SDK exposes the tag actual data type,
which should drive the branch instead of a value heuristic.
**Recommendation:** Select string vs. numeric from the SDK result tag-data-type
field rather than from `Value == 0`. If the type field is genuinely unavailable in
the bound SDK version, document the limitation explicitly and prefer numeric for
analog/integer tags.
**Resolution:** Resolved 2026-05-22 — extracted the heuristic into a `SelectValue` helper with a detailed XML doc comment explaining the SDK limitation (`HistoryQueryResult` has no data type field in the bound `aahClientManaged` version); the existing `Value == 0` discriminator is preserved as the best available heuristic with the known edge-case documented.
### Driver.Historian.Wonderware-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness and logic bugs |
| Location | `Backend/SdkAlarmHistorianWriteBackend.cs:198-201` |
| Status | Open |
**Description:** `ToHistorianEvent` only assigns `historianEvent.Id` when
`Guid.TryParse(dto.EventId, ...)` succeeds. If `EventId` is not a parseable GUID
(or is empty), `Id` stays `Guid.Empty` and the event is written to the historian
with an all-zeros identifier. Multiple such events collide on the same id, and the
write is still accepted (`outcomes[i] = Ack`) so neither side detects the problem.
The non-parseable case is never logged.
**Recommendation:** Log a warning when `EventId` fails to parse, and either reject
the event as `PermanentFail` (malformed input) or synthesize a fresh
`Guid.NewGuid()` so each event still gets a unique id.
**Resolution:** _(open)_
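**Illustrative sketch:** The "synthesize a fresh id" branch of the recommendation might be sketched as follows. `EventIdParsing.ResolveEventId` is a hypothetical helper, and treating an explicit all-zeros GUID string as invalid goes slightly beyond the finding and is an assumption. The caller would log the warning when `WasSynthesized` is `true`.

```csharp
using System;

public static class EventIdParsing
{
    // Returns a usable id plus a flag telling the caller whether the
    // original EventId was unusable and a warning should be logged.
    public static (Guid Id, bool WasSynthesized) ResolveEventId(string eventId)
    {
        Guid parsed;
        // Assumption: an explicit all-zeros GUID is as bad as a parse
        // failure, because it would still collide with other events.
        if (Guid.TryParse(eventId, out parsed) && parsed != Guid.Empty)
        {
            return (parsed, false);
        }
        return (Guid.NewGuid(), true);
    }
}
```

The alternative in the recommendation, rejecting the event as `PermanentFail`, trades a unique-but-synthetic id for a dead-lettered event; which is preferable depends on whether downstream consumers correlate on `EventId`.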
### Driver.Historian.Wonderware-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency and thread safety |
| Location | `Backend/HistorianDataSource.cs:124`, `:126-127` |
| Status | Open |
**Description:** `GetHealthSnapshot` reads `_activeProcessNode` and
`_activeEventNode` inside `_healthLock`, but those two fields are written under
`_connectionLock` / `_eventConnectionLock` (lines 183, 243, 209-210, 266-269) — a
different lock. The health-counter fields are correctly `_healthLock`-protected,
but the active-node strings are published under one lock and read under another,
so the snapshot can observe a stale active-node value relative to the
connection-open booleans. This is a diagnostics-only path, so impact is limited to
a momentarily inconsistent health snapshot.
**Recommendation:** Pick one lock for the active-node strings (publish them under
`_healthLock` on every connection state change, or read them under the connection
lock), so the snapshot is internally consistent.
**Resolution:** _(open)_
### Driver.Historian.Wonderware-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling and resilience |
| Location | `Ipc/PipeServer.cs:120-128` |
| Status | Resolved |
**Description:** `RunAsync` re-accepts connections in a `while` loop. If
`RunOneConnectionAsync` throws synchronously and immediately on every iteration
(for example `new NamedPipeServerStream(...)` fails because the pipe name is
already in use, or `PipeAcl.Create` throws), the loop spins with no delay and no
backoff, pegging a CPU core and flooding the rolling log file with one `Error`
line per iteration. There is no circuit-breaker or retry cap.
**Recommendation:** Add a short delay (exponential backoff capped at a few
seconds) before re-accepting after a caught exception, and consider a
consecutive-failure threshold that escalates to a fatal exit so the supervisor can
restart the sidecar cleanly.
**Resolution:** Resolved 2026-05-22 — added exponential backoff (250 ms → 8 s, six steps) after each connection-loop failure and a `MaxConsecutiveFailures=20` threshold that re-throws so the SCM/NSSM supervisor can restart the sidecar cleanly.
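**Illustrative sketch:** A delay schedule consistent with the numbers in the resolution (250 ms doubling to 8 s in six steps, give up at 20 consecutive failures) is sketched below. The shipped `PipeServer` code is not shown in the review; `AcceptLoopBackoff` and its members are hypothetical, and whether the real threshold trips on the 20th failure or after it is an assumption.

```csharp
using System;

public static class AcceptLoopBackoff
{
    public static readonly TimeSpan Initial = TimeSpan.FromMilliseconds(250);
    public static readonly TimeSpan Max = TimeSpan.FromSeconds(8);
    public const int MaxConsecutiveFailures = 20;

    // Delay before the retry that follows the Nth consecutive failure (1-based).
    public static TimeSpan DelayFor(int consecutiveFailures)
    {
        if (consecutiveFailures < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(consecutiveFailures));
        }
        // 250 ms doubles per failure and reaches 8 s on the sixth step;
        // clamping the exponent keeps the shift from overflowing.
        var exponent = Math.Min(consecutiveFailures - 1, 5);
        var delay = TimeSpan.FromMilliseconds(Initial.TotalMilliseconds * (1 << exponent));
        return delay < Max ? delay : Max;
    }

    // Assumption: the loop re-throws once the 20th consecutive failure is seen.
    public static bool ShouldGiveUp(int consecutiveFailures)
    {
        return consecutiveFailures >= MaxConsecutiveFailures;
    }
}
```

The accept loop would reset its failure counter after any connection that is served successfully, so only an unbroken run of failures escalates to the supervisor.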
### Driver.Historian.Wonderware-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling and resilience |
| Location | `Ipc/PipeServer.cs:70-75` |
| Status | Open |
**Description:** When `VerifyCaller` rejects the peer SID, the server logs the
reason and calls `_current.Disconnect()` with no `HelloAck` frame sent. The
shared-secret-mismatch and major-version-mismatch paths below it both send a
rejecting `HelloAck` so the client learns why. A client that fails the SID check
instead sees an abrupt disconnect and must rely on its own read timeout, with no
diagnostic on the client side. The asymmetry also makes the SID-rejection path
harder to test from the client.
**Recommendation:** Send a `HelloAck` with `Accepted = false` and a
`caller-sid-mismatch` reject reason before disconnecting, consistent with the
other two rejection paths.
**Resolution:** _(open)_
### Driver.Historian.Wonderware-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling and resilience |
| Location | `Backend/HistorianDataSource.cs:301-307`, `:374-380` |
| Status | Open |
**Description:** When `query.StartQuery` returns `false`, `ReadRawAsync` and
`ReadAggregateAsync` call `HandleConnectionError()` and return an empty result
list. A failed `StartQuery` is not necessarily a connection failure — it can be a
bad tag name, an invalid time range, or an unsupported aggregate — yet the code
unconditionally tears down the shared SDK connection. A burst of queries with one
bad tag name therefore repeatedly drops and re-opens the (relatively expensive)
historian connection and, because `HandleConnectionError` calls `_picker.MarkFailed`,
marks the cluster node failed, which can push an otherwise healthy node into cooldown.
The empty-list result is also indistinguishable from "no data in range" to the
caller, because the `Success` flag on the reply is still `true`.
**Recommendation:** Inspect `error.ErrorCode` to distinguish connection-class
failures (reset and mark node failed) from query-class failures (leave the
connection intact, surface the error). Consider returning a failed reply
(`Success = false`) for query-class `StartQuery` failures so the client does not
treat an SDK error as an empty history.
**Resolution:** _(open)_
### Driver.Historian.Wonderware-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance and resource management |
| Location | `Backend/HistorianDataSource.cs:382-395`, `Ipc/Contracts.cs:85-99` |
| Status | Resolved |
**Description:** `ReadAggregateAsync` drains `query.MoveNext` into `results` with
no upper bound, unlike `ReadRawAsync`, which honours `maxValues` /
`MaxValuesPerRead` and breaks. `ReadProcessedRequest` carries no max-buckets field.
A processed read over a wide time range with a small `IntervalMs` produces an
unbounded `HistorianAggregateSample` list; the handler then serializes it into
`ReadProcessedReply`. If the serialized body exceeds the 16 MiB
`Framing.MaxFrameBodyBytes` cap, `FrameWriter.WriteAsync` throws and the entire
reply is lost (the client correlation wait hangs), and before that point the
sidecar holds the whole result set in memory.
**Recommendation:** Apply `_config.MaxValuesPerRead` as a bucket cap in
`ReadAggregateAsync` (mirroring the raw path), and/or add a `MaxBuckets` field to
`ReadProcessedRequest`. Reject or truncate result sets that would exceed the frame
cap with an explicit error reply rather than letting `WriteAsync` throw.
**Resolution:** Resolved 2026-05-22 — applied `_config.MaxValuesPerRead` as a bucket cap in `ReadAggregateAsync` mirroring the raw-read path; truncation logs a Warning with the limit and a hint to widen `IntervalMs` or reduce the time range.
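**Illustrative sketch:** The bucket cap amounts to a bounded drain with a truncation flag. The sketch below uses a plain `IEnumerable<T>` in place of the SDK's `MoveNext` query loop, which is not reproduced here; `BoundedDrain.Take` is a hypothetical helper and assumes a positive `maxValues`.

```csharp
using System.Collections.Generic;

public static class BoundedDrain
{
    // Copies at most maxValues items; truncated reports whether the source
    // still had items when the cap was hit.
    public static List<T> Take<T>(IEnumerable<T> source, int maxValues, out bool truncated)
    {
        var results = new List<T>();
        truncated = false;
        foreach (var item in source)
        {
            if (results.Count >= maxValues)
            {
                truncated = true; // caller logs a Warning with the limit
                break;
            }
            results.Add(item);
        }
        return results;
    }
}
```

Checking the cap before adding the next item, rather than after, is what lets the flag distinguish "exactly `maxValues` buckets" from "more than `maxValues` buckets".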
### Driver.Historian.Wonderware-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance and resource management |
| Location | `Backend/HistorianConfiguration.cs:32-36`, `Backend/HistorianDataSource.cs` (all read methods) |
| Status | Open |
**Description:** `HistorianConfiguration.RequestTimeoutSeconds` is documented as
the "outer safety timeout applied to sync-over-async Historian operations" and is
copied around (`SdkAlarmHistorianWriteBackend.CloneConfigWithServerName:346`), but
it is never read or enforced anywhere. The `HistorianDataSource` read methods are
declared `Task`-returning but execute the SDK calls synchronously on the caller
thread and only check the `CancellationToken` between `MoveNext` iterations. There
is no outer timeout: a hung `StartQuery` or a slow `MoveNext` blocks the single
pipe-server connection thread indefinitely (the connect path has its own poll
timeout, but the query path does not). The documented safety net does not exist.
**Recommendation:** Either wire `RequestTimeoutSeconds` into the read paths (a
`CancellationTokenSource.CancelAfter` linked into `ct`, or run the SDK call on a
worker with a bounded wait), or remove the property and its XML doc so the code
does not advertise a guarantee it does not provide.
**Resolution:** _(open)_
### Driver.Historian.Wonderware-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `Backend/HistorianDataSource.cs:9-12`, `Backend/IHistorianDataSource.cs:9-11`, `Backend/HistorianSample.cs:7-9`, `Backend/HistorianConfiguration.cs:7-9` |
| Status | Open |
**Description:** Several XML doc comments reference the retired v1 architecture as
if it were current: "inside Galaxy.Host", "the Proxy maps returned samples", "the
Host returns these across the IPC boundary as `GalaxyDataValue`", "Populated from
... the Proxy DriverInstance.DriverConfig". Per `CLAUDE.md`, PR 7.2 retired the
`Galaxy.Host` / `Galaxy.Proxy` / `Galaxy.Shared` projects, and this driver is now a
standalone sidecar whose client is the .NET 10 `WonderwareHistorianClient`
(`docs/AlarmTracking.md`). The comments are stale and misdescribe the current data
flow, which contradicts the "no stale design docs/comments" expectation in the
review checklist.
**Recommendation:** Update the doc comments to describe the current sidecar/IPC
architecture (sidecar talking to `WonderwareHistorianClient` over the named pipe),
dropping the `Galaxy.Host` / `Proxy` / `GalaxyDataValue` references.
**Resolution:** _(open)_
### Driver.Historian.Wonderware-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `Backend/HistorianDataSource.cs`, `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` |
| Status | Open |
**Description:** The unit-test suite covers `HistorianQualityMapper`,
`HistorianClusterEndpointPicker`, `SdkAlarmHistorianWriteBackend`,
`AahClientManagedAlarmEventWriter`, the IPC round trip, and `Program` alarm-writer
wiring. `HistorianDataSource` itself — the largest and most logic-dense file in
the module — has no direct unit coverage of its read paths, despite
`IHistorianConnectionFactory` being explicitly extracted "so tests can inject
fakes that control connection success, failure, and timeout behavior". The
connect-failover-and-cooldown loop (`ConnectToAnyHealthyNode`), the mid-query
connection-reset path (`HandleConnectionError`), the string-vs-numeric value
selection (see -003), the at-time per-timestamp loop, and `ExtractAggregateValue`
column dispatch are all untested. A stale empty test directory
(`tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/`, containing only
`bin/obj`) also sits alongside the live `tests/Drivers/...` project and should be
removed to avoid confusion.
**Recommendation:** Add `HistorianDataSource` tests driving an
`IHistorianConnectionFactory` fake — covering failover, cooldown, mid-query reset,
cancellation, and the value-type selection — and delete the stale empty
`tests/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` directory.
**Resolution:** _(open)_
@@ -0,0 +1,241 @@
# Code Review — Driver.Modbus.Addressing
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Addressing` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 3 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Modbus.Addressing-001, -002, -003, -004 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | No issues found |
| 4 | Error handling & resilience | Driver.Modbus.Addressing-005, -006 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | Driver.Modbus.Addressing-001, -007 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.Modbus.Addressing-008 |
| 10 | Documentation & comments | Driver.Modbus.Addressing-009 |
## Findings
### Driver.Modbus.Addressing-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `ModbusAddressParser.cs:230-235`, `DirectLogicAddress.cs:66-73` |
| Status | Resolved |
**Description:** The DL205 family-native branch routes every V-prefixed address through
`DirectLogicAddress.UserVMemoryToPdu`, which is a plain octal-to-decimal conversion. DL205/DL260
system V-memory (V40400 and up) is NOT a simple octal decode — per `docs/v2/dl205.md` section
V-Memory, V40400 must map to Modbus PDU 0x2100 (decimal 8448) on a factory-mode ECOM module.
The parser instead octal-decodes V40400 to decimal 16640 (0x4100), the wrong register. The
`DirectLogicAddress.SystemVMemoryToPdu` / `SystemVMemoryBasePdu` helper that exists to do this
correctly is never called by the parser, so from the parser's point of view it is dead code. A tag
spreadsheet that addresses any DL system register through the grammar string silently reads and
writes the wrong PLC memory. The companion test `ModbusFamilyParserTests.cs:20` bakes the wrong
value (V40400 to 16640) into a passing assertion, so the regression is locked in.
**Recommendation:** Make the DL205 V branch detect the system bank (octal address >= 40400) and
route it through `SystemVMemoryToPdu`, or explicitly reject system V-memory in the grammar string
with a diagnostic pointing at the structured tag form. Either way, fix the V40400 test to assert
the corrected mapping.
**Resolution:** Resolved 2026-05-22 — added `DirectLogicAddress.VMemoryToPdu`, which detects the
system bank (octal >= V40400) and relocates it through `SystemVMemoryToPdu` to PDU 0x2100; the
DL205 V branch in `ModbusAddressParser` now calls it, and the `ModbusFamilyParserTests` V40400
assertion was corrected from 16640 to 0x2100 with system-bank regression cases added.
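
The arithmetic behind this finding can be checked with a short sketch. It is written in Python for illustration and is not the C# helper. The function names are hypothetical, and the linear offset within the system bank (V40401 mapping to 0x2101, and so on) is an assumption: the finding only states that V40400 maps to PDU 0x2100.

```python
SYSTEM_BANK_START = 0o40400     # V40400, first system V-memory address (octal)
SYSTEM_BANK_BASE_PDU = 0x2100   # PDU the finding assigns to V40400


def naive_octal_decode(v_address: str) -> int:
    """Plain octal-to-decimal decode, which the finding says is right for user V-memory only."""
    return int(v_address.lstrip("Vv"), 8)


def v_memory_to_pdu(v_address: str) -> int:
    """Octal decode that relocates the system bank (linear offset is an assumption)."""
    value = naive_octal_decode(v_address)
    if value >= SYSTEM_BANK_START:
        return SYSTEM_BANK_BASE_PDU + (value - SYSTEM_BANK_START)
    return value


wrong = naive_octal_decode("V40400")   # 16640 (0x4100), the value the old test asserted
right = v_memory_to_pdu("V40400")      # 8448 (0x2100), the documented mapping
```

User-bank addresses such as V2000 are unaffected: both functions return 1024 for them.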
### Driver.Modbus.Addressing-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `ModbusAddressParser.cs:86-94` |
| Status | Resolved |
**Description:** In the 3-field disambiguation, an empty 3rd field (`40001:F:`) reaches
`parts[2].All(char.IsDigit)`. `Enumerable.All` returns true for an empty sequence, so the empty
string is classified as a valid-shaped array count, assigned to `countPart`, then silently dropped
by the later `string.IsNullOrEmpty(countPart)` guard. The result is that `40001:F:` parses
successfully as a plain scalar with a dangling empty field rather than being rejected as
malformed. The 4-field form `40001:F::` has the analogous effect. A user who mistypes a trailing
colon gets no diagnostic.
**Recommendation:** Reject an empty 3rd field explicitly, or guard the `All(char.IsDigit)` branch
with `parts[2].Length > 0`.
**Resolution:** Resolved 2026-05-22 — added an explicit `parts[2].Length == 0` check before the `All(char.IsDigit)` branch that returns a descriptive error, so a trailing colon typo produces a diagnostic instead of silently parsing as a scalar.
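
The root cause is vacuous truth: an "all items satisfy X" check over an empty sequence is true. Python's `all` behaves the same way as `Enumerable.All` in this respect, so the pitfall and the guard can be sketched as below. `classify_third_field` and its return strings are hypothetical and only illustrate the ordering of the checks.

```python
def classify_third_field(field: str) -> str:
    """Classify the 3rd field of an address string, rejecting the empty case first."""
    if len(field) == 0:
        return "error: empty field (trailing colon?)"
    if all(ch.isdigit() for ch in field):
        return "array-count"
    return "other"


# Without the length guard, the empty string passes the all-digits test.
vacuous = all(ch.isdigit() for ch in "")
```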
### Driver.Modbus.Addressing-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `ModbusAddressParser.cs:405-406`, `ModbusAddressParser.cs:128` |
| Status | Resolved |
**Description:** `LooksLikeByteOrderToken` classifies any 4-letter token as a byte-order token.
A 3-field address whose 3rd field is a 4-letter type-like token (e.g. `40001:S:BOOL`) is routed
into `TryParseByteOrder`, producing the misleading diagnostic "Unknown byte order BOOL" instead
of telling the user the type belongs in field 2. The type code BOOL is exactly 4 letters and
could only ever be intended as a type — the shape heuristic cannot tell a mistyped type from a
byte order, so the diagnostic actively misdirects.
**Recommendation:** When `TryParseByteOrder` fails on a 4-letter token in the 3-field form, widen
the error message to mention that field 3 is a byte order and field 2 is the type, or attempt a
type-parse fallback before emitting the byte-order error.
**Resolution:** Resolved 2026-05-22 — in the 3-field disambiguation error path, a 4-letter alphanumeric token that looks like a type code now produces a diagnostic explicitly stating that field 3 is the byte-order slot and field 2 is the type slot, directing the user to the correct fix.
### Driver.Modbus.Addressing-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `ModbusAddressParser.cs:182-194` |
| Status | Resolved |
**Description:** The bit suffix is stripped using `text.IndexOf('.')` — the first dot. An input
such as `40001.5.3` produces a bit text of "5.3", rejected by `byte.TryParse` with the generic
"Bit index must be 0..15" message. A Modicon-style decimal-point typo like `400.01` is silently
split into region/offset 400 plus bit 01. Because 01 is a valid bit index, the bit parse succeeds
and the error surfaces only later, when 400 fails Modicon length validation: the user sees the
Modicon length diagnostic rather than a bit-index diagnostic. The dot handling assumes a single
dot without asserting it, and the diagnostics for these malformed inputs are inconsistent.
**Recommendation:** Use `LastIndexOf('.')` or assert exactly one dot, and validate that the
region/offset segment is non-empty and dot-free after the strip so malformed inputs get a precise
diagnostic.
**Resolution:** Resolved 2026-05-22 — switched to `LastIndexOf('.')`, added a non-empty guard for the address segment before the dot, and added a check that the address segment itself contains no dot (diagnosing multi-dot inputs with "contains multiple dots" rather than a confusing bit-index error).
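
A minimal sketch of the last-dot split with the two guards named in the resolution, written in Python for illustration. `split_bit_suffix` and its error strings are hypothetical and do not reproduce the parser's actual messages.

```python
from typing import Optional, Tuple


def split_bit_suffix(text: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (address, bit_text, error); exactly one of address/error is set."""
    if "." not in text:
        return text, None, None
    # Split on the LAST dot so the bit suffix is always the final segment.
    address, _, bit_text = text.rpartition(".")
    if address == "":
        return None, None, "address segment before '.' is empty"
    if "." in address:
        return None, None, "address contains multiple dots"
    return address, bit_text, None


# First-dot split, for comparison: the bit text swallows the second dot.
first_dot_bit_text = "40001.5.3".split(".", 1)[1]
```

With the last-dot split, `40001.5.3` is rejected for its shape rather than reported as a bad bit index.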
### Driver.Modbus.Addressing-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `ModbusAddressParser.cs:200-213` |
| Status | Resolved |
**Description:** `TryParseRegionAndOffset` tries family-native, then mnemonic, then Modicon. When
all three fail it returns false with whatever error the Modicon parser last wrote (comment: "the
Modicon error is the more specific diagnostic"). For a non-Generic family this is misleading:
`TryParseFamilyNative` returns false with error left null for any address that does not start with
a recognised family prefix, and even for recognised prefixes it only sets error inside the catch.
The subsequent mnemonic and Modicon attempts overwrite error. Net effect: a clearly
family-native-shaped input that fails deep in the family helper can still surface a generic
Modicon "must be 5 or 6 digits" error, hiding the real cause (e.g. "contains non-octal digit").
**Recommendation:** When a non-Generic family is configured and the input matches a family
prefix, prefer and preserve the family-native error rather than letting the Modicon fallback
overwrite it.
**Resolution:** Resolved 2026-05-22 — the family-native error is now captured in `familyNativeError` and, after all three branches fail, preferred over the Modicon fallback error when it is non-null (indicating the address matched a family prefix but failed deep inside the helper).
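
The error-preference rule can be sketched as follows, in Python for illustration. The three parser stubs and `parse_with_fallback` are hypothetical stand-ins for the family-native, mnemonic, and Modicon branches. The only behaviour taken from the finding is that a non-null family-native error wins over the Modicon fallback error.

```python
from typing import Callable, Optional, Tuple

ParseResult = Tuple[bool, Optional[str]]  # (success, error message)
Parser = Callable[[str], ParseResult]


def parse_with_fallback(text: str, family: Parser, mnemonic: Parser, modicon: Parser) -> ParseResult:
    ok, family_error = family(text)
    if ok:
        return True, None
    ok, _ = mnemonic(text)
    if ok:
        return True, None
    ok, modicon_error = modicon(text)
    if ok:
        return True, None
    # A non-null family error means the input matched a family prefix and
    # failed inside the helper, so it is the more specific diagnostic.
    return False, family_error if family_error is not None else modicon_error


def fake_family(text: str) -> ParseResult:
    if not text.startswith("V"):
        return False, None  # no family prefix: leave the error unset
    return False, "contains non-octal digit"


def fake_mnemonic(text: str) -> ParseResult:
    return False, "unknown mnemonic"


def fake_modicon(text: str) -> ParseResult:
    return False, "must be 5 or 6 digits"


family_shaped = parse_with_fallback("V409", fake_family, fake_mnemonic, fake_modicon)
other_input = parse_with_fallback("4x", fake_family, fake_mnemonic, fake_modicon)
```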
### Driver.Modbus.Addressing-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `ModbusAddressParser.cs:297-301` |
| Status | Open |
**Description:** `TryParseFamilyNative` catches only `ArgumentException` and `OverflowException`.
The current helpers throw only those (including `ArgumentOutOfRangeException`, which derives from
`ArgumentException`), so today it is correct. But the parser intent is to convert helper
exceptions into structured errors; any future helper change that throws a different exception type
(e.g. a `FormatException` from a `ushort.Parse` swap) would escape as an unhandled exception out
of a `TryParse` method, violating the try-parse contract that config-bind hot-path callers
depend on.
**Recommendation:** Either document the exact exception contract of the helpers and keep the
narrow catch, or broaden to a general catch-all that records the message — a try-parse method
should never throw.
**Resolution:** _(open)_
### Driver.Modbus.Addressing-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `ModbusDataType.cs:91-95`, `docs/v2/dl205.md` section Strings |
| Status | Open |
**Description:** `ModbusStringByteOrder` (HighByteFirst / LowByteFirst) is defined in this
assembly and documented as the DL205 low-byte-first string-packing knob, but `ParsedModbusAddress`
has no field for it and `ModbusAddressParser` never produces or consumes it. The `STR<n>` grammar
form cannot express the DL205 string byte order described in `docs/v2/dl205.md` — a DL205 string
tag parsed from the grammar string always carries the default order. The enum is effectively
unreachable from the parser, so the grammar cannot represent a known, documented device quirk.
**Recommendation:** Either add a `StringByteOrder` field to `ParsedModbusAddress` plus a grammar
token for it, or document explicitly that DL205 string byte order is only configurable via the
structured tag form and is intentionally out of grammar scope.
**Resolution:** _(open)_
### Driver.Modbus.Addressing-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Addressing.Tests/` |
| Status | Resolved |
**Description:** Several edge cases of the address arithmetic are untested or asserted wrong:
(a) DL205 system V-memory mapping is tested only with the incorrect expected value
(`ModbusFamilyParserTests.cs:20`, see finding -001); (b) there is no test for `UserVMemoryToPdu`
or `AddOctalOffset` overflow (V200000, C200000) hitting the `OverflowException` path; (c) no test
for the empty-trailing-field cases of finding -002; (d) `MelsecAddress.ParseHex` overflow and
`DRegisterToHolding` / `MRelayToCoil` bank-base overflow are untested; (e) no test that
`SystemVMemoryToPdu` is exercised at all. The address-arithmetic overflow and off-by-one paths
are exactly the high-risk surface this module owns, and they are the least covered.
**Recommendation:** Add overflow/boundary tests for every PDU/coil/discrete translation helper
and for the parser count/bit/field edge cases. Correct the V40400 assertion as part of fixing
finding -001.
**Resolution:** Resolved 2026-05-22 — added `ModbusAddressEdgeCaseTests.cs` covering: empty 3rd-field rejection, multi-dot input rejection, `UserVMemoryToPdu` overflow, `AddOctalOffset` overflow via Y and C helpers, `SystemVMemoryToPdu` base/overflow, `MelsecAddress.ParseHex` overflow, `DRegisterToHolding` and `MRelayToCoil` bank-base overflow.
### Driver.Modbus.Addressing-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `ModbusModiconAddress.cs:55-64`, `ModbusModiconAddress.cs:104-110` |
| Status | Open |
**Description:** The comments on `ModbusModiconAddress.TryParse` are slightly inaccurate. The
remark that 5-digit Modicon is always exactly 5 chars (40001..49999) and 6-digit is exactly 6
(400001..465536-shaped) implies the leading digit is always 4, but the parser accepts leading
0/1/3 too — a 5-digit coil is 00001..09999, not 40001..49999. Separately, the line-106 comment
says the 5-digit form caps at 9999 by construction while the adjacent code path applies the same
`> 65536` check to both forms; the comment describes an invariant the code does not rely on.
**Recommendation:** Reword the range examples to cover all four region digits and drop the
caps-at-9999 aside or restate it as a precise statement about trailing-digit count.
**Resolution:** _(open)_
@@ -0,0 +1,234 @@
# Code Review — Driver.Modbus.Cli
| Field | Value |
|---|---|
| Module | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 6 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Modbus.Cli-001, Driver.Modbus.Cli-002, Driver.Modbus.Cli-003 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Driver.Modbus.Cli-004 |
| 4 | Error handling & resilience | Driver.Modbus.Cli-005, Driver.Modbus.Cli-006 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | Driver.Modbus.Cli-007 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.Modbus.Cli-008 |
| 10 | Documentation & comments | No issues found |
## Findings
### Driver.Modbus.Cli-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/SubscribeCommand.cs:43-51` |
| Status | Resolved |
**Description:** `SubscribeCommand` synthesises its `ModbusTagDefinition` with only
`Name`, `Region`, `Address`, `DataType`, `Writable`, and `ByteOrder` — it never
exposes or passes `--bit-index`, `--string-length`, or `--string-byte-order`.
A user running `subscribe -t BitInRegister` always watches bit 0 regardless of
intent, and `subscribe -t String` runs with `StringLength = 0`. The doc
(`docs/Driver.Modbus.Cli.md`) lists `BitInRegister`, `String`, `Bcd16`, `Bcd32`
in the `subscribe` `--type` help text, so these types are advertised as supported
but cannot be used correctly. `read` and `write` both expose all three flags;
`subscribe` is the odd one out.
**Recommendation:** Add `--bit-index`, `--string-length`, and `--string-byte-order`
options to `SubscribeCommand` (mirroring `ReadCommand`) and pass them into the
`ModbusTagDefinition`, or trim the `--type` help text to the types `subscribe`
actually supports and reject `BitInRegister` / `String` at command entry with a
clear message.
**Resolution:** Resolved 2026-05-22 — added `--bit-index`, `--string-length`, and `--string-byte-order` options to `SubscribeCommand`, mirroring `ReadCommand`, and passed them through to `ModbusTagDefinition` so `BitInRegister` and `String` types subscribe correctly.
### Driver.Modbus.Cli-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/WriteCommand.cs:54-89` |
| Status | Resolved |
**Description:** `WriteCommand` rejects read-only regions (`DiscreteInputs` /
`InputRegisters`) but does not validate that `--type` is meaningful for the
`Coils` region. `write -r Coils -a 5 -t UInt16 -v 42` builds a `Coils` tag with
`DataType = UInt16`; the value parses to a boxed `ushort`, and the driver's
`WriteOneAsync` coil branch calls `Convert.ToBoolean(value)` which succeeds for
any non-zero `ushort` (yields `true`). The write silently lands as a coil ON with
no diagnostic, even though the operator asked for a 16-bit register write. A coil
region only supports `Bool`-style boolean values.
**Recommendation:** After the read-only-region check, reject `Region == Coils`
combined with any non-boolean `--type` (anything other than `Bool`), with a
message explaining coils carry a single bit.
**Resolution:** Resolved 2026-05-22 — added a `Region == Coils && DataType != Bool` check immediately after the read-only-region guard, throwing `CommandException` with a message explaining that coils carry a single bit and only `--type Bool` is valid.
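
The guard ordering can be sketched as below, in Python for illustration. `check_write_target` and its messages are hypothetical, and the real command throws `CommandException` rather than returning a string. The final line shows the underlying hazard: like `Convert.ToBoolean` on a non-zero `ushort`, Python's `bool(42)` is true, so a numeric value silently becomes coil ON.

```python
from typing import Optional

READ_ONLY_REGIONS = {"DiscreteInputs", "InputRegisters"}


def check_write_target(region: str, data_type: str) -> Optional[str]:
    """Return an error message for an invalid write target, or None if it is valid."""
    if region in READ_ONLY_REGIONS:
        return f"Region {region} is read-only."
    if region == "Coils" and data_type != "Bool":
        return "Coils carry a single bit; only --type Bool is valid."
    return None


coil_error = check_write_target("Coils", "UInt16")
silently_true = bool(42)  # a non-zero number coerces to True
```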
### Driver.Modbus.Cli-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/ModbusCommandBase.cs:14-24` |
| Status | Open |
**Description:** `Port` (`int`) and `TimeoutMs` (`int`) accept any 32-bit value,
including negatives and ports above 65535. `UnitId` is a `byte`, so it accepts
0-255 even though the option description and `docs/Driver.Modbus.Cli.md` both say
the valid range is 1-247 (0 is the Modbus broadcast address; 248-255 are
reserved). A negative `--timeout-ms` becomes a negative `TimeSpan` passed straight
into the driver; an out-of-range `--port` fails later with an opaque socket
error. None of these are validated at parse time.
**Recommendation:** Validate `Port` (1-65535), `TimeoutMs` (greater than 0), and
`UnitId` (1-247) at the top of each command's `ExecuteAsync` (or in
`ModbusCommandBase`), throwing `CliFx.Exceptions.CommandException` with a clear
message — consistent with how `WriteCommand` already rejects bad regions and
boolean strings.
**Resolution:** _(open)_
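
One possible shape for the recommended parse-time checks, sketched in Python for illustration. `validate_connection_options` and its message wording are hypothetical. The ranges are the ones stated in the finding.

```python
from typing import List


def validate_connection_options(port: int, timeout_ms: int, unit_id: int) -> List[str]:
    """Return one message per out-of-range option; an empty list means all are valid."""
    errors: List[str] = []
    if not 1 <= port <= 65535:
        errors.append(f"--port must be 1-65535, got {port}")
    if timeout_ms <= 0:
        errors.append(f"--timeout-ms must be greater than 0, got {timeout_ms}")
    if not 1 <= unit_id <= 247:
        # 0 is the broadcast address and 248-255 are reserved.
        errors.append(f"--unit-id must be 1-247, got {unit_id}")
    return errors


problems = validate_connection_options(port=70000, timeout_ms=-5, unit_id=0)
```

Collecting all messages before failing lets the operator fix every bad flag in one pass.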
### Driver.Modbus.Cli-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/SubscribeCommand.cs:61-67` |
| Status | Open |
**Description:** The `OnDataChange` handler is invoked from the driver's
`PollGroupEngine` background thread and calls `console.Output.WriteLine`
synchronously. An exception thrown inside this handler (e.g. an `IOException` on a
redirected or closed stdout) propagates on the poll-engine thread and is not
caught — it could fault the background loop. For a long-running `subscribe` this
is a real, if low-probability, crash path. Output lines are also written without
any synchronization, so overlapping poll ticks could interleave partial lines.
**Recommendation:** Wrap the handler body in a `try/catch` that swallows or logs
write failures so a transient console-write error cannot tear down the poll loop.
A single `lock` around the write also removes the interleave risk.
**Resolution:** _(open)_
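
The recommended handler shape, a lock around the write plus a catch for write failures, can be sketched as follows. This is Python for illustration only. `on_data_change`, `BrokenStream`, and the sample lines are hypothetical, and `OSError` stands in for the `IOException` the finding mentions.

```python
import io
import threading

_console_lock = threading.Lock()


class BrokenStream:
    """Simulates a closed or redirected stdout that fails on write."""

    def write(self, text: str) -> int:
        raise OSError("stdout closed")


def on_data_change(stream, line: str) -> bool:
    """Write one full line under a lock; report failure instead of raising."""
    try:
        with _console_lock:
            stream.write(line + "\n")
        return True
    except (OSError, ValueError):
        # Swallow the write failure so the polling thread keeps running.
        return False


buffer = io.StringIO()
ok = on_data_change(buffer, "HR[0] = 42")
failed = on_data_change(BrokenStream(), "HR[0] = 43")
```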
### Driver.Modbus.Cli-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/ProbeCommand.cs:21-54`; `Commands/ReadCommand.cs:46-75`; `Commands/WriteCommand.cs:54-89` |
| Status | Open |
**Description:** All three commands call `ConfigureLogging()` then
`console.RegisterCancellationHandler()`, but if the operator presses Ctrl+C
before `InitializeAsync` completes, the resulting `OperationCanceledException`
propagates out of `ExecuteAsync` unhandled. CliFx renders unhandled
non-`CommandException` exceptions as a full stack trace, which is noisy for what
is just a user-cancelled run. `SubscribeCommand` correctly catches
`OperationCanceledException` around its `Task.Delay`, but the connect/read/write
commands do not catch it around their driver calls.
**Recommendation:** Either let cancellation surface a clean message (catch
`OperationCanceledException` in each command and exit quietly) or document that
the noisy trace on Ctrl+C-during-connect is acceptable. Consistency with
`SubscribeCommand`'s handling is the cleaner choice.
**Resolution:** _(open)_
### Driver.Modbus.Cli-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/ProbeCommand.cs:35-53` |
| Status | Open |
**Description:** `probe` reports `Health: {health.State}` from `GetHealth()`.
After a successful `InitializeAsync` the driver sets state to `Healthy`
regardless of whether the subsequent probe register read returns Good or a Bad
status code. `ReadAsync` does not throw on a Modbus exception response — it
returns a `DataValueSnapshot` with a Bad `StatusCode`. So `probe` against a host
that accepts the TCP connection but rejects FC03 at the probe address prints
`Health: Healthy` while the snapshot line below shows a Bad status. The two lines
disagree, and the headline `Health` value (the thing an operator scans first)
overstates success. The doc bills `probe` as the "is the PLC up + talking Modbus"
check, which the bare `Healthy` does not actually confirm.
**Recommendation:** Have `probe` derive its headline verdict from the probe
snapshot's `StatusCode` (Good vs Bad) rather than — or in addition to — the driver
`State`, or print a single combined verdict line so the two cannot contradict each
other.
**Resolution:** _(open)_
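
A single combined verdict line, as the recommendation suggests, could be derived along these lines. The sketch is in Python for illustration. `probe_verdict` and its output strings are hypothetical and are not the CLI's actual output format.

```python
def probe_verdict(driver_state: str, snapshot_is_good: bool) -> str:
    """Fold driver state and probe-read status into one headline verdict."""
    if driver_state == "Healthy" and snapshot_is_good:
        return "OK: connected and probe read returned Good"
    if driver_state == "Healthy":
        return "DEGRADED: connected, but probe read returned a Bad status"
    return f"FAILED: driver state is {driver_state}"


# Host accepts the TCP connection but rejects the probe read.
headline = probe_verdict("Healthy", snapshot_is_good=False)
```

Because there is only one headline, the state line and the snapshot line can no longer contradict each other.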
### Driver.Modbus.Cli-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `docs/Driver.Modbus.Cli.md:124-156`; `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/ReadCommand.cs` |
| Status | Open |
**Description:** `docs/Driver.Modbus.Cli.md` devotes a whole "v2 addressing
grammar" section to the industry-standard tag-address strings (`40001:F:CDAB`,
`HR1:I`, `C100`, `V2000:F:CDAB`, etc.) and says "set the per-tag `addressString`
field instead of the structured `region` + `address` + `dataType` fields." None of
the CLI commands expose an `--address-string` (or equivalent) flag — `read`,
`write`, and `subscribe` only accept the structured `--region` + `--address` +
`--type` triple. The documented address-string grammar is reachable only through a
hand-written `DriverConfig` JSON, not through this CLI. The doc reads as if the CLI
supports it.
**Recommendation:** Either add an `--address-string` option that feeds the
driver's address-string parser (and `--family` for the DL205/MELSEC native
syntax), or scope the "v2 addressing grammar" section of the doc to note it
applies to `DriverConfig` JSON and is not a CLI flag.
**Resolution:** _(open)_
### Driver.Modbus.Cli-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli.Tests/` |
| Status | Open |
**Description:** The test project covers only the two pure-function seams:
`ReadCommand.SynthesiseTagName` and `WriteCommand.ParseValue`. There is no coverage
for `WriteCommand`'s read-only-region rejection (`Region is not (Coils or
HoldingRegisters)`), no test for `ModbusCommandBase.BuildOptions` (e.g. that
`Probe.Enabled` is `false` and `AutoReconnect` tracks `--disable-reconnect`), and
no test asserting unsupported write types throw. The branch logic in
`WriteCommand.ExecuteAsync` and `ModbusCommandBase.BuildOptions` is the part most
likely to regress and is currently untested. The validation gaps in findings
002/003 are also untested precisely because no test exercises that path.
**Recommendation:** Add tests for `WriteCommand`'s region-validation branch and for
`ModbusCommandBase.BuildOptions` (construct a command instance via the `init`
setters and assert the produced `ModbusDriverOptions`). Once findings 002/003 are
fixed, add tests for the new validation paths.
**Resolution:** _(open)_
@@ -0,0 +1,207 @@
# Code Review — Driver.Modbus
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Modbus` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 7 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Modbus-002, Driver.Modbus-005, Driver.Modbus-009 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Driver.Modbus-001, Driver.Modbus-003 |
| 4 | Error handling & resilience | Driver.Modbus-006, Driver.Modbus-010 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Driver.Modbus-004 |
| 7 | Design-document adherence | Driver.Modbus-007 |
| 8 | Code organization & conventions | Driver.Modbus-011 |
| 9 | Testing coverage | Driver.Modbus-012 |
| 10 | Documentation & comments | Driver.Modbus-008 |
## Findings
### Driver.Modbus-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `ModbusDriver.cs:92,99-122` |
| Status | Resolved |
**Description:** `_lastPublishedByRef` is a plain `Dictionary<string, object>` mutated inside `ShouldPublish`, which runs on the `PollGroupEngine.onChange` callback. `PollGroupEngine` runs one background `Task` per subscription (`PollGroupEngine.cs:64`), so a driver with two or more subscriptions invokes `onChange` — and therefore `ShouldPublish` — concurrently on separate threads. `ShouldPublish` does `TryGetValue` and indexer writes on the unsynchronized dictionary (`ModbusDriver.cs:108`, `112`, `120`). Concurrent reads/writes of a non-thread-safe `Dictionary` can corrupt internal state, drop entries, or throw `IndexOutOfRangeException`/`InvalidOperationException`, crashing the poll loop. The sibling cache `_lastWrittenByRef` is correctly guarded by `_lastWrittenLock` — only the deadband cache was left unprotected.
**Recommendation:** Guard `_lastPublishedByRef` with a dedicated lock around every access in `ShouldPublish`, or switch it to `ConcurrentDictionary<string, object>` and use `AddOrUpdate`/`TryGetValue`.
**Resolution:** Resolved 2026-05-22 — switched `_lastPublishedByRef` to `ConcurrentDictionary<string, object>` so the `TryGetValue`/indexer-write accesses in `ShouldPublish` are thread-safe under concurrent multi-subscription `onChange` callbacks; added a concurrent-deadband-subscription regression test.
### Driver.Modbus-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `ModbusDriver.cs:127-186` |
| Status | Resolved |
**Description:** `ShutdownAsync` never clears `_tagsByName`, and `InitializeAsync` repopulates it with `_tagsByName[t.Name] = t` (`ModbusDriver.cs:134`) without clearing first. `ReinitializeAsync` calls `ShutdownAsync` then `InitializeAsync`. Because `_options.Tags` is fixed for a driver instance, the same set re-inserts harmlessly today — but the asymmetry is a latent bug: any future path that re-runs init with a different tag set leaves stale tag entries that resolve reads/writes against deleted nodes. `_lastPublishedByRef` and `_lastWrittenByRef` similarly survive a Reinitialize, retaining deadband/write-suppression baselines against the old config, while `_autoProhibited` *is* deliberately cleared (`ModbusDriver.cs:179`) — the inconsistency shows the clearing was simply overlooked.
**Recommendation:** Clear `_tagsByName`, `_lastPublishedByRef`, and `_lastWrittenByRef` in `ShutdownAsync` (or at the top of `InitializeAsync`) so a Reinitialize starts from a clean state, consistent with the existing `_autoProhibited.Clear()`.
**Resolution:** Resolved 2026-05-22 — added `_tagsByName.Clear()`, `_lastPublishedByRef.Clear()`, and `_lastWrittenByRef.Clear()` to `ShutdownAsync` (via the new shared `TeardownAsync` helper) so a `ReinitializeAsync` cycle always starts from a clean state, consistent with the existing `_autoProhibited.Clear()`.
### Driver.Modbus-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `ModbusDriver.cs:59,188,241,259,266,726,745,759` |
| Status | Open |
**Description:** `_health` is a non-`volatile` reference field written from multiple threads (concurrent `ReadAsync` callers, the coalesced-read path, `WriteAsync` indirectly, and `ProbeLoopAsync`) and read by `GetHealth()`. Reference assignment is atomic on .NET so a torn read cannot occur, but there is no happens-before ordering: a stale `DriverHealth` can be observed on another core, and concurrent writers race so "last write wins" is non-deterministic (a `Degraded` write from a failed read can clobber a just-published `Healthy`, or vice versa).
**Recommendation:** Mark `_health` `volatile`, or assign via `Volatile.Write` and read with `Volatile.Read`, to give `GetHealth()` a defined ordering guarantee.
**Resolution:** _(open)_
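The recommendation above can be sketched as follows. This is a minimal illustration, not the driver's code: `DriverHealth` is reduced to a two-field record and `HealthHolder` is a hypothetical stand-in for the part of `ModbusDriver` that owns `_health`.

```csharp
using System.Threading;

// Hypothetical, simplified stand-in for the driver's health snapshot type.
public sealed record DriverHealth(string State, string? Message);

// Hypothetical holder illustrating the publish/observe pattern for _health.
public sealed class HealthHolder
{
    private DriverHealth _health = new("Unknown", null);

    // Release-ordered store: a reader using Volatile.Read sees a fully constructed snapshot.
    public void Publish(DriverHealth next) => Volatile.Write(ref _health, next);

    // Acquire-ordered load: the reader observes the most recently published reference.
    public DriverHealth GetHealth() => Volatile.Read(ref _health);
}
```

This gives `GetHealth()` a defined ordering but does not arbitrate between concurrent writers; "last write wins" still applies, so a sequence number or a lock would be needed if writer ordering matters.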
### Driver.Modbus-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `ModbusDriver.cs:1468-1473` |
| Status | Resolved |
**Description:** `DisposeAsync()` only disposes `_transport`. Unlike `ShutdownAsync`, it does not cancel/dispose `_probeCts` or `_reprobeCts`, nor dispose `_poll` (the `PollGroupEngine`). A caller that uses `await using` or `using` without first calling `ShutdownAsync` leaks the probe loop, the re-probe loop, and every active polled subscription background `Task`/`CancellationTokenSource`. The two `Task.Run` loops keep running against a disposed transport, throwing on every tick. `Dispose()` (sync) has the same gap and additionally blocks on the async path via `GetAwaiter().GetResult()`.
**Recommendation:** Make `DisposeAsync` perform the same teardown as `ShutdownAsync` (cancel both CTSs, dispose them, dispose `_poll`) before disposing `_transport`. Have `ShutdownAsync` and `DisposeAsync` share a private `TeardownAsync` helper.
**Resolution:** Resolved 2026-05-22 — refactored teardown into a shared `TeardownAsync` helper; `DisposeAsync` now delegates to it, cancelling both CTS objects, disposing `_poll`, and disposing `_transport` — matching `ShutdownAsync` and eliminating the probe/re-probe/poll-engine leak on `await using` callers.
### Driver.Modbus-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `ModbusDriver.cs:777-798,323-330` |
| Status | Resolved |
**Description:** `ReadRegisterBlockAsync` and `ReadBitBlockAsync` index `resp[1]` and call `Buffer.BlockCopy(resp, 2, ..., resp[1])` with no bounds validation. `ModbusTcpTransport.SendOnceAsync` validates only the MBAP length field and the exception high-bit — it does not guarantee a non-exception response PDU is long enough to hold function-code + byte-count + the claimed data. A device (or buggy server) returning a 1-byte PDU, or a byte-count larger than the actual payload, produces an `IndexOutOfRangeException`/`ArgumentException` rather than a clean comms error. `DecodeBitArray` similarly indexes `bitmap[0]` (`ModbusDriver.cs:325`) without checking the bitmap is non-empty. In `ReadAsync` these are caught by the catch-all and mapped to `BadCommunicationError`, so impact is limited; in `ReadCoalescedAsync` the exception is opaque to the narrower catch arms.
**Recommendation:** In `ReadRegisterBlockAsync`/`ReadBitBlockAsync`, validate `resp.Length >= 2` and `resp.Length >= 2 + resp[1]` before slicing, throwing a descriptive `InvalidDataException`. Validate the decoded byte/bit count matches the request quantity.
**Resolution:** Resolved 2026-05-22 — added `resp.Length >= 2`, `resp.Length >= 2 + resp[1]`, and byte-count-vs-quantity checks in both `ReadRegisterBlockAsync` and `ReadBitBlockAsync`, throwing `InvalidDataException` with precise diagnostics; added an empty-bitmap guard in `DecodeBitArray`.
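The three checks named in the resolution can be sketched as a single guard. The class and method names (`ModbusPduGuard`, `ExtractPayload`) and the `expectedByteCount` parameter are assumptions for illustration; the actual helpers in `ModbusDriver.cs` are not shown in this review.

```csharp
using System;
using System.IO;

public static class ModbusPduGuard
{
    // resp is a read-response PDU: [0] function code, [1] byte count, [2..] data.
    // expectedByteCount is what the request quantity implies (for example 2 * registers).
    public static byte[] ExtractPayload(byte[] resp, int expectedByteCount)
    {
        if (resp.Length < 2)
            throw new InvalidDataException(
                $"Response PDU too short: {resp.Length} byte(s), need at least 2.");

        int byteCount = resp[1];
        if (resp.Length < 2 + byteCount)
            throw new InvalidDataException(
                $"Byte count {byteCount} exceeds the {resp.Length - 2} payload byte(s) received.");

        if (byteCount != expectedByteCount)
            throw new InvalidDataException(
                $"Byte count {byteCount} does not match the {expectedByteCount} byte(s) requested.");

        // Safe to slice: all three bounds have been validated above.
        return resp.AsSpan(2, byteCount).ToArray();
    }
}
```

Throwing `InvalidDataException` keeps a malformed frame distinguishable from an index-range crash, so the caller can map it to a comms error deliberately.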
### Driver.Modbus-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `ModbusDriver.cs:514-524,532-550` |
| Status | Resolved |
**Description:** `RunReprobeOnceForTestAsync` reads `_transport` once at the top (`var transport = _transport ?? throw ...`). If `ShutdownAsync` runs (setting `_transport = null` and disposing it) while a re-probe pass is mid-iteration, the loop keeps issuing reads against the captured, disposed transport. `ReprobeLoopAsync` only catches `OperationCanceledException when (ct.IsCancellationRequested)` — an `ObjectDisposedException` from the disposed transport escapes `RunReprobeOnceForTestAsync` and faults the fire-and-forget background `Task`, silently killing the re-probe loop with the wrong failure mode.
**Recommendation:** Re-check `_transport`/cancellation inside the per-candidate loop, or broaden the `ReprobeLoopAsync` catch to also swallow `ObjectDisposedException` when `ct.IsCancellationRequested`.
**Resolution:** Resolved 2026-05-22 — broadened `ReprobeLoopAsync` to catch `ObjectDisposedException when (ct.IsCancellationRequested)` and return cleanly, so a transport disposal race during shutdown exits the background task rather than faulting it.
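A minimal sketch of the broadened catch, assuming a loop shape similar to the one described. `ReprobeLoopSketch`, `runOnce`, and `interval` are hypothetical names; only the two exception filters reflect the resolution text.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ReprobeLoopSketch
{
    // runOnce stands in for one re-probe pass; interval is the delay between passes.
    public static async Task RunAsync(
        Func<CancellationToken, Task> runOnce, TimeSpan interval, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            try
            {
                await runOnce(ct).ConfigureAwait(false);
                await Task.Delay(interval, ct).ConfigureAwait(false);
            }
            catch (OperationCanceledException) when (ct.IsCancellationRequested)
            {
                return; // normal shutdown
            }
            catch (ObjectDisposedException) when (ct.IsCancellationRequested)
            {
                return; // transport disposed by a concurrent shutdown: exit cleanly
            }
        }
    }
}
```

The `when` filter matters: an `ObjectDisposedException` that occurs while the token is not cancelled still propagates, so a genuine use-after-dispose bug is not masked.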
### Driver.Modbus-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `ModbusDriver.cs:1392`, `ModbusDriverOptions.cs:74-80` |
| Status | Open |
**Description:** Two design-vs-code drifts. (1) `MapDataType` maps `Int64`/`UInt64` to `DriverDataType.Int32` with the inline comment "widening to Int32 loses precision; PR 25 adds Int64 to DriverDataType". The address-space node for a 64-bit Modbus tag is declared `Int32`, misrepresenting the OPC UA variable's `DataType` even though `DecodeRegister` produces a correct `long`/`ulong` value — clients see a type/value mismatch. (2) `DisableFC23` is documented and bound from JSON but is a confirmed no-op ("The driver does not currently emit FC23"). Both are acknowledged-but-unfinished items worth tracking.
**Recommendation:** Track the PR 25 `DriverDataType.Int64` follow-up; until then document the Int32 surfacing limitation in `docs/v2/modbus-addressing.md` so operators configuring `I_64`/`UI_64` tags understand the node type. Mark `DisableFC23` clearly as reserved/unimplemented or gate it once FC23 ships.
**Resolution:** _(open)_
### Driver.Modbus-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `ModbusDriver.cs:411-417,700-703,737-744` |
| Status | Open |
**Description:** Stale/misleading comments. (1) The `<summary>` block at `ModbusDriver.cs:411-417` says auto-prohibited ranges are "Cleared by ReinitializeAsync ... or by an explicit re-probe API (not yet shipped)" — the re-probe loop has shipped (#151, `ReprobeLoopAsync`), so the parenthetical is wrong. (2) The comment at `ModbusDriver.cs:700-703` ("On block-level failure mark every member Bad — caller's per-tag fallback won't re-try since handled-set already includes them; auto-split-on-failure is a follow-up") contradicts the actual `catch (ModbusException)` arm below it, which deliberately does not add members to `handled` and does defer to per-tag fallback (and auto-split has shipped via bisection). The empty `foreach (var (idx, _) in block.Members) { }` loop at `ModbusDriver.cs:737-744`, with only a comment body, is dead code from that superseded design.
**Recommendation:** Update the two comments to match the shipped #148/#150/#151 behaviour and delete the empty `foreach` loop in the `catch (ModbusException)` arm.
**Resolution:** _(open)_
### Driver.Modbus-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `ModbusDriver.cs:1160-1167`, `ModbusTcpTransport.cs:94-95` |
| Status | Open |
**Description:** Two edge cases. (1) `RegisterCount` for `ModbusDataType.String` computes `(tag.StringLength + 1) / 2`; a tag configured with `StringLength = 0` yields a register count of 0, flowing into `ReadOneAsync` as `totalRegs = 0` and producing an FC03/FC04 with quantity 0 — a spec-illegal request the PLC rejects with exception 03. The factory does not reject `StringLength = 0` for String tags. (2) `EnableKeepAlive` casts `opts.Time.TotalSeconds`/`opts.Interval.TotalSeconds` to `int`; a sub-second configured `TimeSpan` (e.g. 500 ms) truncates to 0, which most OSes reject or interpret as "use default", silently defeating the configured keep-alive timing.
**Recommendation:** Validate `StringLength >= 1` for `String` tags in `ModbusDriverFactoryExtensions.BuildTag`. For keep-alive, round up to a minimum of 1 second or validate the configured `TimeSpan` is a whole number of seconds.
**Resolution:** _(open)_
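Both edge cases come down to small arithmetic guards. The sketch below uses hypothetical helper names (`ModbusConfigGuards`, `StringRegisterCount`, `KeepAliveSeconds`); where such checks would live in the factory or transport is left open.

```csharp
using System;

public static class ModbusConfigGuards
{
    // Same rounding as the finding quotes, (length + 1) / 2, but rejecting lengths
    // below 1 so a zero-quantity FC03/FC04 request can never be built.
    public static int StringRegisterCount(int stringLength)
    {
        if (stringLength < 1)
            throw new ArgumentOutOfRangeException(
                nameof(stringLength), "String tags need StringLength >= 1.");
        return (stringLength + 1) / 2;
    }

    // Round a configured keep-alive TimeSpan up to whole seconds, never below 1,
    // instead of truncating a sub-second value to 0.
    public static int KeepAliveSeconds(TimeSpan value) =>
        Math.Max(1, (int)Math.Ceiling(value.TotalSeconds));
}
```

For example, a 500 ms keep-alive becomes 1 second rather than 0, and a 5-character string occupies 3 registers.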
### Driver.Modbus-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `ModbusDriver.cs:864-868`, `ModbusDriverOptions.cs:116-125` |
| Status | Open |
**Description:** When `WriteOnChangeOnly` is enabled and `IsRedundantWrite` returns true, `WriteAsync` returns `WriteResult(0u)` (Good) without touching the wire. The suppression baseline (`_lastWrittenByRef`) is only invalidated by a *read* that returns a divergent value. If a driver instance has `WriteOnChangeOnly = true` but a tag is never subscribed/read (write-only setpoint), a value the operator believes was re-asserted is silently suppressed forever after the first write — no time- or count-based expiry exists. The option XML doc describes the read-invalidation path but does not warn about write-only tags.
**Recommendation:** Document the write-only-tag caveat on the `WriteOnChangeOnly` option, or add an optional TTL to the suppression cache so a periodic re-write still reaches the PLC.
**Resolution:** _(open)_
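The optional TTL mentioned in the recommendation might look like the sketch below. `WriteSuppressionCache` and its members are hypothetical and are not the driver's `_lastWrittenByRef` implementation; the clock is passed in explicitly to keep the example deterministic.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class WriteSuppressionCache
{
    private readonly ConcurrentDictionary<string, (object Value, DateTime WrittenUtc)> _last = new();
    private readonly TimeSpan _ttl;

    public WriteSuppressionCache(TimeSpan ttl) => _ttl = ttl;

    // True only when the same value was written to this tag within the TTL window.
    public bool IsRedundant(string tagRef, object value, DateTime nowUtc) =>
        _last.TryGetValue(tagRef, out var entry)
        && Equals(entry.Value, value)
        && nowUtc - entry.WrittenUtc < _ttl;

    // Call after a write actually reaches the wire.
    public void RecordWrite(string tagRef, object value, DateTime nowUtc) =>
        _last[tagRef] = (value, nowUtc);
}
```

With a TTL in place, a write-only setpoint that is re-asserted after the window expires reaches the PLC again even though no read ever invalidated the baseline.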
### Driver.Modbus-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `ModbusDriver.cs:23-43,89-97,408-432` |
| Status | Open |
**Description:** Field and member declarations are interleaved with methods throughout `ModbusDriver`. `ResolveHost` (a public method) is the first member of the class, followed by `BuildSlaveHostName`, then a block of fields; `_lastPublishedByRef`/`_lastWrittenByRef` are declared after the constructor; `ProhibitionState`, `_autoProhibited`, and `_reprobeCts` are declared mid-file between `DecodeRegisterArray` and `RangeIsAutoProhibited`. There are also two near-identical `<summary>` blocks stacked back-to-back at `ModbusDriver.cs:411-423`. This hurts readability of a 1400-line file and makes the field inventory hard to audit (relevant to the thread-safety findings above).
**Recommendation:** Group all instance fields at the top of the class, move nested types together, and remove the orphaned first `<summary>` at lines 411-417 that no longer precedes a member.
**Resolution:** _(open)_
### Driver.Modbus-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Tests/` |
| Status | Open |
**Description:** The unit suite is broad (coalescing, bisection, auto-recovery, byte order, arrays, BCD, RMW, caps, multi-unit, probe, reconnect, subscription). Gaps relative to the findings above: (1) no test exercises concurrent multi-subscription publishing, so the `_lastPublishedByRef` race (Driver.Modbus-001) is uncaught; (2) no test covers `ReinitializeAsync` state hygiene for stale `_tagsByName`/caches (Driver.Modbus-002); (3) no test feeds a malformed/short response PDU through `ReadRegisterBlockAsync`/`DecodeBitArray` to confirm a clean `BadCommunicationError` rather than an index-range crash (Driver.Modbus-005); (4) no test asserts `DisposeAsync` (vs `ShutdownAsync`) tears down the probe/re-probe loops and `_poll` (Driver.Modbus-004).
**Recommendation:** Add unit tests for concurrent deadband publishing across two subscriptions, `ReinitializeAsync` state hygiene, malformed-response handling in the register/bit block readers, and `DisposeAsync` loop teardown.
**Resolution:** _(open)_
# Code Review — Driver.OpcUaClient
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 2 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.OpcUaClient-001, -002, -003, -010, -011 |
| 2 | OtOpcUa conventions | Driver.OpcUaClient-004 |
| 3 | Concurrency & thread safety | Driver.OpcUaClient-005, -006, -007 |
| 4 | Error handling & resilience | Driver.OpcUaClient-002, -008, -009 |
| 5 | Security | Driver.OpcUaClient-012 |
| 6 | Performance & resource management | Driver.OpcUaClient-013, -014 |
| 7 | Design-document adherence | Driver.OpcUaClient-004, -013, -015 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.OpcUaClient-015 |
| 10 | Documentation & comments | Driver.OpcUaClient-011 |
## Findings
### Driver.OpcUaClient-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientDriver.cs:444`, `:466`, `:517`, `:540`, `:599`, `:610` |
| Status | Resolved |
**Description:** ReadAsync, WriteAsync, and DiscoverAsync capture the session into a local variable via RequireSession() before acquiring `_gate`, then perform the wire call on that captured reference inside the gate. The reconnect path (OnReconnectComplete, line 1330) swaps `Session` to a brand-new ISession. A read that captured the pre-reconnect session at line 444, then blocked on `_gate.WaitAsync` while a reconnect completed, issues ReadAsync against a stale/closed session. The catch block then fans out BadCommunicationError for the whole batch even though the driver is healthy on the new session, and the operation is silently lost. The gate does not protect against the session being swapped underneath a waiter.
**Recommendation:** Re-read `Session` inside the `_gate` critical section (after WaitAsync returns), or route the session swap itself through `_gate` so a swap cannot interleave with a gated operation.
**Resolution:** Resolved 2026-05-22 — ReadAsync/WriteAsync/DiscoverAsync now re-read `Session` (and parse NodeIds) inside the `_gate` critical section after `WaitAsync` returns; a session swapped in by a concurrent reconnect is the one used for the wire call.
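The "re-read inside the gate" pattern can be sketched generically. `GatedSession<TSession>` is a hypothetical wrapper and `TSession` stands in for the SDK session type; nothing here calls the real OPC UA SDK.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class GatedSession<TSession> where TSession : class
{
    private readonly SemaphoreSlim _gate = new(1, 1);
    private TSession? _session;

    // Called by the reconnect path when a new session replaces the old one.
    public void Swap(TSession? next) => Volatile.Write(ref _session, next);

    public async Task<TResult> RunAsync<TResult>(
        Func<TSession, Task<TResult>> op, CancellationToken ct)
    {
        await _gate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            // Read the session only after the gate is held, so a swap that
            // completed while this caller was waiting is the one that gets used.
            var session = Volatile.Read(ref _session)
                ?? throw new InvalidOperationException("No active session.");
            return await op(session).ConfigureAwait(false);
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

Capturing the session before `WaitAsync` is the bug described in the finding; moving the read below it is the whole fix, and the operation delegate never sees a reference older than its own gate acquisition.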
### Driver.OpcUaClient-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | `OpcUaClientDriver.cs:1330-1359` |
| Status | Resolved |
**Description:** OnReconnectComplete handles only the success case. When SessionReconnectHandler gives up (its retry loop exhausts the 2-minute maxReconnectPeriod), it invokes the callback with `handler.Session == null`. The code sets `Session = null`, disposes the handler, and sets `_reconnectHandler = null`, but leaves `_health` at whatever it was (typically Degraded) and `_hostState` at Stopped. There is no further reconnect attempt (the handler is gone, and OnKeepAlive only fires on a live session which no longer exists), and DriverState is never set to Faulted. The driver is permanently wedged: no session, no reconnect loop, no Faulted signal for the Core, and ReinitializeAsync is never triggered. This is the single largest gateway resilience gap.
**Recommendation:** In OnReconnectComplete, when newSession is null, set `_health` to a Faulted DriverHealth with an explanatory message so the Core can fan out Bad quality and offer an operator reinitialize. Consider re-arming a fresh reconnect attempt rather than giving up entirely for an always-on gateway.
**Resolution:** Resolved 2026-05-22 — OnReconnectComplete's give-up branch now transitions HostState to Faulted, sets a Faulted DriverHealth with an explanatory message, and re-arms a fresh SessionReconnectHandler (`TryRearmReconnect`) against the last-known session so an always-on gateway self-heals while the Core can still offer an operator reinitialize.
### Driver.OpcUaClient-003
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientDriver.cs:644-711` |
| Status | Resolved |
**Description:** BrowseRecursiveAsync calls session.BrowseAsync with `requestedMaxReferencesPerNode: 0` but never follows browse continuation points. OPC UA servers enforce a server-side max-references-per-node limit; when a node has more children than the server returns in one response, BrowseResult.ContinuationPoint is non-empty and the caller must issue BrowseNext to retrieve the remainder. This driver discards the continuation point, so any folder on the remote server with a large child set is silently truncated: discovered tags go missing from the local address space with no error. For the tens-of-thousands-of-nodes scenario the options doc targets (MaxDiscoveredNodes = 10000), this is a realistic and silent data-completeness bug.
**Recommendation:** After processing resp.Results[0].References, check resp.Results[0].ContinuationPoint; while non-empty, call session.BrowseNextAsync and append the additional references before recursing/registering.
**Resolution:** Resolved 2026-05-22 — BrowseRecursiveAsync now loops on the BrowseResult.ContinuationPoint, calling `session.BrowseNextAsync` and appending each page of references until the continuation point is empty, so large remote folders are no longer silently truncated.
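The continuation-point loop has the same shape regardless of SDK. In the sketch below, `BrowsePage`, `browse`, and `browseNext` are hypothetical stand-ins for the SDK's Browse and BrowseNext calls, and references are plain strings; the real signatures are not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// One page of results; ContinuationPoint is null or empty when nothing remains.
public sealed record BrowsePage(IReadOnlyList<string> References, byte[]? ContinuationPoint);

public static class BrowsePaging
{
    public static async Task<List<string>> BrowseAllAsync(
        Func<Task<BrowsePage>> browse,
        Func<byte[], Task<BrowsePage>> browseNext)
    {
        var all = new List<string>();

        var page = await browse().ConfigureAwait(false);
        all.AddRange(page.References);

        // Keep paging until the server stops handing back a continuation point.
        while (page.ContinuationPoint is { Length: > 0 } continuationPoint)
        {
            page = await browseNext(continuationPoint).ConfigureAwait(false);
            all.AddRange(page.References);
        }

        return all;
    }
}
```

Collecting every page before recursing into children keeps the recursion logic unchanged; only the "list children of one node" step becomes a loop.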
### Driver.OpcUaClient-004
| Field | Value |
|---|---|
| Severity | High |
| Category | Design-document adherence |
| Location | `OpcUaClientDriver.cs:596-632`, `:789`, `OpcUaClientDriverOptions.cs` |
| Status | Resolved |
**Description:** docs/v2/driver-specs.md section 8 mandates two features that are absent. (1) Namespace remapping: the spec requires building a bidirectional namespace map at connect time from session.NamespaceUris. The driver instead stores the raw upstream NodeId string (pv.NodeId.ToString()) as DriverAttributeInfo.FullName and re-parses it verbatim for reads/writes. The namespace index embedded in `ns=N;...` is server-session-relative; if the upstream server reorders its namespace table across a restart (permitted by the spec), every stored ns=N reference points at the wrong namespace and reads/writes silently address wrong nodes. (2) TargetNamespaceKind enforcement: section 8 requires the driver to enforce Equipment-vs-SystemPlatform choice at startup and fail draft validation on misconfiguration; OpcUaClientDriverOptions has no such knob.
**Recommendation:** Build a namespace-URI map from session.NamespaceUris at connect time and store NodeIds in a server-stable form (namespace URI plus identifier) rather than session-relative ns=N. Add the TargetNamespaceKind option and the startup validation section 8 describes, or document explicitly why the design deviates.
**Resolution:** Resolved 2026-05-22 — new `NamespaceMap` (built from session.NamespaceUris at connect and rebuilt on reconnect) persists discovered NodeIds in the server-stable `nsu=<uri>;…` form; reads/writes re-resolve that form against the current session so a remote namespace-table reorder no longer misaddresses nodes. Added the `TargetNamespaceKind` option + `UnsMappingTable` and `ValidateNamespaceKind`, which fails draft validation for an Equipment instance lacking a UNS mapping or a SystemPlatform instance carrying one.
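The conversion between session-relative and server-stable NodeId strings can be illustrated with plain string handling. `NamespaceMapSketch` is a hypothetical simplification, not the driver's `NamespaceMap`: it assumes namespace URIs contain no `;` and does no escaping.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class NamespaceMapSketch
{
    private readonly List<string> _uris;

    // namespaceUris is the session's namespace table, indexed by namespace index.
    public NamespaceMapSketch(IEnumerable<string> namespaceUris) =>
        _uris = new List<string>(namespaceUris);

    // "ns=2;s=Tank.Level" becomes "nsu=<uri of index 2>;s=Tank.Level".
    public string ToStable(string nodeId)
    {
        if (!nodeId.StartsWith("ns=", StringComparison.Ordinal)) return nodeId;
        int semi = nodeId.IndexOf(';');
        if (semi < 0) throw new FormatException($"Malformed NodeId '{nodeId}'.");
        int index = int.Parse(nodeId.Substring(3, semi - 3), CultureInfo.InvariantCulture);
        return "nsu=" + _uris[index] + nodeId.Substring(semi);
    }

    // Resolves the stable form against the table this map was built from.
    public string ToSession(string stableId)
    {
        if (!stableId.StartsWith("nsu=", StringComparison.Ordinal)) return stableId;
        int semi = stableId.IndexOf(';');
        if (semi < 0) throw new FormatException($"Malformed NodeId '{stableId}'.");
        string uri = stableId.Substring(4, semi - 4);
        int index = _uris.IndexOf(uri);
        if (index < 0) throw new KeyNotFoundException($"Namespace URI '{uri}' is not in the session table.");
        return "ns=" + index.ToString(CultureInfo.InvariantCulture) + stableId.Substring(semi);
    }
}
```

Persisting the `nsu=` form and resolving it against a map rebuilt on every (re)connect is what makes a reordered namespace table harmless: the URI is the key, and the index is derived per session.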
### Driver.OpcUaClient-005
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientDriver.cs:1297-1319` |
| Status | Resolved |
**Description:** OnKeepAlive reads and writes `_reconnectHandler` without any lock: `if (_reconnectHandler is not null) return;` followed by `_reconnectHandler = new SessionReconnectHandler(...)`. Keep-alive callbacks are raised from the SDK keep-alive timer thread; on a bad keep-alive the SDK can fire the handler repeatedly while the channel stays down. Two callbacks racing through the check-then-set both observe null, both construct a SessionReconnectHandler, both call BeginReconnect, and the second assignment overwrites the first handler, leaking the first handler (its retry loop keeps running, unreferenced and never disposed) and creating two competing reconnect loops. ShutdownAsync then only cancels/disposes the one that won the assignment race.
**Recommendation:** Guard the `_reconnectHandler` check-and-set with `_probeLock` (already held for `_hostState`), or use Interlocked.CompareExchange to ensure exactly one handler is constructed per drop.
**Resolution:** Resolved 2026-05-22 — the `_reconnectHandler` check-and-set in OnKeepAlive (and the take-and-clear in ShutdownAsync, plus the dispose/re-arm in OnReconnectComplete/TryRearmReconnect) now run inside the `_probeLock` critical section, so exactly one SessionReconnectHandler is constructed per drop and a racing keep-alive callback cannot leak a handler.
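The resolution guards the check-and-set with `_probeLock`. The lock-free alternative named in the recommendation, `Interlocked.CompareExchange`, can be sketched as below; `SingleReconnectArm<THandler>` is a hypothetical helper and `THandler` stands in for the SDK's reconnect handler type.

```csharp
using System;
using System.Threading;

public sealed class SingleReconnectArm<THandler> where THandler : class, IDisposable
{
    private THandler? _handler;

    // Returns true only for the caller whose handler was actually installed.
    public bool TryArm(Func<THandler> factory)
    {
        // Fast path: a handler is already armed, so do not construct another.
        if (Volatile.Read(ref _handler) is not null) return false;

        var candidate = factory();
        if (Interlocked.CompareExchange(ref _handler, candidate, null) is null)
            return true;

        // Another thread won the race between the check and the exchange.
        candidate.Dispose();
        return false;
    }

    // Used on shutdown or reconnect completion: atomically take ownership and clear.
    public THandler? TakeAndClear() => Interlocked.Exchange(ref _handler, null);
}
```

Note the trade-off: the losing thread still constructs a handler before disposing it, so the caller should start the reconnect only when `TryArm` returns true. A lock avoids even that transient construction, which may be one reason the shipped fix uses `_probeLock`.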
### Driver.OpcUaClient-006
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientDriver.cs:1330-1359` |
| Status | Resolved |
**Description:** OnReconnectComplete mutates `Session` (line 1347) directly from the reconnect-handler callback thread with no synchronization against ReadAsync/WriteAsync/DiscoverAsync/ShutdownAsync. Session is a plain auto-property with no memory barrier; a concurrent reader on another thread may observe a stale reference. ShutdownAsync (line 425) can also run concurrently with OnReconnectComplete: ShutdownAsync disposes the session and sets Session = null while OnReconnectComplete sets Session = newSession, and the interleaving is unspecified, potentially leaving a live session leaked after shutdown.
**Recommendation:** Route all Session mutations through a single lock (or the `_gate`). Make ShutdownAsync cancel the reconnect handler and wait for any in-flight OnReconnectComplete to settle before disposing the session.
**Resolution:** Resolved 2026-05-22 — All Session mutations (assignment to newSession in OnReconnectComplete, and assignment to null in ShutdownAsync) now run inside the `_probeLock` critical section, preventing races between the reconnect callback thread, ShutdownAsync, and keep-alive callbacks. KeepAlive handler detach/attach is also done under `_probeLock` so a keep-alive cannot fire against the old session after the swap.
### Driver.OpcUaClient-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `OpcUaClientDriver.cs:1374`, `:1376-1383`, `:508` |
| Status | Resolved |
**Description:** Two disposal races. (1) Dispose() does `DisposeAsync().AsTask().GetAwaiter().GetResult()`, synchronous blocking on async work. The Galaxy stability review (driver-stability.md, the 2026-04-13 findings) explicitly calls out sync-over-async on the OPC UA stack thread as a closed bug class; if Dispose() runs on the OPC UA stack thread or any thread the SDK continuations need, this deadlocks. (2) DisposeAsync disposes `_gate` (line 1382) after ShutdownAsync returns, but ShutdownAsync does not drain in-flight ReadAsync/WriteAsync operations holding `_gate`. An in-flight read that calls `_gate.Release()` (line 508) after `_gate.Dispose()` throws ObjectDisposedException on a background thread.
**Recommendation:** Provide an async disposal path callers prefer; if a sync Dispose() is unavoidable keep it free of .GetResult() on SDK-thread-affine work. Before disposing `_gate`, acquire it once so all in-flight gated operations have completed, or guard releases against disposal.
**Resolution:** Resolved 2026-05-22 — `Dispose()` no longer calls `.GetAwaiter().GetResult()` on async work; it performs a purely-synchronous teardown (cancel reconnect handler, detach keep-alive, null Session under `_probeLock`). Both `Dispose()` and `DisposeAsync()` now acquire `_gate` once before disposing it, ensuring any in-flight gated operation has released before the gate is torn down.
### Driver.OpcUaClient-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `OpcUaClientDriver.cs:1092-1099` |
| Status | Resolved |
**Description:** AcknowledgeAsync issues the batched CallAsync and then catches all exceptions with a best-effort empty catch; it also never inspects the per-call results in the success path (`_ = await session.CallAsync(...)`). An alarm acknowledgment the upstream server rejects (BadConditionAlreadyAcked, BadNodeIdUnknown, BadUserAccessDenied) is reported as success to the caller. IAlarmSource.AcknowledgeAsync has no per-item result, so the only way a failure could surface is via an exception, and the catch suppresses even that. Operators acking a critical alarm get no signal that the ack did not take.
**Recommendation:** Inspect CallMethodResult.StatusCode for each result and log Bad codes; rethrow (or surface via driver health) genuine transport failures rather than swallowing them. Consider extending the contract so per-ack failures propagate.
**Resolution:** Resolved 2026-05-22 — `AcknowledgeAsync` now inspects each `CallMethodResult.StatusCode` in the success path and logs a Warning for any Bad code (BadConditionAlreadyAcked, BadNodeIdUnknown, BadUserAccessDenied, etc.). `OperationCanceledException` (transport timeout) is now re-thrown instead of swallowed; other transport exceptions are also logged with the driver instance ID. Requires `ILogger<OpcUaClientDriver>` injected via new optional constructor parameter.
### Driver.OpcUaClient-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `OpcUaClientDriver.cs:560-564` |
| Status | Resolved |
**Description:** WriteAsync's catch block fans out BadCommunicationError across the whole batch on any exception. Writes are non-idempotent by default (IWritable remarks, decision #44/#45): a timeout exception may fire after the upstream server already applied the write. Reporting BadCommunicationError (a code that reads as "definitely did not happen") for a write that may have succeeded is misleading; a downstream OPC UA client may assume a retry is safe, re-issue the write, and double-apply it. The read path has the same fan-out but reads are idempotent so it is benign there; for writes the ambiguity matters.
**Recommendation:** Map write timeouts/cancellations to BadTimeout (which downstream correctly treats as "outcome unknown, do not blindly retry") rather than BadCommunicationError, and only use BadCommunicationError for failures that provably occurred before the request reached the wire.
**Resolution:** Resolved 2026-05-22 — `WriteAsync`'s inner catch block now handles `OperationCanceledException` (timeout/cancellation) separately, mapping it to `BadTimeout` (0x800A0000), while all other exceptions map to `BadCommunicationError`. The session-null pre-wire exit still correctly uses `BadCommunicationError`.
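The mapping described in the resolution reduces to a small switch. `WriteFailureMapping` is a hypothetical helper; the two constants are the numeric values of the standard OPC UA status codes (the `BadTimeout` value is the one quoted above).

```csharp
using System;

public static class WriteFailureMapping
{
    public const uint BadCommunicationError = 0x80050000;
    public const uint BadTimeout = 0x800A0000;

    // Timeout or cancellation: the write may already have been applied upstream,
    // so report "outcome unknown". Everything else maps to a comms error.
    public static uint ToStatusCode(Exception ex) => ex switch
    {
        OperationCanceledException => BadTimeout, // also matches TaskCanceledException
        _ => BadCommunicationError,
    };
}
```

Because `TaskCanceledException` derives from `OperationCanceledException`, a single arm covers both the explicit-cancellation and the timed-out-task cases.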
### Driver.OpcUaClient-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `OpcUaClientDriver.cs:823-824` |
| Status | Resolved |
**Description:** MapUpstreamDataType maps DataTypeIds.Byte (the OPC UA unsigned 8-bit type) to DriverDataType.Int16. Byte should map to an unsigned driver type (UInt16 is the smallest unsigned available, matching how SByte belongs with the signed family). Mapping an unsigned 0-255 type onto signed Int16 misrepresents the type metadata downstream: clients see a signed type for an unsigned source, and any range/validation logic keyed off the driver data type is wrong. SByte correctly belongs with Int16; Byte does not.
**Recommendation:** Map DataTypeIds.Byte to DriverDataType.UInt16 (or add a Byte/UInt8 driver type if the enum supports finer granularity), keeping SByte and Int16 on the signed Int16 mapping.
**Resolution:** Resolved 2026-05-22 — `MapUpstreamDataType` now maps `DataTypeIds.Byte` to `DriverDataType.UInt16` (unsigned family) while `DataTypeIds.SByte` remains on `DriverDataType.Int16` (signed family). Test `MapUpstreamDataType_Byte_maps_to_UInt16_unsigned_family` asserts the fix and `MapUpstreamDataType_maps_Byte_to_UInt16_not_Int16` guards the regression.
### Driver.OpcUaClient-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `OpcUaClientDriver.cs:783-784` |
| Status | Open |
**Description:** The comment on the isArray computation states "-1 = scalar; 1+ = array dimensions; 0 = one-dimensional array". This is inaccurate against OPC UA ValueRank semantics: -3 is ScalarOrOneDimension, -2 is Any, -1 is Scalar, and 0 is OneOrMoreDimensions (not specifically one-dimensional). The code `valueRank >= 0` treats -2 (Any) and -3 (ScalarOrOneDimension) as scalar, which is a defensible default, but the comment misdescribes the constants and would mislead a maintainer.
**Recommendation:** Correct the comment to the actual ValueRank constants (-3 ScalarOrOneDimension, -2 Any, -1 Scalar, 0 OneOrMoreDimensions, 1 OneDimension, >1 multi-dim) and state the deliberate choice that anything >= 0 is treated as an array.
**Resolution:** _(open)_
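One way to make the corrected comment self-enforcing is to name the constants. `ValueRankSketch` is a hypothetical helper; the constant values are the ValueRank values listed in the recommendation.

```csharp
public static class ValueRankSketch
{
    public const int ScalarOrOneDimension = -3;
    public const int Any = -2;
    public const int Scalar = -1;
    public const int OneOrMoreDimensions = 0;
    public const int OneDimension = 1; // values above 1 give the exact dimension count

    // Deliberate choice described in the finding: anything >= 0 is treated as an
    // array; Scalar, Any, and ScalarOrOneDimension are all surfaced as scalar.
    public static bool IsArray(int valueRank) => valueRank >= OneOrMoreDimensions;
}
```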
### Driver.OpcUaClient-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `OpcUaClientDriver.cs:210-217` |
| Status | Resolved |
**Description:** When AutoAcceptCertificates is true the driver registers a CertificateValidation handler that accepts only StatusCodes.BadCertificateUntrusted. A self-signed or otherwise untrusted server certificate frequently fails validation with a different code first (BadCertificateChainIncomplete, BadCertificateTimeInvalid, BadCertificateHostNameInvalid), so auto-accept silently does not accept many real dev certificates and the connect fails confusingly. The handler is added to config.CertificateValidator but never removed; each driver instance leaks a delegate subscription on a validator that may be process-shared. The option doc says auto-accept is dev-only and must be false in production, but there is no runtime guard preventing AutoAcceptCertificates=true shipping to production and no log warning when it is enabled.
**Recommendation:** When auto-accepting for dev, accept the full set of certificate-validation error codes (or use the SDK AutoAcceptUntrustedCertificates path consistently). Emit a prominent warning log every time AutoAcceptCertificates is enabled so a production misconfiguration is visible. Detach the handler on shutdown.
**Resolution:** Resolved 2026-05-22 — The cert-validation handler now accepts ALL validation errors (not only BadCertificateUntrusted) when `AutoAcceptCertificates=true`, so real dev certs with chain/host/time errors work. A `LogWarning` is emitted at startup whenever the flag is set. The handler delegate + validator reference are stored in `_certValidationHandler`/`_certValidatorRef` and detached in both `ShutdownAsync` and `Dispose()`/`DisposeAsync()` to prevent the delegate leak.
### Driver.OpcUaClient-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `OpcUaClientDriver.cs:436-437` |
| Status | Resolved |
**Description:** GetMemoryFootprint() is hard-coded to return 0 and FlushOptionalCachesAsync is a no-op Task.CompletedTask. docs/v2/driver-stability.md section "In-process only (Tier A/B)" makes per-instance allocation tracking a contract requirement, and driver-specs.md section 8 explicitly calls out browse-cache memory: BrowseStrategy=Full against a large remote server can cache tens of thousands of node descriptions and the per-instance budget should bound this. Returning 0 means the Core 30-second footprint poll can never detect this driver's browse-cache growth, and the escalation path from cache-budget breach to flush is dead code. A gateway pointed at a 10k-node server (the configured cap) silently evades the Tier-A memory-guard mechanism.
**Recommendation:** Track an approximate footprint for the discovered-node set and any cached browse state, return it from GetMemoryFootprint(), and implement FlushOptionalCachesAsync to drop droppable cache. If the driver genuinely holds no significant cache, document why 0 is correct.
**Resolution:** Resolved 2026-05-22 — `DiscoverAsync` now updates a `_discoveredNodeCount` volatile counter after each pass. `GetMemoryFootprint()` returns `_discoveredNodeCount * 512` (conservative ~512 bytes per node for DriverAttributeInfo + strings). `FlushOptionalCachesAsync` resets `_discoveredNodeCount` to 0, signalling Core that re-discovery will rebuild cleanly. A 10k-node server now reports ~5 MB to the Core slope alarm rather than 0.
### Driver.OpcUaClient-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `OpcUaClientDriver.cs:904`, `:1035` |
| Status | Open |
**Description:** `MonitoredItem.Notification += (mi, args) => ...` (and the alarm-event equivalent) attaches a closure-capturing lambda to each monitored item's event. The lambda is never detached. When UnsubscribeAsync removes a subscription it calls Subscription.DeleteAsync but does not clear the MonitoredItem.Notification handlers; if the SDK retains the MonitoredItem/Subscription graph anywhere (the session keeps a reference until its own disposal, or during transfer-on-reconnect), the driver instance is kept alive by the closure longer than necessary.
**Recommendation:** Detach the Notification handlers when deleting a subscription, or hold the handler delegate so it can be explicitly removed in UnsubscribeAsync/ShutdownAsync.
**Resolution:** _(open)_
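One way to follow the recommendation is to keep the delegate in a field so the same instance can be removed later. The sketch below uses a hypothetical `FakeMonitoredItem` in place of the SDK type, so it shows the pattern only and makes no claim about the OPC UA SDK's actual event signature.

```csharp
using System;

// Hypothetical stand-in for an SDK object that raises notifications.
public sealed class FakeMonitoredItem
{
    public event EventHandler<string>? Notification;
    public void Raise(string value) => Notification?.Invoke(this, value);
    public bool HasSubscribers => Notification is not null;
}

public sealed class SubscriptionHandle
{
    private readonly FakeMonitoredItem _item;
    // Stored so that the exact same delegate instance can be detached later.
    private readonly EventHandler<string> _handler;

    public int Received { get; private set; }

    public SubscriptionHandle(FakeMonitoredItem item)
    {
        _item = item;
        _handler = (_, _) => Received++;
        _item.Notification += _handler;
    }

    // Detaching breaks the item -> closure -> owner reference chain.
    public void Unsubscribe() => _item.Notification -= _handler;
}
```

An anonymous lambda attached inline cannot be removed with `-=` afterwards, which is why the delegate is held in a field here.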
### Driver.OpcUaClient-015
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/*`, `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests/OpcUaClientSmokeTests.cs` |
| Status | Resolved |
**Description:** Unit-test coverage is solid for the pure mappers (MapSeverity, MapUpstreamDataType, MapSecurityPolicy, MapAggregateToNodeId, BuildCertificateIdentity, ResolveEndpointCandidates) and for "throws before init" guards, but the highest-risk behaviours of a gateway driver have no test: the reconnect/session-swap path (OnKeepAlive to OnReconnectComplete, findings -001/-002/-005/-006), browse continuation-point handling (-003), the cascading-quality fan-out on a mid-batch transport failure, and namespace remapping (-004). The reconnect test file itself states wire-level disconnect-reconnect-resume coverage lands with the in-process fixture, i.e. the single largest gateway bug surface (per driver-specs.md section 8) is explicitly untested. The integration suite is Docker-fixture gated against opc-plc and is a smoke test only. The failed-reconnect-to-Faulted and concurrent-keep-alive races are pure-logic paths testable with a fake ISession.
**Recommendation:** Add tests exercising the reconnect callbacks with a stub session (success and give-up cases), a browse test with a paged/continuation-point server stub, and a read-batch test asserting upstream Bad StatusCodes pass through verbatim while a transport throw fans out the local fault code.
**Resolution:** Resolved 2026-05-22 — Added `OpcUaClientMediumFindingsRegressionTests.cs` covering: (1) BadTimeout vs BadCommunicationError status-code distinction for the write-timeout path (Driver.OpcUaClient-009); (2) Byte→UInt16 mapping regression (Driver.OpcUaClient-010); (3) AutoAcceptCertificates warning log assertion (Driver.OpcUaClient-012); (4) GetMemoryFootprint/FlushOptionalCachesAsync contract (Driver.OpcUaClient-013); (5) MapSeverity thresholds, pre-init health, Session null pre-init, GetHostStatuses contract. Wire-level reconnect callback tests remain fixture-gated pending the in-process OPC UA server fixture.
+209
@@ -0,0 +1,209 @@
# Code Review — Driver.S7.Cli
| Field | Value |
|---|---|
| Module | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 4 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.S7.Cli-001, Driver.S7.Cli-002 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | No issues found |
| 4 | Error handling & resilience | Driver.S7.Cli-001, Driver.S7.Cli-003 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Driver.S7.Cli-004 |
| 7 | Design-document adherence | Driver.S7.Cli-002 |
| 8 | Code organization & conventions | Driver.S7.Cli-005 |
| 9 | Testing coverage | Driver.S7.Cli-006 |
| 10 | Documentation & comments | Driver.S7.Cli-002, Driver.S7.Cli-007 |
## Findings
### Driver.S7.Cli-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli/Commands/WriteCommand.cs:65-80` |
| Status | Resolved |
**Description:** `WriteCommand.ParseValue` parses numeric and `DateTime` values with the
raw BCL parsers (`short.Parse`, `float.Parse`, `DateTime.Parse`, etc.). On malformed
input these throw `FormatException` / `OverflowException`, which are *not*
`CliFx.Exceptions.CommandException`. CliFx renders a `CommandException` as a clean
one-line error with a non-zero exit code, but renders any other exception as a full
.NET stack trace. The `ParseValue` bool path is handled correctly (it throws
`CommandException` for unrecognised input), so the command is internally inconsistent:
`write -t Bool -v maybe` gives a friendly message while `write -t Int16 -v xyz` dumps a
stack trace. The module's own test `ParseValue_non_numeric_for_numeric_types_throws`
asserts that the raw `FormatException` leaks, confirming the behaviour is unintended-but-shipped.
**Recommendation:** Wrap the numeric / `DateTime` parses in a `try`/`catch` that
re-throws `FormatException` and `OverflowException` as
`CliFx.Exceptions.CommandException` with a message that names the `--type` and the
offending value — matching the bool path. Update the test to expect `CommandException`.
**Resolution:** Resolved 2026-05-22 — wrapped all numeric/DateTime BCL parses in `try/catch(FormatException)` and `try/catch(OverflowException)` that re-throw as `CommandException` with a message naming the `--type` and the offending value; updated `ParseValue_non_numeric_for_numeric_types_throws` to assert `CommandException`, and added an overflow-edge test.
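The wrapping described above might look roughly like this. It is a sketch under assumptions: `CommandException` here is a local stand-in for `CliFx.Exceptions.CommandException` so the block compiles without the package, `ValueParser.ParseNumeric` is a hypothetical helper rather than the real `ParseValue`, and a single exception filter is used where the resolution describes separate catches.

```csharp
using System;
using System.Globalization;

// Local stand-in for CliFx.Exceptions.CommandException (assumption for self-containment).
public sealed class CommandException : Exception
{
    public CommandException(string message, Exception inner) : base(message, inner) { }
}

public static class ValueParser
{
    // Hypothetical helper; the real ParseValue covers more types.
    public static object ParseNumeric(string typeName, string raw)
    {
        try
        {
            // Each arm is cast to object so the value keeps its own boxed type
            // instead of being widened to a common numeric type.
            return typeName switch
            {
                "Byte" => (object)byte.Parse(raw, CultureInfo.InvariantCulture),
                "Int16" => (object)short.Parse(raw, CultureInfo.InvariantCulture),
                "Float32" => (object)float.Parse(raw, CultureInfo.InvariantCulture),
                _ => throw new CommandException(
                    $"Unsupported --type {typeName}.", new NotSupportedException()),
            };
        }
        catch (Exception ex) when (ex is FormatException or OverflowException)
        {
            // Re-throw as the CLI's own exception so the user sees one clean line.
            throw new CommandException(
                $"Value '{raw}' is not valid for --type {typeName}.", ex);
        }
    }
}
```

Both `ParseNumeric("Int16", "xyz")` (format error) and `ParseNumeric("Byte", "256")` (overflow) then surface as `CommandException` with the original exception preserved as the inner exception.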
### Driver.S7.Cli-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli/Commands/ReadCommand.cs:22-29`, `Commands/WriteCommand.cs:21-33`, `Commands/SubscribeCommand.cs:18-21`; `docs/Driver.S7.Cli.md:70-73,80-81` |
| Status | Resolved |
**Description:** The `--type` option help text on `read`, `write`, and `subscribe`
advertises the full `S7DataType` set (`Int64 / UInt64 / Float64 / String / DateTime`),
and `docs/Driver.S7.Cli.md` shows a worked `read ... -t String --string-length 80`
example plus a `--string-length` flag on `read`/`write`. The underlying `S7Driver`
(`S7Driver.cs:241-245` for reads, `:316-320` for writes) throws `NotSupportedException`
for `Int64`, `UInt64`, `Float64`, `String`, and `DateTime` — the driver maps that to
`BadNotSupported`. Consequently every CLI invocation using one of those types — and the
documented `--string-length` string-read example — fails at runtime with
`0x803D0000 (Bad)`. The CLI surface and docs promise capability the driver does not yet
implement.
**Recommendation:** Either (a) trim the `--type` help text and the `--string-length`
flag/examples to the implemented set (`Bool / Byte / Int16 / UInt16 / Int32 / UInt32 /
Float32`) until the follow-up driver PR lands, or (b) keep the surface but add a one-line
"types beyond Float32 are not yet implemented and surface BadNotSupported" caveat to the
help text and `docs/Driver.S7.Cli.md`. Option (a) is preferred so the CLI does not offer
options that cannot succeed.
**Resolution:** Resolved 2026-05-22 — updated the `--type` help text on `read`, `write`, and `subscribe` to list the implemented set (Bool/Byte/Int16/UInt16/Int32/UInt32/Float32) and appended a one-line caveat that Int64/UInt64/Float64/String/DateTime are not yet implemented and will return BadNotSupported.
### Driver.S7.Cli-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli/Commands/ProbeCommand.cs:38-50` |
| Status | Resolved |
**Description:** `ProbeCommand`'s XML doc and the `Driver.S7.Cli.md` "fastest is the
device talking" framing say the probe "connects ... prints health" and "surfaces
`BadNotSupported`" when PUT/GET is disabled. But when the PLC is unreachable (connection
refused, host down, wrong slot), `driver.InitializeAsync` throws and the exception
propagates straight out of `ExecuteAsync` — the code that prints `Host:`, `Health:`,
`Last error:`, and the snapshot is never reached. The most common probe failure (device
not reachable at all) therefore produces a CliFx stack trace rather than the structured
health report the command exists to give. Note PUT/GET-disabled only surfaces during
`ReadAsync` (after a successful connect), so that one path does reach the health print —
but a refused TCP connect does not.
**Recommendation:** Wrap the `InitializeAsync` + `ReadAsync` body in a `try`/`catch` that,
on failure, still prints the `Host:` / `CPU:` lines and a `Health:` / `Last error:`
report derived from `driver.GetHealth()` (which `InitializeAsync` sets to
`Faulted` with the exception message before re-throwing). The probe should report an
unreachable device, not crash on it.
**Resolution:** Resolved 2026-05-22 — wrapped the `InitializeAsync` + `ReadAsync` body in a `try/catch` that on any non-cancellation failure still prints the structured `Host:`, `CPU:`, `Health:`, and `Last error:` lines derived from `driver.GetHealth()`, so an unreachable device produces a health report rather than a stack trace.
### Driver.S7.Cli-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli/Commands/ProbeCommand.cs:36,53`, `Commands/ReadCommand.cs:45,54`, `Commands/WriteCommand.cs:51,60`, `Commands/SubscribeCommand.cs:39,73` |
| Status | Open |
**Description:** Every command declares the driver with `await using var driver = new
S7Driver(...)` and *also* calls `await driver.ShutdownAsync(...)` in a `finally` block.
`S7Driver.DisposeAsync` itself calls `ShutdownAsync`, so shutdown runs twice per command
(three times for `subscribe`, which also unsubscribes). `ShutdownAsync` is idempotent
(`Plc?.Close()` is best-effort, `_subscriptions` is cleared) so there is no functional
bug, but the explicit `finally`-block `ShutdownAsync` call is redundant given the
`await using`. It is also slightly misleading — a reader may assume the `await using` is
not actually disposing.
**Recommendation:** Drop the explicit `await driver.ShutdownAsync(...)` from the
`finally` blocks and rely on `await using` for teardown; keep only the
`subscribe` command's `UnsubscribeAsync`. Alternatively, drop `await using`
and keep the explicit `finally`. Pick one disposal mechanism per command.
**Resolution:** _(open)_
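The double teardown is easy to reproduce in isolation. The sketch below uses a hypothetical `CountingDriver` whose `DisposeAsync` delegates to `ShutdownAsync`, as the description says `S7Driver` does; it is an illustration of the pattern, not the CLI's code.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical driver stand-in that counts teardown calls.
public sealed class CountingDriver : IAsyncDisposable
{
    public int ShutdownCalls { get; private set; }

    public Task ShutdownAsync()
    {
        ShutdownCalls++;
        return Task.CompletedTask;
    }

    // Mirrors the described behaviour: disposal calls ShutdownAsync.
    public async ValueTask DisposeAsync() => await ShutdownAsync();
}

public static class DisposalDemo
{
    // The reviewed pattern: explicit finally plus await using.
    public static async Task<int> RunWithBothAsync()
    {
        var driver = new CountingDriver();
        await using (driver)
        {
            try { /* command body */ }
            finally { await driver.ShutdownAsync(); }
        }
        return driver.ShutdownCalls;
    }

    // The recommended pattern: await using alone.
    public static async Task<int> RunWithAwaitUsingOnlyAsync()
    {
        var driver = new CountingDriver();
        await using (driver) { /* command body */ }
        return driver.ShutdownCalls;
    }
}
```

`RunWithBothAsync` reports two shutdown calls and `RunWithAwaitUsingOnlyAsync` reports one, which is the redundancy the finding describes.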
### Driver.S7.Cli-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli.Tests/` |
| Status | Open |
**Description:** A stale directory `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli.Tests/`
exists containing only an `obj/` folder — no `.csproj`, no source. The real test
project lives at `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli.Tests/`. The empty
directory is a leftover from the project move into `tests/Drivers/Cli/` and is not
referenced by `ZB.MOM.WW.OtOpcUa.slnx`. It is dead clutter that can mislead anyone
grepping the tree for the S7 CLI test project.
**Recommendation:** Delete the stale `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli.Tests/`
directory (including its `obj/`). This is outside the module's `src/` tree but is the
S7 CLI's own orphaned test folder, so it belongs to this module's cleanup.
**Resolution:** _(open)_
### Driver.S7.Cli-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli.Tests/WriteCommandParseValueTests.cs` |
| Status | Open |
**Description:** The only test file covers `WriteCommand.ParseValue` and
`ReadCommand.SynthesiseTagName`. `S7CommandBase.BuildOptions` — which maps the
host / port / CPU / rack / slot / timeout flags onto an `S7DriverOptions` and forces
`Probe.Enabled = false` — has no test, despite being pure, deterministic, and
`internal`-visible to the test assembly via `InternalsVisibleTo`. A regression that
dropped `Probe = new S7ProbeOptions { Enabled = false }` (which would start an
unwanted background probe loop in a one-shot CLI run) or mis-mapped `TimeoutMs` would
not be caught. `ParseValue` is also missing an explicit overflow-edge test (e.g.
`Byte` value `256`) — the current `ParseValue_Byte_ranges` test stops at `255`.
**Recommendation:** Add a `BuildOptions` test (assert `Probe.Enabled == false`,
`Timeout` matches `TimeoutMs`, and host/port/CPU/rack/slot flow through). Add an
overflow case to the `ParseValue` numeric tests once Driver.S7.Cli-001 is resolved so
the test asserts the wrapped `CommandException`.
**Resolution:** _(open)_
### Driver.S7.Cli-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli/Commands/SubscribeCommand.cs:45-51` |
| Status | Open |
**Description:** The Modbus CLI `SubscribeCommand` carries an explanatory comment on
the `OnDataChange` handler ("Route every data-change event to the CliFx console (not
System.Console — the analyzer flags it + IConsole is the testable abstraction)"). The S7
`SubscribeCommand` is a near-verbatim copy but dropped that comment, so the non-obvious
reason the handler uses `console.Output.WriteLine` (synchronous, on a driver background
thread) instead of `System.Console` or the `async` `WriteLineAsync` is undocumented here.
Minor, but the rationale is worth keeping consistent across the CLI family.
**Recommendation:** Re-add the one-line comment from the Modbus `SubscribeCommand` so
the S7 copy explains why the event handler writes via `console.Output` synchronously.
**Resolution:** _(open)_
+408
@@ -0,0 +1,408 @@
# Code Review — Driver.S7
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.S7` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.S7-001, Driver.S7-002, Driver.S7-003 |
| 2 | OtOpcUa conventions | Driver.S7-004, Driver.S7-005 |
| 3 | Concurrency & thread safety | Driver.S7-006 |
| 4 | Error handling & resilience | Driver.S7-007, Driver.S7-008, Driver.S7-009 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Driver.S7-010 |
| 7 | Design-document adherence | Driver.S7-011, Driver.S7-012 |
| 8 | Code organization & conventions | Driver.S7-013 |
| 9 | Testing coverage | Driver.S7-014 |
| 10 | Documentation & comments | Driver.S7-012 (shared) |
## Findings
### Driver.S7-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `S7AddressParser.cs:93`, `S7Driver.cs:231` |
| Status | Resolved |
**Description:** S7AddressParser.Parse accepts Timer (T0) and Counter (C0)
addresses and the test suite asserts they parse successfully, but the read path
cannot serve them. Two problems compound: (1) ReadOneAsync's type-mapping switch
(lines 231-250) has no case for any Timer/Counter combination, so a Timer/Counter
tag falls through to the default arm and throws InvalidDataException with a
misleading "type-mismatch" message on every read; (2) the read is issued via
plc.ReadAsync(tag.Address, ...) passing the raw address string, and S7.Net's
string-based parser does not understand T{n}/C{n} syntax. A tag configured with a
timer or counter address passes init-time parsing (the docstring promises config
typos fail fast at init) and then fails on every read - exactly the
un-diagnosable failure mode the fail-fast parse was meant to prevent.
**Recommendation:** Either drop Timer/Counter from S7AddressParser and S7Area
until they are wired through to S7.Net, or implement the Timer/Counter read path.
If kept, reject Timer/Counter tags at InitializeAsync with a clear "not yet
supported" error rather than letting them parse clean.
**Resolution:** Resolved 2026-05-22 — `InitializeAsync` now runs
`RejectUnsupportedTagAddresses`, which throws `NotSupportedException` with a
clear "not yet supported" message (echoing the tag name + address) for any tag
whose address parses as a Timer or Counter, so the bad config fails fast at init
rather than throwing a misleading type-mismatch on every read.
### Driver.S7-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `S7Driver.cs:350` |
| Status | Resolved |
**Description:** MapDataType collapses S7DataType.UInt32 to DriverDataType.Int32.
UInt32 values above int.MaxValue (2^31-1) wrap to negative when surfaced to the
OPC UA client, silently corrupting the value. The inline comment only flags
Int64/UInt64 as "widens; lossy" but UInt32 to Int32 is equally lossy and is not
called out.
**Recommendation:** Map UInt32/UInt16 to a DriverDataType wide enough to hold the
unsigned range, or add the missing unsigned DriverDataType members. At minimum
correct the comment so the lossiness of UInt32 is documented.
**Resolution:** Resolved 2026-05-22 — added an inline comment to the `MapDataType` switch explicitly documenting the UInt32→Int32 lossiness (same limitation as Int64/UInt64, tracked for a follow-up PR adding unsigned DriverDataType members); the code mapping is unchanged pending that follow-up.
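The wrap-around is straightforward to demonstrate. The helper names below are hypothetical and exist only to show the arithmetic; they are not part of `MapDataType`.

```csharp
public static class UnsignedWrapDemo
{
    // What a UInt32 -> Int32 mapping does to the bit pattern: values above
    // int.MaxValue reinterpret as negative numbers.
    public static int AsInt32(uint raw) => unchecked((int)raw);

    // A 64-bit signed target holds the full unsigned 32-bit range losslessly.
    public static long AsInt64(uint raw) => raw;
}
```

For example, `AsInt32(3_000_000_000u)` yields `-1_294_967_296` (3,000,000,000 − 2^32), while `AsInt64(3_000_000_000u)` preserves the value.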
### Driver.S7-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `S7Driver.cs:172`, `S7Driver.cs:255` |
| Status | Open |
**Description:** ReadAsync and WriteAsync dereference fullReferences.Count /
writes.Count with no null guard. A null argument throws NullReferenceException
rather than ArgumentNullException, and the NRE escapes before the _gate is taken
so it is not wrapped in a per-item status. DiscoverAsync correctly uses
ArgumentNullException.ThrowIfNull(builder); the read/write entry points are
inconsistent with it.
**Recommendation:** Add ArgumentNullException.ThrowIfNull for the list parameters
at the top of ReadAsync and WriteAsync.
**Resolution:** _(open)_
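A minimal sketch of the guard, assuming a list parameter named `fullReferences` as in the description. `ReadGuardDemo.CountReferences` is a hypothetical method used only to show the exception type and parameter name.

```csharp
using System;
using System.Collections.Generic;

public static class ReadGuardDemo
{
    // Hypothetical entry point; the real ReadAsync has a different signature.
    public static int CountReferences(IReadOnlyList<string> fullReferences)
    {
        // Throws ArgumentNullException naming the parameter instead of a bare NRE.
        ArgumentNullException.ThrowIfNull(fullReferences);
        return fullReferences.Count;
    }
}
```

The resulting exception carries `ParamName == "fullReferences"`, which makes the caller's mistake obvious in a log.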
### Driver.S7-004
| Field | Value |
|---|---|
| Severity | Medium |
| Category | OtOpcUa conventions |
| Location | `S7Driver.cs` (whole file) |
| Status | Resolved |
**Description:** The driver performs no logging. CLAUDE.md Library Preferences
mandate Serilog with a rolling daily file sink. Every error path is an empty
catch block (Initialize cleanup line 130, ShutdownAsync lines 142/149/153,
ProbeLoop line 483, PollLoop lines 396/406, Dispose line 511). Connection faults,
probe transitions, PUT/GET-disabled config errors, and poll-loop exceptions are
all silently swallowed. An operator has only the DriverHealth.LastError string
and no event trail to diagnose an intermittent PLC.
**Recommendation:** Inject an ILogger/ILoggerFactory and log connect
success/failure, probe Running/Stopped transitions, PUT/GET-disabled detection,
and swallowed poll-loop / shutdown exceptions.
**Resolution:** Resolved 2026-05-22 — injected `ILogger<S7Driver>` (optional, defaults to `NullLogger`) into the primary constructor; added structured log calls for connect success/failure, probe Running/Stopped transitions, and swallowed poll-loop exceptions, giving operators an event trail via Serilog.
### Driver.S7-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `S7Driver.cs:33`, `S7Driver.cs:433` |
| Status | Open |
**Description:** System.Collections.Concurrent.ConcurrentDictionary is written
out with a fully-qualified namespace at the field declarations instead of a
using System.Collections.Concurrent directive. ImplicitUsings is enabled and the
rest of the codebase relies on using directives; the inline FQN is inconsistent
with house style. Similar redundant global::S7.Net.* qualifiers appear throughout
S7Driver.cs despite the file-top using S7.Net.
**Recommendation:** Add using System.Collections.Concurrent and drop the
redundant global::S7.Net. qualifiers where using S7.Net already covers them.
**Resolution:** _(open)_
### Driver.S7-006
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `S7Driver.cs:140`, `S7Driver.cs:457`, `S7Driver.cs:506` |
| Status | Resolved |
**Description:** Disposal races with the in-flight probe / poll tasks.
ShutdownAsync calls _probeCts.Cancel() and cancels each subscription CTS, but it
does not await the ProbeLoopAsync / PollLoopAsync tasks (they are fire-and-forget
Task.Run with the task handle discarded). DisposeAsync then calls ShutdownAsync
followed immediately by _gate.Dispose(). A probe or poll iteration that is
between _gate.WaitAsync and _gate.Release() when cancellation fires will call
Release() (line 479) or have WaitAsync observe a disposed semaphore -
ObjectDisposedException. The probe loop's broad catch swallows it, but the
disposal-ordering bug is real: the semaphore can be disposed while a worker still
holds or is waiting on it. The same applies to _probeCts.Dispose() (line 143)
running while ProbeLoopAsync may still touch the linked token.
**Recommendation:** Track the probe and poll Task handles, and in ShutdownAsync
(or DisposeAsync) await Task.WhenAll(...) with a bounded timeout after cancelling,
before disposing _gate and the CTS objects.
**Resolution:** Resolved 2026-05-22 — the probe loop now stores its Task in
`_probeTask` and each subscription records its poll Task in `SubscriptionState.PollTask`.
`ShutdownAsync` cancels every CTS, awaits `Task.WhenAll` of those handles with a
bounded 5 s `DrainTimeout`, and only then disposes `_probeCts`, the subscription
CTSs, and (via `DisposeAsync`) `_gate` — so no loop can touch a disposed
semaphore. `Task.Run` is now passed `CancellationToken.None` so the handle is
always awaitable even if the token is already cancelled.
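The cancel, drain, dispose ordering described in the resolution can be sketched as below. `DrainingWorker` is a hypothetical class; only the 5-second `DrainTimeout` and the `CancellationToken.None` argument to `Task.Run` are taken from the text, and `Task.WaitAsync(TimeSpan)` assumes .NET 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical worker illustrating cancel -> drain -> dispose ordering.
public sealed class DrainingWorker : IAsyncDisposable
{
    private static readonly TimeSpan DrainTimeout = TimeSpan.FromSeconds(5);
    private readonly SemaphoreSlim _gate = new(1, 1);
    private readonly CancellationTokenSource _cts = new();
    private readonly Task _loop;

    public DrainingWorker()
    {
        // CancellationToken.None keeps the handle awaitable even if _cts is already cancelled.
        _loop = Task.Run(() => LoopAsync(_cts.Token), CancellationToken.None);
    }

    public bool LoopCompleted => _loop.IsCompleted;

    private async Task LoopAsync(CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
            {
                await _gate.WaitAsync(ct);
                try { await Task.Delay(10, ct); }
                finally { _gate.Release(); }
            }
        }
        catch (OperationCanceledException) { /* normal shutdown */ }
    }

    public async ValueTask DisposeAsync()
    {
        _cts.Cancel();                                  // 1. signal the loop
        try { await _loop.WaitAsync(DrainTimeout); }    // 2. wait for it to exit, bounded
        catch (TimeoutException) { /* loop is stuck; fall through and dispose */ }
        _gate.Dispose();                                // 3. only now release shared resources
        _cts.Dispose();
    }
}
```

The bounded wait is a trade-off: it keeps shutdown from hanging on a stuck loop, at the cost of disposing underneath it in that rare case.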
### Driver.S7-007
| Field | Value |
|---|---|
| Severity | High |
| Category | Error handling & resilience |
| Location | `S7Driver.cs:200`, `S7DriverOptions.cs:13`, `docs/v2/driver-specs.md:434` |
| Status | Resolved |
**Description:** PUT/GET-disabled handling contradicts the design and the
module's own docstring. driver-specs.md section 5 (line 434) and the
S7DriverOptions class remark both state PUT/GET-disabled must be mapped to
BadNotSupported and surfaced as a configuration alert, not a transient fault,
because blind retry is wasted effort. The actual code (ReadAsync, lines 200-208)
catches every S7.Net.PlcException and maps it to StatusBadDeviceFailure, then
sets health to Degraded. Consequences: (1) a genuinely transient PlcException
(e.g. CPU briefly in STOP) is reported identically to a permanent PUT/GET
misconfiguration - the operator cannot tell a config problem from a transient
one, which is the exact distinction the spec demands; (2) the promised
BadNotSupported status code is never produced, so the S7DriverOptions docstring
is now false.
**Recommendation:** Inspect PlcException.ErrorCode and map the
PUT/GET-disabled / access-denied code to BadNotSupported with a distinct
config-alert health state; keep BadDeviceFailure/Degraded only for genuine
device-fault error codes.
**Resolution:** Resolved 2026-05-22 — `ReadAsync` / `WriteAsync` now split the
`PlcException` catch via an `IsAccessDenied` filter. S7.Net exposes no typed
error code for the S7 `AccessingObjectNotAllowed` status (its
`ValidateResponseCode` throws a plain `Exception` wrapped as the inner exception
of a `PlcException`), so `IsAccessDenied` walks the inner-exception chain for the
"not allowed" marker. A PUT/GET-disabled fault now maps to `BadNotSupported` and
sets health to `Faulted` with a config-alert message pointing operators at the
TIA Portal PUT/GET toggle; a genuine device fault still maps to
`BadDeviceFailure`/`Degraded`.
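A rough sketch of an inner-exception walk of the kind the resolution describes. The exact marker string and comparison used by the real `IsAccessDenied` are not given in the review, so the case-insensitive "not allowed" match here is an assumption, and plain `Exception` types stand in for `PlcException`.

```csharp
using System;

public static class AccessDeniedDetector
{
    // Assumed marker text; the review only says the chain is searched for "not allowed".
    private const string Marker = "not allowed";

    public static bool IsAccessDenied(Exception ex)
    {
        // Walk from the outer exception down through every InnerException.
        for (Exception? current = ex; current is not null; current = current.InnerException)
        {
            if (current.Message.Contains(Marker, StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }
}
```

In a catch ladder this would typically be used as an exception filter, for example `catch (Exception ex) when (AccessDeniedDetector.IsAccessDenied(ex))`, placed ahead of the general device-fault catch. Matching on message text is brittle if the library rewords its errors, so a regression test that pins the marker is worth having.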
### Driver.S7-008
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `S7Driver.cs:286` |
| Status | Resolved |
**Description:** WriteAsync's catch ladder is coarser than ReadAsync and loses
information. The generic catch (Exception) maps everything - socket errors,
timeouts, OverflowException from Convert.ToInt16 of an out-of-range value,
NullReferenceException from Convert.ToBoolean(null) - to StatusBadInternalError.
A genuine transport fault during a write is reported to the client as an internal
error rather than BadCommunicationError, and unlike ReadAsync the write path never
updates _health on failure, so a PLC that is down stays Healthy in the dashboard
as long as only writes are attempted. OperationCanceledException is also caught
and turned into a status code rather than propagating.
**Recommendation:** Mirror the ReadAsync catch structure: let
OperationCanceledException propagate, map socket/timeout faults to
BadCommunicationError, map value-conversion failures to a distinct out-of-range
status, and update _health to Degraded on transport failures.
**Resolution:** Resolved 2026-05-22 — restructured `WriteAsync` catch ladder: `OperationCanceledException` now re-throws, genuine `PlcException` transport faults map to `BadDeviceFailure`/`Degraded`, `NotSupportedException` maps to `BadNotSupported`, the `IsAccessDenied` PlcException path maps to `BadNotSupported`/`Faulted`, and the catch-all maps to `BadCommunicationError` with a health update — matching `ReadAsync`'s structure.
### Driver.S7-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `S7Driver.cs:392` |
| Status | Open |
**Description:** The subscription poll loop never reflects sustained polling
failure anywhere an operator can see it. PollLoopAsync swallows every
non-cancellation exception with an empty catch and the comment claims "the health
surface reflects it" - but a poll failure routes through ReadAsync, which only
sets DriverState.Degraded when the per-tag read throws inside the gate;
exceptions thrown before that (e.g. RequirePlc() when Plc is null after a drop)
bypass the health update entirely. A subscription against an uninitialized or
dropped driver loops forever silently, with no backoff - re-polling every
Interval indefinitely on a hard failure.
**Recommendation:** Have the poll loop update _health on repeated failure and
apply a capped backoff after consecutive errors; at minimum log the swallowed
exception (see Driver.S7-004).
**Resolution:** _(open)_
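A capped backoff of the kind recommended above could be computed as follows. This is a sketch under assumptions: `PollBackoff.NextDelay`, the doubling policy, and the cap parameter are all hypothetical and not taken from the driver.

```csharp
using System;

public static class PollBackoff
{
    // Returns the delay before the next poll: the normal interval when healthy,
    // doubling per consecutive failure, never exceeding the cap.
    public static TimeSpan NextDelay(TimeSpan interval, int consecutiveFailures, TimeSpan cap)
    {
        if (consecutiveFailures <= 0) return interval;

        // Bound the shift so the multiplier cannot overflow.
        int shift = Math.Min(consecutiveFailures, 16);
        double scaledMs = interval.TotalMilliseconds * (1L << shift);

        return TimeSpan.FromMilliseconds(Math.Min(scaledMs, cap.TotalMilliseconds));
    }
}
```

With a 1-second interval and a 30-second cap, three consecutive failures give an 8-second delay and ten failures give the 30-second cap. Resetting the failure counter on the first successful poll restores the configured interval.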
### Driver.S7-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `S7Driver.cs:504` |
| Status | Open |
**Description:** Dispose() is implemented as
DisposeAsync().AsTask().GetAwaiter().GetResult() - sync-over-async. Inside the
generic host this is currently safe (no captured SynchronizationContext), but it
is a known deadlock pattern. The only async work behind DisposeAsync is
ShutdownAsync, which does nothing async (returns Task.CompletedTask). The
blocking wrap is unnecessary risk.
**Recommendation:** Since ShutdownAsync is effectively synchronous, have Dispose()
perform the teardown directly (cancel CTSs, close Plc, dispose _gate) without
round-tripping through the async path.
**Resolution:** _(open)_
### Driver.S7-011
| Field | Value |
|---|---|
| Severity | High |
| Category | Design-document adherence |
| Location | `S7Driver.cs:82`, `S7Driver.cs:134`, `IDriver.cs:24` |
| Status | Resolved |
**Description:** S7Driver ignores the driverConfigJson parameter on both
InitializeAsync and ReinitializeAsync. The IDriver contract states InitializeAsync
initializes the driver "from its DriverConfig JSON" and ReinitializeAsync "applies
a config change in place". All configuration is instead captured in the
constructor (S7DriverOptions options), and ReinitializeAsync simply calls
ShutdownAsync then InitializeAsync with the same options object. Consequently a
config change delivered to ReinitializeAsync (the documented IGenerationApplier
recovery path per driver-stability.md) is silently discarded - the driver
re-opens with the old config. This breaks the only Core-initiated in-process
recovery path.
**Recommendation:** Either re-parse driverConfigJson inside
InitializeAsync/ReinitializeAsync and rebuild _options from it, or document
explicitly that S7 reconfiguration requires instance recreation and have
ReinitializeAsync signal that the passed JSON is unused so the contract mismatch
is visible.
**Resolution:** Resolved 2026-05-22 — config parsing was factored out of the
factory into `S7DriverFactoryExtensions.ParseOptions`. `InitializeAsync` (and
therefore `ReinitializeAsync`, which delegates to it) now re-parses
`driverConfigJson` and rebuilds `_options` from it whenever the document carries
a real body, so a config change delivered through `ReinitializeAsync` — the only
Core-initiated in-process recovery path — is honoured. An empty / placeholder
document (`""`, `{}`, `[]`) keeps the constructor-supplied options so existing
lifecycle unit tests that pass `"{}"` are unaffected.
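The "real body" check might be implemented along these lines. This is a hedged sketch using `System.Text.Json`; `ConfigBodyCheck.HasRealBody` is a hypothetical name, and the review does not say how the driver actually distinguishes a placeholder document.

```csharp
using System.Text.Json;

public static class ConfigBodyCheck
{
    // True when the JSON carries at least one property or array element;
    // "", "{}" and "[]" are treated as placeholders.
    public static bool HasRealBody(string? driverConfigJson)
    {
        if (string.IsNullOrWhiteSpace(driverConfigJson)) return false;

        using JsonDocument doc = JsonDocument.Parse(driverConfigJson);
        JsonElement root = doc.RootElement;

        return root.ValueKind switch
        {
            JsonValueKind.Object => root.EnumerateObject().MoveNext(),
            JsonValueKind.Array => root.GetArrayLength() > 0,
            _ => false,
        };
    }
}
```

A caller would re-parse options only when `HasRealBody` returns true and otherwise keep the constructor-supplied options. Malformed JSON still throws from `JsonDocument.Parse`, which is arguably the right fail-fast behaviour for a config change.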
### Driver.S7-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `S7DriverOptions.cs:59`, `S7Driver.cs:457` |
| Status | Resolved |
**Description:** S7ProbeOptions.ProbeAddress is configured (default "MW0"),
documented at length ("the driver runs a tick loop that issues a cheap read
against S7ProbeOptions.ProbeAddress"), surfaced in the factory DTO
(S7ProbeDto.ProbeAddress), and parsed from JSON - but it is never read by any
code. ProbeLoopAsync probes liveness via plc.ReadStatusAsync (CPU status), not via
a read of ProbeAddress. The XML doc on the S7DriverOptions.Probe property and on
ProbeAddress describes behaviour the driver does not implement. An operator who
sets ProbeAddress to a known-good DB word expecting the probe to exercise it will
see no effect.
**Recommendation:** Either make ProbeLoopAsync actually read ProbeAddress
(parsing it once at init and rejecting a bad value early), or delete ProbeAddress
from S7ProbeOptions/S7ProbeDto and correct the XML docs to describe the
ReadStatusAsync-based probe.
**Resolution:** Resolved 2026-05-22 — removed `ProbeAddress` from `S7ProbeOptions` and `S7ProbeDto`; updated the `S7DriverOptions.Probe` XML doc to describe the `ReadStatusAsync`-based probe accurately. Existing configs that set `probeAddress` are silently ignored (unknown JSON fields are tolerated by the deserializer).
### Driver.S7-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `S7DriverOptions.cs:90`, `S7Driver.cs:300` |
| Status | Open |
**Description:** S7TagDefinition.StringLength is a public configured/JSON-bound
parameter (default 254) but is dead: S7DataType.String reads and writes both
throw NotSupportedException ("...land in a follow-up PR"), so StringLength is
never consumed. Likewise S7DataType.Int64, UInt64, Float64, String, and DateTime
are exposed as configurable, browse through MapDataType into real DriverDataType
values, and pass DiscoverAsync - creating address-space nodes - yet every
read/write of them throws NotSupportedException, becoming BadNotSupported. A site
can configure a Float64 tag, see the node appear, and get BadNotSupported on
every access. The scaffold/follow-up-PR split leaks half-implemented types into
the configurable surface.
**Recommendation:** Reject the not-yet-implemented S7DataType values (and
StringLength) at InitializeAsync / factory validation with a clear "not yet
supported" error, so a partially-implemented type cannot be configured into a
live address space.
**Resolution:** _(open)_
### Driver.S7-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/` |
| Status | Resolved |
**Description:** Test coverage has notable gaps for the driver's behavioural
core: (1) no test exercises the ReadOneAsync type-reinterpret switch (Int16 from
ushort, Int32 from uint, Float32 from UInt32 bits) - the most logic-heavy method
in the driver is untested, and the unsigned/signed unchecked casts are
unverified; (2) no test covers a Timer/Counter tag end-to-end, which would have
caught Driver.S7-001; (3) no test covers WriteOneAsync boxing conversions or
the out-of-range Convert failure paths; (4) the read-write tests only cover error
paths (uninitialized, bad address) - the happy path is explicitly deferred to "a
follow-up PR" with no mock S7 server, leaving the entire successful read, write,
poll, and probe-transition surface untested; (5) ReinitializeAsync and the
driverConfigJson-ignored behaviour (Driver.S7-011) have no test.
**Recommendation:** Add unit tests for ReadOneAsync/WriteOneAsync type mapping by
factoring the pure reinterpret/boxing logic out of the PLC round-trip so it is
testable without a live PLC, and add a Timer/Counter rejection test. Track the
live/mock-server happy-path coverage as an explicit follow-up rather than an
open-ended deferral.
**Resolution:** Resolved 2026-05-22 — factored `ReadOneAsync` type-reinterpret into `internal static ReinterpretRawValue` and `WriteOneAsync` boxing into `internal static BoxValueForWrite`; added `S7TypeMappingTests.cs` (26 tests) covering every implemented type round-trip (Bool/Byte/UInt16/Int16/UInt32/Int32/Float32), unsupported-type `NotSupportedException` assertions, and write overflow paths.
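The reinterpret step described in this finding becomes testable without a PLC once it is a pure function. The following is a minimal sketch of that idea, not the driver's actual `ReinterpretRawValue`: the `RawKind` enum, the class name, and the method signature are hypothetical, and only the three reinterpreted types named in the description are modelled.

```csharp
using System;

// Hypothetical stand-in for the driver's data-type enum (illustration only).
public enum RawKind { Int16, Int32, Float32 }

public static class ReinterpretSketch
{
    // Reinterprets an unsigned raw value (boxed ushort or uint) as the configured type.
    // Each arm is cast to object so the switch does not widen short/int to float.
    public static object Reinterpret(RawKind kind, object raw) => kind switch
    {
        RawKind.Int16 => (object)unchecked((short)(ushort)raw),
        RawKind.Int32 => (object)unchecked((int)(uint)raw),
        RawKind.Float32 => (object)BitConverter.Int32BitsToSingle(unchecked((int)(uint)raw)),
        _ => throw new NotSupportedException($"{kind} is not supported by this sketch."),
    };
}
```

Because the function takes and returns plain values, a test can feed it boundary inputs such as `(ushort)0xFFFF` or `0xFFFFFFFFu` and assert on the signed result directly.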
@@ -0,0 +1,202 @@
# Code Review — Driver.TwinCAT.Cli
| Field | Value |
|---|---|
| Module | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Cli` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 7 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.TwinCAT.Cli-001 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Driver.TwinCAT.Cli-002 |
| 4 | Error handling & resilience | Driver.TwinCAT.Cli-003 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | Driver.TwinCAT.Cli-004 |
| 8 | Code organization & conventions | Driver.TwinCAT.Cli-005 |
| 9 | Testing coverage | Driver.TwinCAT.Cli-006 |
| 10 | Documentation & comments | Driver.TwinCAT.Cli-007 |
## Findings
<!-- One ### entry per finding. IDs are <Module>-NNN, sequential within the module,
never reused. Findings are never deleted — close them by changing Status and
completing Resolution. -->
### Driver.TwinCAT.Cli-001
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `TwinCATCommandBase.cs:23-24`, `Commands/SubscribeCommand.cs:23-24`, `Commands/BrowseCommand.cs:21-24` |
| Status | Open |
**Description:** Numeric command options are accepted without range validation. `--timeout-ms`
feeds `Timeout => TimeSpan.FromMilliseconds(TimeoutMs)`; passing `--timeout-ms 0` or a negative
value yields `TimeSpan.Zero`/a negative `TimeSpan`, which is then handed to the driver's
`TwinCATDriverOptions.Timeout` and on to `ITwinCATClient.ConnectAsync`, producing an immediate
failure or undefined behaviour rather than a clear "bad argument" message. The same applies to
`subscribe --interval-ms` (negative -> `TimeSpan.FromMilliseconds(negative)` passed to
`SubscribeAsync`) and `--ams-port` (`AmsPort` accepts negative / out-of-range port numbers,
which only surface later as an opaque transport error). For a commissioning/diagnostic tool the
failure mode should be a readable up-front rejection.
**Recommendation:** Validate the numeric options at the top of each `ExecuteAsync` (or via a
shared helper on `TwinCATCommandBase`) and throw `CliFx.Exceptions.CommandException` with a
clear message when `TimeoutMs <= 0`, `IntervalMs <= 0`, or `AmsPort` falls outside `1..65535`.
**Resolution:** _(open)_
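A shared guard of the kind recommended here might look like the sketch below. It is an assumption-laden illustration: the class and method names are hypothetical, and it throws `ArgumentOutOfRangeException` only to stay dependency-free, whereas the CLI itself would throw `CliFx.Exceptions.CommandException` as the recommendation states.

```csharp
using System;

public static class OptionGuards
{
    // Rejects bad numeric options up front with a readable message, instead of
    // letting them surface later as an opaque transport error.
    public static void Validate(int timeoutMs, int intervalMs, int amsPort)
    {
        if (timeoutMs <= 0)
            throw new ArgumentOutOfRangeException(
                nameof(timeoutMs), timeoutMs, "--timeout-ms must be greater than 0.");
        if (intervalMs <= 0)
            throw new ArgumentOutOfRangeException(
                nameof(intervalMs), intervalMs, "--interval-ms must be greater than 0.");
        if (amsPort < 1 || amsPort > 65535)
            throw new ArgumentOutOfRangeException(
                nameof(amsPort), amsPort, "--ams-port must be in the range 1..65535.");
    }
}
```

Calling the guard as the first statement of each `ExecuteAsync` keeps the failure ahead of any connection attempt.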
### Driver.TwinCAT.Cli-002
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `Commands/SubscribeCommand.cs:46-58` |
| Status | Open |
**Description:** The `OnDataChange` handler calls `console.Output.WriteLine(line)` synchronously.
In native ADS-notification mode the event is raised from the `Beckhoff.TwinCAT.Ads`
notification callback thread (see `TwinCATDriver.SubscribeAsync`, which invokes `OnDataChange`
from the ADS `AddNotificationAsync` callback). That write can interleave with the main thread's
`console.Output.WriteLineAsync(...)` "Subscribed to ..." banner and with subsequent change
events if the PLC pushes faster than a single write completes. A `TextWriter` is not guaranteed
thread-safe, so output lines can be garbled — undesirable for a tool whose stated purpose is
producing clean screen-recorded bug-report timelines. The same pattern exists in the other
driver CLIs (S7/Modbus), but those go through `PollGroupEngine`, whose change callbacks are
serialised on one poll loop; the TwinCAT native path has no such serialisation.
**Recommendation:** Serialise console writes from the change handler, e.g. wrap the
`WriteLine` body in a `lock` on a private object that the banner write also takes, or use
`TextWriter.Synchronized`. At minimum, gate it so the banner is written before the
subscription is registered (it already is) and lock the per-event writes against each other.
**Resolution:** _(open)_
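One way to serialise the writes is a small wrapper that takes a private lock around every line. This is a sketch under assumptions: `SerialisedOutput` is a hypothetical name, and it wraps a plain `TextWriter` rather than the CliFx console abstraction used by the command.

```csharp
using System;
using System.IO;

public sealed class SerialisedOutput
{
    private readonly object _gate = new();
    private readonly TextWriter _writer;

    public SerialisedOutput(TextWriter writer) => _writer = writer;

    // Both the banner and the per-event handler would call this, so no two
    // lines can interleave regardless of which thread raises the event.
    public void WriteLine(string line)
    {
        lock (_gate)
        {
            _writer.WriteLine(line);
        }
    }
}
```

`TextWriter.Synchronized(writer)` gives a similar guarantee per call with less code; the explicit lock is mainly useful if a single event ever needs to emit several writes as one unit.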
### Driver.TwinCAT.Cli-003
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `Commands/SubscribeCommand.cs:56-58` |
| Status | Open |
**Description:** The subscribe banner reports the mechanism purely from the `--poll-only` flag
(`var mode = PollOnly ? "polling" : "ADS notification"`). The doc (`docs/Driver.TwinCAT.Cli.md`)
states the banner "announces which mechanism is in play". The CLI always declares exactly one
tag, so a registration that produces zero notification handles is unlikely, but
`TwinCATDriver.SubscribeAsync` silently `continue`s past any reference not found in
`_tagsByName`/`_devices`
and a poll-mode fallback inside the driver is also possible in principle. The banner therefore
asserts a mechanism it has not actually confirmed. It is informational only, so the impact is
limited to a misleading diagnostic line.
**Recommendation:** Either derive the banner text from observable subscription state (e.g. the
returned `ISubscriptionHandle.DiagnosticId`, which is `twincat-native-sub-*` for the native
path vs the `PollGroupEngine` handle for poll mode) or soften the wording to "(requested:
ADS notification)" so it does not over-claim.
**Resolution:** _(open)_
### Driver.TwinCAT.Cli-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `TwinCATCommandBase.cs:26-29`, `Commands/BrowseCommand.cs` |
| Status | Open |
**Description:** `--poll-only` is declared on `TwinCATCommandBase`, so it is inherited by
`browse`. `BrowseCommand` only ever calls `DiscoverAsync` — it never subscribes — so
`UseNativeNotifications = !PollOnly` has no observable effect on a browse run. The flag still
appears in `otopcua-twincat-cli browse --help`, implying it changes browse behaviour when it
does not. `docs/Driver.TwinCAT.Cli.md` documents `--poll-only` only under `subscribe` and lists
no per-command flags for `browse` beyond `--prefix`/`--max`, so the help text and the docs
disagree.
**Recommendation:** Move `--poll-only` onto an intermediate base shared only by
`probe`/`read`/`subscribe` (arguably the flag is relevant only to notification-capable
commands), or override/hide it for `browse`. Alternatively, document explicitly that the flag
is a no-op for `browse`.
**Resolution:** _(open)_
### Driver.TwinCAT.Cli-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `Commands/ProbeCommand.cs:23`, `Commands/ReadCommand.cs:20`, `Commands/WriteCommand.cs:20`, `Commands/SubscribeCommand.cs:18` |
| Status | Open |
**Description:** The `--type` option is declared with the short alias `-t` on `read`, `write`,
and `subscribe`, but `ProbeCommand` declares `[CommandOption("type", ...)]` with no short
alias. An operator who has internalised `-t` from the other three verbs will get a CliFx
"unknown option" error on `probe -t Bool`. The inconsistency is gratuitous — all four commands
take the same `TwinCATDataType` option.
**Recommendation:** Add the `'t'` short alias to `ProbeCommand`'s `--type` option to match the
other three commands.
**Resolution:** _(open)_
### Driver.TwinCAT.Cli-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Cli.Tests/WriteCommandParseValueTests.cs` |
| Status | Open |
**Description:** The only test file covers `WriteCommand.ParseValue` and
`ReadCommand.SynthesiseTagName`. Other deterministic, router-independent logic is untested:
`TwinCATCommandBase.Gateway` (the `ads://{netId}:{port}` string the driver's
`TwinCATAmsAddress.TryParse` consumes — a regression here breaks every command), `BuildOptions`
(tag wiring, `UseNativeNotifications` toggle, `Probe.Enabled = false`), and `BrowseCommand`'s
`CollectingAddressSpaceBuilder` with its `--prefix`/`--max` filtering and the `RO`/`RW` access
derivation. These are pure and can be unit-tested without an AMS router. `InternalsVisibleTo`
is already wired for the test assembly. Note also the stale empty sibling test directory
`tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Cli.Tests` (no project, no files) — out of this
module's scope but worth flagging to whoever owns the test tree.
**Recommendation:** Add unit tests for `Gateway`/`DriverInstanceId` string composition, for
`BuildOptions` field wiring, and for the `CollectingAddressSpaceBuilder` prefix/max filtering
and access-classification logic.
**Resolution:** _(open)_
### Driver.TwinCAT.Cli-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `TwinCATCommandBase.cs:31-36` |
| Status | Open |
**Description:** The `Timeout` override has an empty `init` accessor with the comment
`/* driven by TimeoutMs */`. Because the base `DriverCommandBase.Timeout` is declared
`abstract { get; init; }`, the override must supply an `init`, but here it silently discards
any value. This is intentional, yet the empty body invites a future maintainer to "fix" it by
adding a backing field, which would then diverge from `TimeoutMs`. The XML `<inheritdoc/>`
gives no hint of the deliberate no-op. This is a maintainability/clarity nit, not a bug.
**Recommendation:** Replace `<inheritdoc/>` with a short summary stating that `Timeout` is a
computed projection of `--timeout-ms` and the `init` accessor is intentionally a no-op, so the
design intent survives refactoring.
**Resolution:** _(open)_
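The documented shape could look roughly like the sketch below. The class names and the 2000 ms default are hypothetical; only the `abstract { get; init; }` base declaration and the `/* driven by TimeoutMs */` comment come from the finding.

```csharp
using System;

// Hypothetical stand-in for the shared base class described in the finding.
public abstract class CommandBaseSketch
{
    public abstract TimeSpan Timeout { get; init; }
}

public class TwinCatCommandSketch : CommandBaseSketch
{
    // Illustrative default; the real option's default is not shown in this review.
    public int TimeoutMs { get; init; } = 2000;

    /// <summary>
    /// Computed projection of <c>--timeout-ms</c>. The <c>init</c> accessor is
    /// intentionally a no-op: the base contract requires one, but the value always
    /// comes from <see cref="TimeoutMs"/>. Do not add a backing field.
    /// </summary>
    public override TimeSpan Timeout
    {
        get => TimeSpan.FromMilliseconds(TimeoutMs);
        init { /* driven by TimeoutMs */ }
    }
}
```

With the summary in place, an object initializer that sets both properties still reads back the `TimeoutMs`-derived value, and the doc comment says why.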
@@ -0,0 +1,426 @@
# Code Review — Driver.TwinCAT
| Field | Value |
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 5 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.TwinCAT-001, Driver.TwinCAT-002, Driver.TwinCAT-003, Driver.TwinCAT-004 |
| 2 | OtOpcUa conventions | Driver.TwinCAT-005, Driver.TwinCAT-006 |
| 3 | Concurrency & thread safety | Driver.TwinCAT-007, Driver.TwinCAT-008, Driver.TwinCAT-009 |
| 4 | Error handling & resilience | Driver.TwinCAT-010, Driver.TwinCAT-011 |
| 5 | Security | No issues found |
| 6 | Performance & resource management | Driver.TwinCAT-012 |
| 7 | Design-document adherence | Driver.TwinCAT-013, Driver.TwinCAT-014 |
| 8 | Code organization & conventions | Driver.TwinCAT-015 |
| 9 | Testing coverage | Driver.TwinCAT-016 |
| 10 | Documentation & comments | Driver.TwinCAT-004 (data-type comment), Driver.TwinCAT-014 |
## Findings
### Driver.TwinCAT-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `TwinCATDriver.cs:41-78` |
| Status | Resolved |
**Description:** `InitializeAsync` and `ReinitializeAsync` both ignore their `driverConfigJson`
parameter entirely. `InitializeAsync` builds device/tag state exclusively from `_options`,
captured once in the constructor. `ReinitializeAsync` calls `ShutdownAsync` then
`InitializeAsync(driverConfigJson, ...)` — but since `InitializeAsync` never reads
`driverConfigJson`, a `ReinitializeAsync` with a changed config silently re-applies the
original constructor-time options. Per `IDriver.ReinitializeAsync` docs and
`docs/v2/driver-stability.md` section "In-process only (Tier A/B)", `Reinitialize` is the only
Core-initiated path to apply a new config generation without a process restart. As written,
config changes (added/removed devices, tags, probe settings) to a TwinCAT driver instance
are never picked up at runtime.
**Recommendation:** Parse `driverConfigJson` in `InitializeAsync` (reuse
`TwinCATDriverFactoryExtensions` DTO + option-builder logic — extract it to a shared static
parser) and assign the resulting options to a mutable field, rather than relying on the
constructor-captured `_options`. Alternatively, document explicitly that the constructor is
the sole config source and have the Core recreate the driver instance on config change.
**Resolution:** Resolved 2026-05-22 — extracted the DTO→options parse into a shared TwinCATDriverFactoryExtensions.ParseOptions; InitializeAsync re-parses driverConfigJson into a now-mutable _options field, so ReinitializeAsync applies a changed config generation.
### Driver.TwinCAT-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `TwinCATDataType.cs:34-48`, `AdsTwinCATClient.cs:264-281` |
| Status | Resolved |
**Description:** `TwinCATDataTypeExtensions.ToDriverDataType` maps `LInt` and `ULInt` (signed/
unsigned 64-bit) to `DriverDataType.Int32` (comment: "matches Int64 gap"). The address-space
layer therefore creates a 32-bit OPC UA node for a 64-bit PLC value. Meanwhile
`AdsTwinCATClient.MapToClrType` reads `LInt`/`ULInt` as `long`/`ulong` (64-bit), so the read
path returns a boxed `long`/`ulong` into a node typed Int32. `DriverDataType` already has
`Int64` and `UInt64` members (`DriverDataType.cs:16-19`), so the "gap" the comment refers to does
not exist. Values above `int.MaxValue` are silently truncated or produce a type mismatch at
the OPC UA encode layer; `UDInt` is also folded into `Int32`, so unsigned 32-bit values in
the range 0x80000000 to 0xFFFFFFFF surface as negative.
**Recommendation:** Map `LInt` to `Int64`, `ULInt` to `UInt64`, `UDInt` to `UInt32`, `UInt`
to `UInt16`, and `USInt`/`SInt` to their natural widths. Remove the stale "Int64 gap" comment.
**Resolution:** Resolved 2026-05-22 — ToDriverDataType now maps LInt→Int64, ULInt→UInt64, UDInt→UInt32, UInt/USInt→UInt16, Int/SInt→Int16, and the IEC time types→UInt32; removed the stale Int64-gap comment.
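The width-preserving mapping recorded in the resolution can be sketched as a switch over the integer types. Both enums below are hypothetical stand-ins for `TwinCATDataType` and `DriverDataType`, and the `DInt` to `Int32` arm is an assumption, since the resolution does not list it.

```csharp
using System;

// Hypothetical stand-ins for TwinCATDataType / DriverDataType (integer types only).
public enum PlcType { SInt, USInt, Int, UInt, DInt, UDInt, LInt, ULInt }
public enum NodeType { Int16, UInt16, Int32, UInt32, Int64, UInt64 }

public static class TypeMapSketch
{
    public static NodeType ToNodeType(PlcType t) => t switch
    {
        PlcType.SInt or PlcType.Int => NodeType.Int16,
        PlcType.USInt or PlcType.UInt => NodeType.UInt16,
        PlcType.DInt => NodeType.Int32,   // assumed; not stated in the resolution
        PlcType.UDInt => NodeType.UInt32,
        PlcType.LInt => NodeType.Int64,
        PlcType.ULInt => NodeType.UInt64,
        _ => throw new ArgumentOutOfRangeException(nameof(t), t, "Unmapped PLC type."),
    };

    // Shows the original defect: an unsigned 32-bit value forced into Int32
    // turns negative once it reaches 0x80000000.
    public static int FoldIntoInt32(uint value) => unchecked((int)value);
}
```

`FoldIntoInt32(0x80000000u)` returns `int.MinValue`, which is the "surfaces as negative" behaviour the description calls out for `UDInt`.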
### Driver.TwinCAT-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `AdsTwinCATClient.cs:264-281`, `283-300` |
| Status | Resolved |
**Description:** `MapToClrType` has a `_ => typeof(int)` fallthrough and `ConvertForWrite` has
a `_ => throw NotSupportedException` fallthrough. `TwinCATDataType.Structure` is a declared
enum member, and a config-supplied tag can carry `DataType: "Structure"` because `ParseEnum`
in the factory accepts any enum name case-insensitively. A `Structure` tag therefore reads as
a 4-byte `int` against whatever the symbol actually is (a UDT blob) — a garbage/out-of-bounds
read rather than a clean rejection — while a write fails late with `NotSupportedException`.
Discovery `ToDriverDataType` maps `Structure` to `String`, compounding the inconsistency.
**Recommendation:** Reject `Structure`-typed pre-declared tags at `InitializeAsync` /
`TwinCATDriverFactoryExtensions.BuildTag` time with a clear error — the driver atomic surface
does not support UDT tags, and `BrowseSymbolsAsync` already correctly yields
`DataType = null` for them.
**Resolution:** Resolved 2026-05-22 — `BuildTag` now parses the `DataType` field first and rejects `TwinCATDataType.Structure` with an `InvalidOperationException` that names the tag and explains the limitation; configuration-time failure replaces the previous silent garbage read or late `NotSupportedException`.
### Driver.TwinCAT-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `TwinCATDataType.cs:24-27` |
| Status | Open |
**Description:** The inline comments for the IEC time types are inaccurate. TwinCAT `TIME` is
a duration (32-bit, milliseconds) — not "ms since epoch of day". `DATE` is stored as seconds
since 1970-01-01 (truncated to a day boundary), not "days since 1970-01-01". These types are
also all read/written as raw `uint` and mapped to `DriverDataType.Int32` — the operator sees
a raw counter, not a usable date/duration. The inaccurate comments will mislead the next
implementer who tries to add proper conversion.
**Recommendation:** Correct the comments to match the TwinCAT/IEC 61131-3 representation. If
date/time semantics are intended to be exposed properly, track a follow-up to decode them to
`DriverDataType.DateTime`; otherwise document that they surface as raw counters.
**Resolution:** _(open)_
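If the follow-up to decode these types is taken, the conversion from the raw `uint` is small. The sketch below follows the representations stated in the description (`TIME` as a millisecond duration, `DATE` as seconds since 1970-01-01); the helper names are hypothetical, and treating the epoch as UTC is an assumption that the driver would need to confirm.

```csharp
using System;

public static class IecTimeSketch
{
    // TIME: 32-bit duration in milliseconds.
    public static TimeSpan DecodeTime(uint raw) => TimeSpan.FromMilliseconds(raw);

    // DATE: seconds since 1970-01-01, truncated to a day boundary.
    // Assumption: the epoch is interpreted as UTC.
    public static DateTime DecodeDate(uint raw) =>
        DateTimeOffset.FromUnixTimeSeconds(raw).UtcDateTime.Date;
}
```

For example, a raw `TIME` of 1500 decodes to 1.5 seconds, and a raw `DATE` of 86400 decodes to 1970-01-02.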
### Driver.TwinCAT-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | OtOpcUa conventions |
| Location | `TwinCATDriver.cs` (whole file), `AdsTwinCATClient.cs` (whole file) |
| Status | Resolved |
**Description:** The driver performs no logging. `CLAUDE.md` Library Preferences mandate
Serilog with a rolling daily file sink. Connect failures, ADS error codes, symbol-browse
failures (`DiscoverAsync` swallows them in a bare `catch`), notification-registration
failures, and probe state transitions all vanish into status fields or are swallowed
silently. Operators get a `Degraded` health string with no correlatable log trail.
**Recommendation:** Inject an `ILogger`/Serilog logger and log at minimum: connect
success/failure per device, ADS errors with code, symbol-browse fallback (the `DiscoverAsync`
catch), native-notification registration failures, and host state transitions
(`TransitionDeviceState`).
**Resolution:** Resolved 2026-05-22 — added optional `ILogger<TwinCATDriver>` constructor parameter (defaults to `NullLogger`); logs connect success/failure in `EnsureConnectedAsync`, ADS read errors in `ReadAsync`, symbol-browse fallback in `DiscoverAsync`, notification-registration failures in `SubscribeAsync`, and host-state transitions in `TransitionDeviceState`.
### Driver.TwinCAT-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `TwinCATDriver.cs:406-411` |
| Status | Open |
**Description:** `ResolveHost` falls back to `DriverInstanceId` when there are no configured
devices and the reference is unknown. `DriverInstanceId` is a logical config-DB identifier,
not a host address; `IPerCallHostResolver` consumers expect a host key that correlates with
`GetHostStatuses()` entries (`HostConnectivityStatus.HostName` equals
`device.Options.HostAddress`). Returning the instance ID produces a host key that matches no
connectivity-status row.
**Recommendation:** Return a stable sentinel that will not be confused with a real host (an
empty string or a documented unresolved marker), or document why the instance ID is the chosen
fallback. Prefer the first device HostAddress only when one exists (already done).
**Resolution:** _(open)_
### Driver.TwinCAT-007
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `TwinCATDriver.cs:413-429` |
| Status | Resolved |
**Description:** `EnsureConnectedAsync` is not thread-safe. `ReadAsync`, `WriteAsync`,
`SubscribeAsync`, and the per-device `ProbeLoopAsync` background task can all call it
concurrently for the same `DeviceState`. The sequence `device.Client ??= _clientFactory.Create()`
followed by `await device.Client.ConnectAsync(...)` has no lock: two threads can both observe
`device.Client` null-or-disconnected, each create/connect a client, and one
`AdsTwinCATClient` is leaked (its `AdsClient` + `AdsNotificationEx` handler never disposed).
Worse, on the connect-failure path one thread does `device.Client.Dispose(); device.Client = null;`
while another thread is mid-`ConnectAsync` on that same client instance — a disposal race that
can throw `ObjectDisposedException` or corrupt the `AdsClient`. The probe loop runs
continuously, so this race is not hypothetical under any concurrent read/write load.
**Recommendation:** Guard `EnsureConnectedAsync` per-device with a `SemaphoreSlim` (one per
`DeviceState`), or use an async-lazy connect with proper invalidation. The S7/AB-CIP drivers
serialize device access with a `SemaphoreSlim` — follow that pattern. Note this also
serializes the wire, which `docs/v2/driver-specs.md` recommends for single-connection-per-PLC.
**Resolution:** Resolved 2026-05-22 — EnsureConnectedAsync is now serialized per device by a SemaphoreSlim (DeviceState.ConnectGate) with a double-checked connect, mirroring the S7 driver; no client is leaked and no disposal race remains.
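The gate-plus-double-check pattern named in the resolution can be sketched generically as below. `GatedConnection<TClient>` is a hypothetical type, not the driver's `DeviceState`; the real code also has to re-check connection state and invalidate a failed client, which this sketch leaves out.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class GatedConnection<TClient> where TClient : class
{
    private readonly SemaphoreSlim _gate = new(1, 1);
    private readonly Func<CancellationToken, Task<TClient>> _connect;
    private TClient? _client;

    public GatedConnection(Func<CancellationToken, Task<TClient>> connect) => _connect = connect;

    public async Task<TClient> EnsureConnectedAsync(CancellationToken ct)
    {
        // First check: fast path without taking the gate.
        var existing = Volatile.Read(ref _client);
        if (existing is not null) return existing;

        await _gate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            // Second check: another caller may have connected while we waited.
            if (_client is not null) return _client;
            _client = await _connect(ct).ConfigureAwait(false);
            return _client;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

With twenty concurrent callers, the connect delegate runs once and every caller receives the same client, so nothing is created only to be leaked.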
### Driver.TwinCAT-008
| Field | Value |
|---|---|
| Severity | High |
| Category | Concurrency & thread safety |
| Location | `AdsTwinCATClient.cs:162-169`, `TwinCATDriver.cs:319-324` |
| Status | Resolved |
**Description:** Native ADS notification callbacks (`OnAdsNotificationEx`) run on the
`AdsClient` AMS router thread. `docs/v2/driver-specs.md` section 6 explicitly calls this out
as a code-review checklist item: "Callbacks must marshal to a managed work queue immediately
(no driver logic on the router thread) — blocking the router thread blocks every ADS
notification across the process." The current path invokes `reg.OnChange(...)` directly on the
router thread, and `OnChange` is the driver lambda that calls `OnDataChange?.Invoke(this, ...)`
— i.e. every downstream Core/OPC UA subscriber handler executes synchronously on the AMS
router thread. A single slow consumer stalls ADS notification delivery for every tag on every
device in the process.
**Recommendation:** Marshal notification values onto a bounded `Channel`/work queue drained by
a dedicated managed task before invoking `OnChange`/`OnDataChange`, exactly as the Galaxy
`EventPump` does. Keep the router-thread callback to a non-blocking enqueue only.
**Resolution:** Resolved 2026-05-22 — AdsTwinCATClient now enqueues native AdsNotificationEx callbacks onto a bounded Channel drained by a dedicated managed task; the AMS router thread only does a non-blocking TryWrite, so a slow consumer cannot stall ADS delivery.
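A minimal sketch of the enqueue-and-drain arrangement follows. `NotificationPump<T>` is a hypothetical name, and the `DropOldest` full mode is an assumption chosen for illustration; the review does not say which overflow policy the driver uses.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class NotificationPump<T>
{
    private readonly Channel<T> _queue;
    private readonly Task _drain;

    public NotificationPump(int capacity, Action<T> onChange)
    {
        _queue = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            // Assumption: under overload the oldest queued value is discarded.
            FullMode = BoundedChannelFullMode.DropOldest,
            SingleReader = true,
        });

        // Dedicated managed task: all consumer logic runs here, off the router thread.
        _drain = Task.Run(async () =>
        {
            await foreach (var item in _queue.Reader.ReadAllAsync().ConfigureAwait(false))
                onChange(item);
        });
    }

    // Called from the notification callback: a non-blocking enqueue only.
    public bool Enqueue(T item) => _queue.Writer.TryWrite(item);

    // Stops accepting items and completes once the queue has been drained.
    public Task CompleteAsync()
    {
        _queue.Writer.TryComplete();
        return _drain;
    }
}
```

The callback thread only ever pays for `TryWrite`, so a slow `onChange` delays its own queue rather than notification delivery for other tags.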
### Driver.TwinCAT-009
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `TwinCATDriver.cs:80-99`, `41-72`, `366-388` |
| Status | Resolved |
**Description:** `ShutdownAsync` mutates `_devices`, `_tagsByName`, and `_nativeSubs` with no
synchronization while `ReadAsync`/`WriteAsync`/`SubscribeAsync` may be iterating or indexing
those same plain `Dictionary<>` instances on other threads (`_devices` and `_tagsByName` are
non-concurrent dictionaries). `ShutdownAsync` calls `_devices.Clear()`/`_tagsByName.Clear()`
concurrently with `_devices.TryGetValue` in a read — `Dictionary<>` is not safe for concurrent
read+write and can throw or corrupt internal state. `ReinitializeAsync` makes this worse: it
runs `ShutdownAsync` then `InitializeAsync` (which re-populates the same dictionaries) while
in-flight reads continue. The probe loop `EnsureConnectedAsync` also touches `DeviceState`
objects that `ShutdownAsync` is disposing — `ShutdownAsync` cancels `ProbeCts` but does not
await the probe task before calling `DisposeClient()`.
**Recommendation:** Either swap `_devices`/`_tagsByName` to `ConcurrentDictionary` and snapshot
them on rebuild, or introduce a lifecycle lock / `volatile` running guard so reads fail fast
with `BadServerHalted`/`BadNodeIdUnknown` once shutdown begins. Cancel and await the probe
tasks before disposing `DeviceState`s.
**Resolution:** Resolved 2026-05-22 — swapped `_devices` and `_tagsByName` to `ConcurrentDictionary` so concurrent `TryGetValue` / `Clear` calls are safe; added `DeviceState.ProbeTask` and updated `ShutdownAsync` to cancel then `await` each probe task before disposing the client and gate, eliminating the disposal race.
### Driver.TwinCAT-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `AdsTwinCATClient.cs:178-195` |
| Status | Resolved |
**Description:** `BrowseSymbolsAsync` checks `cancellationToken.IsCancellationRequested` and
does `yield break` (a clean completion) rather than throwing `OperationCanceledException`.
`DiscoverAsync` (`TwinCATDriver.cs:274`) explicitly has `catch (OperationCanceledException)
{ throw; }` to propagate cancellation distinctly from a genuine browse failure. Because the
client never throws on cancellation, a cancelled discovery silently completes as if the
symbol table were fully and successfully walked — the address space is built from a partial
symbol set with no indication it was truncated. The `SymbolLoaderFactory.Create` /
`loader.Symbols` enumeration itself is also not cancellable.
**Recommendation:** Call `cancellationToken.ThrowIfCancellationRequested()` instead of
`yield break` so a cancelled browse surfaces as cancellation, not as a successful but partial
discovery.
**Resolution:** Resolved 2026-05-22 — replaced `yield break` with `cancellationToken.ThrowIfCancellationRequested()` in the `foreach` loop so a cancelled browse propagates as `OperationCanceledException`, matching the `DiscoverAsync` expectation.
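The difference between the two behaviours shows up clearly in a small async iterator. This is an illustrative sketch with hypothetical names; it enumerates an in-memory list in place of the ADS symbol loader.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class BrowseSketch
{
    public static async IAsyncEnumerable<string> BrowseAsync(
        IEnumerable<string> symbols,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        foreach (var symbol in symbols)
        {
            // Throws OperationCanceledException instead of ending the sequence,
            // so the caller can tell a cancelled walk from a complete one.
            cancellationToken.ThrowIfCancellationRequested();
            yield return symbol;
            await Task.Yield();
        }
    }
}
```

With `yield break` in place of the throw, a consumer's `await foreach` would finish normally and could not distinguish a truncated symbol set from a full one.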
### Driver.TwinCAT-011
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `TwinCATStatusMapper.cs:29-42` |
| Status | Resolved |
**Description:** ADS error-code mapping has gaps and an inconsistency versus
`docs/v2/driver-specs.md` section 6. The spec documents symbol-not-found as 0x0701
(1793 decimal) and symbol-version-changed as 0x0702 (1794 decimal). `MapAdsError` maps
decimal 1798 to `BadNodeIdUnknown` (symbol not found) and 1793/1794 to `BadOutOfRange`
(invalid index group/offset). The decimal-vs-hex interpretation of the documented codes does
not line up, so the mapper appears to treat the symbol-version-changed code as a generic range
error. 0x0710 "Not ready / PLC in Config mode" has no entry and falls through to
`BadCommunicationError`; the driver-spec recommends distinguishing it. And 0x0702
symbol-version-changed is never routed to rediscovery (see Driver.TwinCAT-013).
**Recommendation:** Confirm the actual `AdsErrorCode` numeric values from
`Beckhoff.TwinCAT.Ads` (the SDK enum, not the doc hex shorthand) and align the mapper. Add an
explicit case for symbol-version-changed routed to rediscovery, and for PLC-in-Config mapped
to `BadOutOfService`/`BadInvalidState`.
**Resolution:** Resolved 2026-05-22 — confirmed all codes from `Beckhoff.TwinCAT.Ads` 7.0.172 `AdsErrorCode` enum. Rewrote `MapAdsError` with 20 explicit cases keyed to the correct decimal values. Fixed the critical bug: `AdsSymbolVersionChanged` was `0x0702u` (= `DeviceInvalidGroup`) but the actual `DeviceSymbolVersionInvalid` is 1809 (0x0711); corrected constant and updated all comments. Added `BadOutOfService` for `DeviceNotReady` (PLC not running) and `BadInvalidState` for `DeviceInvalidState` (PLC in Config mode, 0x0712) and `DeviceSymbolVersionInvalid` (0x0711). Added `BadOutOfService`/`BadInvalidState` OPC UA StatusCode constants to the mapper.
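The hex and decimal forms of the three codes named in the resolution can be pinned down in one place, as in the sketch below. It covers only those three values, returns status names as strings rather than real OPC UA status codes, and uses hypothetical class and method names; whether the rewritten `MapAdsError` still maps `DeviceInvalidGroup` to `BadOutOfRange` is an assumption carried over from the original description.

```csharp
public static class AdsCodeSketch
{
    // Values as recorded in the resolution, with the decimal form alongside.
    public const uint DeviceInvalidGroup = 0x0702;         // 1794
    public const uint DeviceSymbolVersionInvalid = 0x0711; // 1809
    public const uint DeviceInvalidState = 0x0712;         // 1810

    public static string MapToStatusName(uint adsError) => adsError switch
    {
        DeviceInvalidGroup => "BadOutOfRange",             // assumed unchanged
        DeviceSymbolVersionInvalid => "BadInvalidState",
        DeviceInvalidState => "BadInvalidState",
        _ => "BadCommunicationError",
    };
}
```

Keying the cases to named constants, instead of bare literals, makes a hex-versus-decimal mix-up like the one in this finding visible at the declaration.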
### Driver.TwinCAT-012
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `TwinCATDriver.cs:102`, `AdsTwinCATClient.cs:178-195` |
| Status | Resolved |
**Description:** `GetMemoryFootprint()` returns a hard-coded 0. `docs/v2/driver-stability.md`
section "In-process only (Tier A/B) — driver-instance allocation tracking" requires the
footprint to reflect "bytes attributable to their own caches (symbol cache, subscription
items, queued operations)", and section 6 of `driver-specs.md` explicitly identifies cached
symbol info as "the largest in-driver allocation" for TwinCAT and ties `FlushOptionalCachesAsync`
to flushing it. Reporting 0 means Core allocation-slope detection and cache-budget enforcement
are blind to this driver, and `FlushOptionalCachesAsync` is a no-op. (Note: the current
`BrowseSymbolsAsync` does not retain a symbol cache — it streams and discards — so
re-discovery re-downloads the whole symbol table each time, itself a performance concern for
`EnableControllerBrowse` deployments.)
**Recommendation:** Either implement an actual symbol cache + report its size via
`GetMemoryFootprint()` and flush it in `FlushOptionalCachesAsync`, or, if the
stream-and-discard design is intentional, report the real footprint of `_nativeSubs` /
`_tagsByName` and document that the driver holds no flushable cache.
**Resolution:** Resolved 2026-05-22 — `GetMemoryFootprint()` now returns `(_tagsByName.Count * 256L) + (_nativeSubs.Count * 512L)`; documented that the driver has no flushable symbol cache (stream-and-discard design) so `FlushOptionalCachesAsync` remains a documented no-op.
### Driver.TwinCAT-013
| Field | Value |
|---|---|
| Severity | High |
| Category | Design-document adherence |
| Location | `TwinCATDriver.cs:11-12` (capability list), whole file |
| Status | Resolved |
**Description:** `TwinCATDriver` does not implement `IRediscoverable`. Both
`docs/v2/driver-specs.md` section 6 and `docs/v2/driver-stability.md` section "TwinCAT — Deep
Dive" state this as the defining TwinCAT failure mode: "Symbol-version-changed (0x0702) is
the unique TwinCAT failure mode... The driver must catch 0x0702, mark its symbol cache
invalid, re-upload symbols, rebuild the address space subtree... Treat this as a
`IRediscoverable` invocation, not as a connection error." The `IRediscoverable` XML doc names
TwinCAT symbol-version-changed as a canonical example. The current driver maps the error to a
generic `BadOutOfRange`/`BadCommunicationError` quality code and never re-runs discovery, so
after a PLC program re-download every symbol handle and notification silently goes stale with
no address-space rebuild.
**Recommendation:** Implement `IRediscoverable`; detect the symbol-version-changed ADS error
on read/write/notification paths, raise `OnRediscoveryNeeded` with a scoped reason string, and
re-establish native notifications after the Core re-runs `DiscoverAsync`. This is explicitly
part of the documented driver contract, not optional.
**Resolution:** Resolved 2026-05-22 — TwinCATDriver implements IRediscoverable; AdsTwinCATClient detects ADS 0x0702 on read/write paths and raises OnSymbolVersionChanged, which the driver forwards as OnRediscoveryNeeded so Core rebuilds the address space.
### Driver.TwinCAT-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `TwinCATDriverOptions.cs:41-43`, `TwinCATDriverOptions.cs:57-62`, `AdsTwinCATClient.cs:145` |
| Status | Open |
**Description:** The implemented config surface drifts from
`docs/v2/driver-specs.md` section 6 in three places. (1) The spec connection-settings list has
separate `Host` (IP), `AmsNetId`, and `AmsPort` fields; the implementation collapses these into
a single `HostAddress` string parsed as `ads://{netId}:{port}`, so the target device IP has no
home field. (2) `TwinCATProbeOptions.Timeout` (`TwinCATDriverOptions.cs:61`) is never read: the
probe path connects via `_options.Timeout`, which makes it a dead config field. (3) The spec
lists `NotificationMaxDelayMs`; the code hard-codes max-delay 0 in `NotificationSettings`
(`AdsTwinCATClient.cs:145`) with no config knob.
**Recommendation:** Reconcile the driver-spec doc with the implemented `TwinCATDriverOptions`
shape (the doc is DRAFT, so updating it is acceptable). Remove or wire up
`TwinCATProbeOptions.Timeout`. Expose `NotificationMaxDelayMs` if batching control is wanted.
**Resolution:** _(open)_
### Driver.TwinCAT-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `TwinCATDriver.cs:431-432` |
| Status | Open |
**Description:** `Dispose()` blocks on `DisposeAsync().AsTask().GetAwaiter().GetResult()`, a
sync-over-async call. `docs/v2/driver-stability.md` section Galaxy explicitly lists "sync-over-async
on the OPC UA stack thread" among the four 2026-04-13 stability findings that had to be
closed. `DisposeAsync` calls `ShutdownAsync`, which awaits `_poll.DisposeAsync()` and disposes
clients; if `Dispose()` is ever called on a thread with a single-threaded synchronization
context (the OPC UA stack), `GetResult()` can deadlock.
**Recommendation:** Make `Dispose()` perform a genuinely synchronous teardown. The operations
here (cancelling token sources, disposing clients, clearing dictionaries) are all
synchronous, and `PollGroupEngine.DisposeAsync` completes synchronously. Factor that teardown
into a synchronous method that both `Dispose()` and `DisposeAsync()` call, so `Dispose()`
never blocks on a `Task`.
**Resolution:** _(open)_
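One way to shape the recommended teardown is sketched below. It is a minimal sketch, not the driver's code: `DriverTeardownSketch`, `TeardownCore`, `AddClient`, and the field names are hypothetical, and the poll engine is left out on the finding's stated assumption that `PollGroupEngine.DisposeAsync` completes synchronously.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the driver's dispose surface.
public sealed class DriverTeardownSketch : IDisposable, IAsyncDisposable
{
    private readonly CancellationTokenSource _cts = new();
    private readonly Dictionary<string, IDisposable> _clients = new();
    private int _disposed; // 0 = live, 1 = torn down

    public bool IsDisposed => Volatile.Read(ref _disposed) == 1;

    public void AddClient(string key, IDisposable client) => _clients[key] = client;

    // Shared synchronous teardown; safe to call more than once.
    private void TeardownCore()
    {
        if (Interlocked.Exchange(ref _disposed, 1) == 1)
        {
            return;
        }

        _cts.Cancel();
        foreach (var client in _clients.Values)
        {
            client.Dispose();
        }
        _clients.Clear();
        _cts.Dispose();
    }

    // No Task is created or awaited, so there is nothing to block on.
    public void Dispose() => TeardownCore();

    // The async path reuses the same teardown and returns an already-completed ValueTask.
    public ValueTask DisposeAsync()
    {
        TeardownCore();
        return ValueTask.CompletedTask;
    }
}
```

Because `Dispose()` never touches a `Task`, it cannot deadlock on a single-threaded synchronization context, and the `Interlocked.Exchange` guard keeps a second `Dispose()` or `DisposeAsync()` call harmless.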
### Driver.TwinCAT-016
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Tests/` |
| Status | Open |
**Description:** Unit coverage exists for AMS-address parsing, symbol-path parsing, read/write,
native notifications, symbol browse, and the capability surface. Gaps tied to the findings
above: no test exercises `ReinitializeAsync` with a changed config (Driver.TwinCAT-001 would
have been caught); no concurrency test drives `ReadAsync`/`WriteAsync`/probe against one
device simultaneously (Driver.TwinCAT-007/009); no test covers the symbol-version-changed to
rediscovery path (Driver.TwinCAT-013, currently unimplemented); no test covers a `Structure`-
typed pre-declared tag (Driver.TwinCAT-003); no test asserts 64-bit `LInt`/`ULInt` round-trip
without truncation (Driver.TwinCAT-002).
**Recommendation:** Add unit tests for the above paths once the corresponding findings are
addressed, especially a concurrency stress test for `EnsureConnectedAsync` and a
`ReinitializeAsync`-applies-new-config test.
**Resolution:** _(open)_
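The concurrency stress test recommended above could assert something like the following. This is a sketch under assumptions: the real `EnsureConnectedAsync` implementation is not shown in this review, so `ConnectOnceSketch` is a hypothetical stand-in guarded by a `SemaphoreSlim`, and `ConnectOnceStress.RunAsync` is an illustrative harness rather than a test from the repository's test project.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a per-device connection guard.
public sealed class ConnectOnceSketch
{
    private readonly SemaphoreSlim _gate = new(1, 1);
    private bool _connected;
    private int _connectCalls;

    // Number of times the simulated handshake actually ran.
    public int ConnectCalls => Volatile.Read(ref _connectCalls);

    public async Task EnsureConnectedAsync(CancellationToken ct = default)
    {
        await _gate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            if (_connected)
            {
                return;
            }

            Interlocked.Increment(ref _connectCalls);
            await Task.Delay(10, ct).ConfigureAwait(false); // simulated connection handshake
            _connected = true;
        }
        finally
        {
            _gate.Release();
        }
    }
}

public static class ConnectOnceStress
{
    // Drives many concurrent callers at one instance and reports how many handshakes ran.
    public static async Task<int> RunAsync(int callers)
    {
        var sut = new ConnectOnceSketch();
        var tasks = Enumerable.Range(0, callers)
            .Select(_ => Task.Run(() => sut.EnsureConnectedAsync()));
        await Task.WhenAll(tasks).ConfigureAwait(false);
        return sut.ConnectCalls;
    }
}
```

A real test would replace the stand-in with the driver and a fake client factory, then assert that concurrent `ReadAsync`/`WriteAsync`/probe callers against one device trigger exactly one connect.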
@@ -0,0 +1,392 @@
# Code Reviews
<!-- GENERATED FILE - do not edit by hand. Regenerate with: python code-reviews/regen-readme.py -->
Cross-module code review index for the OtOpcUa server codebase (`lmxopcua`). The review process is defined in [../REVIEW-PROCESS.md](../REVIEW-PROCESS.md).
Each module's `findings.md` is the source of truth; this file is generated from them by `regen-readme.py` and must not be edited by hand.
## Module status
| Module | Reviewer | Date | Commit | Status | Open | Total |
|---|---|---|---|---|---|---|
| [Admin](Admin/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 3 | 13 |
| [Analyzers](Analyzers/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 7 |
| [Client.CLI](Client.CLI/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 8 | 10 |
| [Client.Shared](Client.Shared/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 11 |
| [Client.UI](Client.UI/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 6 | 11 |
| [Configuration](Configuration/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 11 |
| [Core](Core/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 6 | 12 |
| [Core.Abstractions](Core.Abstractions/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 8 |
| [Core.AlarmHistorian](Core.AlarmHistorian/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 2 | 11 |
| [Core.ScriptedAlarms](Core.ScriptedAlarms/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 6 | 12 |
| [Core.Scripting](Core.Scripting/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 11 |
| [Core.VirtualTags](Core.VirtualTags/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 7 | 13 |
| [Driver.AbCip](Driver.AbCip/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 15 |
| [Driver.AbCip.Cli](Driver.AbCip.Cli/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 6 | 8 |
| [Driver.AbLegacy](Driver.AbLegacy/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 3 | 13 |
| [Driver.AbLegacy.Cli](Driver.AbLegacy.Cli/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 6 | 7 |
| [Driver.Cli.Common](Driver.Cli.Common/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 2 | 6 |
| [Driver.FOCAS](Driver.FOCAS/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 12 |
| [Driver.FOCAS.Cli](Driver.FOCAS.Cli/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 5 |
| [Driver.Galaxy](Driver.Galaxy/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 4 | 14 |
| [Driver.Historian.Wonderware](Driver.Historian.Wonderware/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 7 | 12 |
| [Driver.Historian.Wonderware.Client](Driver.Historian.Wonderware.Client/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 10 |
| [Driver.Modbus](Driver.Modbus/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 7 | 12 |
| [Driver.Modbus.Addressing](Driver.Modbus.Addressing/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 3 | 9 |
| [Driver.Modbus.Cli](Driver.Modbus.Cli/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 6 | 8 |
| [Driver.OpcUaClient](Driver.OpcUaClient/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 2 | 15 |
| [Driver.S7](Driver.S7/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 14 |
| [Driver.S7.Cli](Driver.S7.Cli/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 4 | 7 |
| [Driver.TwinCAT](Driver.TwinCAT/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 5 | 16 |
| [Driver.TwinCAT.Cli](Driver.TwinCAT.Cli/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 7 | 7 |
| [Server](Server/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 6 | 15 |
## Pending findings
Findings with status `Open` or `In Progress`, ordered by severity.
| ID | Severity | Category | Location | Description |
|---|---|---|---|---|
| Admin-010 | Low | OtOpcUa conventions | `Components/App.razor:9,16` | `App.razor` loads Bootstrap CSS and JS from the `cdn.jsdelivr.net` CDN. `admin-ui.md` section "Tech Stack" specifies "Bootstrap 5 vendored under `wwwroot/lib/bootstrap/`" precisely so the Admin app has no third-party runtime dependency. A… |
| Admin-011 | Low | Concurrency & thread safety | `Hubs/FleetStatusPoller.cs:24-26,98-103` | `FleetStatusPoller` keeps three plain `Dictionary<>` fields (`_last`, `_lastRole`, `_lastResilience`) mutated from `PollOnceAsync`. The poller `ExecuteAsync` loop is single-threaded so the steady-state poll path is safe, but `ResetCache()`… |
| Admin-012 | Low | Design-document adherence | `Services/EquipmentCsvImporter.cs:18-19,33-37,229,232` | `EquipmentCsvImporter` declares `EquipmentId` as a required CSV column and parses it into a `required` field. `admin-ui.md` section "Equipment CSV import" (revised after adversarial review finding #4) is explicit: "No `EquipmentId` column… |
| Analyzers-002 | Low | Correctness & logic bugs | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:46-50,130` | `AlarmSurfaceInvoker` is listed in `WrapperTypes`, but `AlarmSurfaceInvoker`'s public methods (`SubscribeAsync`, `UnsubscribeAsync`, `AcknowledgeAsync`) take no lambda arguments at all — callers pass `IReadOnlyList<...>` / `IAlarmSubscript… |
| Analyzers-003 | Low | Error handling & resilience | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:80,114-116` | `IsInsideWrapperLambda` is passed `context.Operation.SemanticModel` and returns `false` when that model is `null`. A `false` return means "not wrapped", so a null semantic model produces a false-positive diagnostic rather than silently ski… |
| Analyzers-004 | Low | Performance & resource management | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:95-112` | `ImplementsGuardedInterface` runs on every invocation operation in the compilation (every keystroke in the IDE). For each candidate it allocates via `AllInterfaces.Concat(new[] { method.ContainingType })`, builds a fully-qualified display… |
| Analyzers-005 | Low | Design-document adherence | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:33-43` | `CapabilityInvoker`'s XML doc (`src/Core/.../Resilience/CapabilityInvoker.cs:15-17`) enumerates the routed capability surface as `IReadable`, `IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`, `IAlarmSource`, and all… |
| Analyzers-007 | Low | Documentation & comments | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:21-26` | The `<remarks>` block states the analyzer "matches by receiver-interface identity using Roslyn's semantic model, not by method name". This is accurate for the guarded-call detection (`ImplementsGuardedInterface` uses symbols), but the wrap… |
| Client.CLI-002 | Low | Correctness & logic bugs | `Commands/SubscribeCommand.cs:129-137` | The summary computes `neverWentBad` as every target whose node-id key is absent from the `everBad` dictionary. A node that received no update at all is also absent from `everBad`, so it is counted in `neverWentBad` and printed under the he… |
| Client.CLI-003 | Low | Correctness & logic bugs | `Commands/BrowseCommand.cs:29-30`, `Commands/SubscribeCommand.cs:20-27`, `Commands/AlarmsCommand.cs:28-29`, `Commands/HistoryReadCommand.cs:42-43` | Numeric command options accept any value with no range validation. `--depth`, `--interval`, `--max-depth`, `--max`, and the history `--interval` can all be supplied as `0` or a negative number. A negative `--depth`/`--max-depth` silently d… |
| Client.CLI-004 | Low | OtOpcUa conventions | `Commands/SubscribeCommand.cs:13-37` | `SubscribeCommand` is the only command in the module whose constructor and all `[CommandOption]` properties have no XML doc comments. Every other command (`ConnectCommand`, `ReadCommand`, `WriteCommand`, `BrowseCommand`, `AlarmsCommand`, `… |
| Client.CLI-006 | Low | Error handling & resilience | `Commands/HistoryReadCommand.cs:73`, `Commands/HistoryReadCommand.cs:76`, `Helpers/NodeIdParser.cs:39` | Operator input-format errors surface as raw .NET exceptions rather than clean CLI errors. An unparseable start/end value throws `FormatException` straight out of `DateTime.Parse`; an invalid node id throws `FormatException`/`ArgumentExcept… |
| Client.CLI-007 | Low | Performance & resource management | `CommandBase.cs:112-123` | `ConfigureLogging` builds a new Serilog `LoggerConfiguration`, creates a logger, and assigns it to the static `Log.Logger` without disposing the previously assigned logger. For a single CLI invocation this leaks at most one logger and the… |
| Client.CLI-008 | Low | Documentation & comments | `docs/Client.CLI.md:158-217` | `docs/Client.CLI.md` is stale relative to the code at this commit. (1) The `subscribe` command section documents only `-n` and `-i`, but the code (`SubscribeCommand`) also exposes `-r/--recursive`, `--max-depth`, `-q/--quiet`, `--duration`… |
| Client.CLI-009 | Low | Code organization & conventions | `Commands/SubscribeCommand.cs:66-165`, `Commands/AlarmsCommand.cs:52-91` | Both long-running commands attach an event handler (`service.DataChanged += ...`, `service.AlarmEvent += ...`) with a lambda and never detach it. Because the handler closes over `console`, the captured console and the closure remain refere… |
| Client.CLI-010 | Low | Testing coverage | `tests/Client/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests/SubscribeCommandTests.cs` | The new `SubscribeCommand` capabilities are largely untested. The four `SubscribeCommandTests` cover only single-node subscribe, unsubscribe-on-cancel, disconnect-in-finally, and the subscription message. There is no test for the `--recurs… |
| Client.Shared-003 | Low | Correctness & logic bugs | `Adapters/DefaultSessionAdapter.cs:76`, `Adapters/DefaultSessionAdapter.cs:273` | `WriteValueAsync` returns `response.Results[0]` and `CallMethodAsync` reads `result.Results[0]` without first checking the `Results` collection is non-empty. A malformed or service-level-faulted response (empty `Results` alongside a servic… |
| Client.Shared-004 | Low | OtOpcUa conventions | `Adapters/DefaultSessionAdapter.cs:228`, `Adapters/DefaultSessionAdapter.cs:121`, `Adapters/DefaultSessionAdapter.cs:172` | `CloseAsync`, `HistoryReadRawAsync`, and `HistoryReadAggregateAsync` are declared `async Task` but call the synchronous `Session.Close()` / `Session.HistoryRead(...)` APIs and contain no `await`. The history methods run a blocking synchron… |
| Client.Shared-009 | Low | Error handling & resilience / Documentation & comments | `OpcUaClientService.cs:302-322` | `AcknowledgeAlarmAsync` is typed `Task<StatusCode>` and its XML doc implies the returned code reports the ack outcome, but the method unconditionally `return StatusCodes.Good`. The actual failure path is `DefaultSessionAdapter.CallMethodAs… |
| Client.Shared-010 | Low | Performance & resource management | `Models/ConnectionSettings.cs:48`, `OpcUaClientService.cs:408-417` | `ConnectionSettings.CertificateStorePath` is initialized to `ClientStoragePaths.GetPkiPath()` as a property initializer, so every `ConnectionSettings` instantiation runs `Environment.GetFolderPath` + `Path.Combine` and, on the first call p… |
| Client.Shared-011 | Low | Testing coverage | `tests/Client/ZB.MOM.WW.OtOpcUa.Client.Shared.Tests/OpcUaClientServiceTests.cs` | The test suite is solid for the happy paths, connection lifecycle, and single-failover behavior. Gaps relative to the findings above: (a) no test exercises concurrent `SubscribeAsync`/failover to expose the `_activeDataSubscriptions` race… |
| Client.UI-003 | Low | OtOpcUa conventions | `ZB.MOM.WW.OtOpcUa.Client.UI.csproj:20-21`, `Program.cs:14-20` | The csproj references `Serilog` and `Serilog.Sinks.Console`, and `docs/Client.UI.md` lists Serilog as the logging technology, but no source file in the module uses Serilog. `Program.BuildAvaloniaApp()` uses Avalonia's `LogToTrace()` and th… |
| Client.UI-004 | Low | OtOpcUa conventions | `Views/MainWindow.axaml.cs:125-138` | `OnBrowseCertPathClicked` uses `OpenFolderDialog`, which is obsolete in Avalonia 11.x (the version pinned in the csproj). The supported replacement is the `StorageProvider` API (`StorageProvider.OpenFolderPickerAsync`). Using the obsolete… |
| Client.UI-006 | Low | Error handling & resilience | `ViewModels/MainWindowViewModel.cs:244-252`, `ViewModels/AlarmsViewModel.cs:88-112`, `ViewModels/SubscriptionsViewModel.cs:79-94` | Many catch blocks swallow exceptions silently with an empty body and only a comment (`// Redundancy info not available`, `// Subscribe failed`, `// Subscription failed; no item added`, and others). When a subscribe, alarm-subscribe, or red… |
| Client.UI-009 | Low | Design-document adherence | `ViewModels/HistoryViewModel.cs:44-54` | `HistoryViewModel.AggregateTypes` exposes eight entries: `null` (Raw) plus Average, Minimum, Maximum, Count, Start, End, and `StandardDeviation`. `docs/Client.UI.md` ("Query Options" table) lists only "Raw (default), Average, Minimum, Maxi… |
| Client.UI-010 | Low | Code organization & conventions | `Controls/DateTimeRangePicker.axaml.cs:33-37`, `Controls/DateTimeRangePicker.axaml.cs:70-80` | `DateTimeRangePicker` declares `MinDateTimeProperty` / `MaxDateTimeProperty` styled properties with public CLR accessors, but neither is read anywhere in the control. `TryParseDateTime`, `OnStartLostFocus`, and `OnEndLostFocus` never clamp… |
| Client.UI-011 | Low | Documentation & comments | `Views/MainWindow.axaml:81`, `Services/JsonSettingsService.cs:11-15` | The certificate-store-path `TextBox` watermark reads `(default: AppData/LmxOpcUaClient/pki)`, referencing the legacy pre-task-#208 folder name. Per `CLAUDE.md` / `docs/Client.UI.md` the canonical path is now `{LocalAppData}/OtOpcUaClient/`… |
| Configuration-004 | Low | OtOpcUa conventions | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Enums/NodePermissions.cs:8`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/OtOpcUaConfigDbContext.cs:417` | `NodePermissions` is declared `[Flags] enum ... : uint`, while its XML doc and `NodeAcl.PermissionFlags`' doc both say "stored as int", and `ConfigureNodeAcl` uses `HasConversion<int>()`, a `uint`-to-`int` conversion. Only bits 0-11 are used… |
| Configuration-005 | Low | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/LiteDbConfigCache.cs:50` | `PutAsync` performs a non-atomic find-then-insert/update. Two concurrent `PutAsync` calls for the same `(ClusterId, GenerationId)` can both observe `existing is null` and both `Insert`, producing two rows for one generation. The constructo… |
| Configuration-007 | Low | Error handling & resilience | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Apply/GenerationApplier.cs:44` | `ApplyPass` wraps each callback in `catch (Exception ex)`. This swallows `OperationCanceledException` — a cancellation during a callback is recorded as just another entity error string and the applier keeps walking the remaining passes ins… |
| Configuration-010 | Low | Security | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/ResilientConfigReader.cs:81` | On central-DB read failure the warning log records the full exception object. Callers pass arbitrary `centralFetch` delegates; if any delegate closes over a connection string, an exception thrown from it (or a `SqlException` carrying serve… |
| Configuration-011 | Low | Testing coverage | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Apply/GenerationApplier.cs:7`, `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs:60` | The companion test project covers the cache, schema compliance, stored procedures, and `DraftValidator` well, but two flagged behaviours are not pinned: (a) `GenerationApplier` ordering/cancellation when a Removed callback fails — no test… |
| Core-004 | Low | OtOpcUa conventions | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs:55,72,87` | `DriverHost` is a library type whose async calls (`driver.InitializeAsync`, `driver.ShutdownAsync`) do not use `ConfigureAwait(false)`, whereas the sibling `CapabilityInvoker` and `AlarmSurfaceInvoker` in the same module consistently do. T… |
| Core-008 | Low | Error handling & resilience | `src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64` | The XML summary of `BuildAddressSpaceAsync` states "Driver exceptions are isolated per decision #12 — the driver's subtree is marked Faulted, but other drivers remain available." The method body contains no such isolation: an exception fro… |
| Core-009 | Low | Performance & resource management | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs:121-128` | `ExecuteWriteAsync` calls `_optionsAccessor()` three times for a single non-idempotent write (once for the `with` expression, once inside the dictionary initializer for `.Resolve(...)`, plus the discarded base). On the per-write hot path i… |
| Core-010 | Low | Code organization & conventions | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/DriverResilienceOptions.cs:45-52` | `DriverResilienceOptions.Resolve` indexes the tier-default dictionary directly (`defaults[capability]`) with no fallback. Any future addition to `DriverCapability` that is not also added to all three tier tables in `GetTierDefaults` will m… |
| Core-011 | Low | Testing coverage | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieBuilder.cs:58-75` | `PermissionTrieBuilder.Descend` has a two-branch behaviour: with a `scopePaths` lookup it descends the real hierarchy; without one it falls back to placing every non-cluster row directly under the root keyed by `ScopeId` ("works for determ… |
| Core-012 | Low | Documentation & comments | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Stability/WedgeDetector.cs:26`, `src/Core/ZB.MOM.WW.OtOpcUa.Core/Observability/DriverHealthReport.cs:11-22` | Two stale doc comments. (1) `WedgeDetector` — the `<summary>` above the constructor reads "Whether the driver reported itself `DriverState.Healthy` at construction." The constructor takes only a `TimeSpan threshold` and the detector is doc… |
| Core.Abstractions-004 | Low | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverTypeRegistry.cs:23-40` | `Register` performs a check-then-act sequence (`snapshot.ContainsKey` then build `next` then `Interlocked.Exchange`) that is not atomic. Two threads registering concurrently can both pass the duplicate check and both build a `next` diction… |
| Core.Abstractions-005 | Low | Error handling & resilience | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/PollGroupEngine.cs:90,99` | Both the initial-poll and steady-state catch blocks use a bare `catch { }` that swallows every exception type, including non-transient programmer errors such as `NullReferenceException` and `ArgumentOutOfRangeException` (see Core.Abstracti… |
| Core.Abstractions-006 | Low | Code organization & conventions | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs:63,84-86`, `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/Historian/IHistorianDataSource.cs:30,63` | The two history-read surfaces use inconsistent integer types for the same "maximum rows" concept. `IHistoryProvider.ReadRawAsync` and `IHistorianDataSource.ReadRawAsync` take `uint maxValuesPerNode`, but `ReadEventsAsync` (on both interfac… |
| Core.Abstractions-007 | Low | Testing coverage | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions.Tests/PollGroupEngineTests.cs` | `PollGroupEngine` is the only behavioural (non-DTO) type in the module and its tests, while solid for the happy paths, miss two paths that this review identifies as defect-prone: (a) no test exercises an array-valued tag whose contents are… |
| Core.Abstractions-008 | Low | Documentation & comments | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverHealth.cs:9`, `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs:39-43,65-69` | Two XML-doc inaccuracies: 1. `DriverHealth.LastError` is documented as "Most recent error message; null when state is Healthy." The `DriverState` enum also defines `Degraded`, `Reconnecting`, and `Faulted` states, all of which carry an err… |
| Core.AlarmHistorian-008 | Low | Performance & resource management | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:107-127,255-278` | Each `EnqueueAsync` (one per alarm transition — a hot path on a busy plant) opens a connection, runs `EnforceCapacity` (a `COUNT(*)` over the queue table on every single enqueue), serializes JSON, inserts, and closes the connection. The un… |
| Core.AlarmHistorian-011 | Low | Documentation & comments | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs:5-9,76`, `AlarmHistorianEvent.cs:20` | Several doc-comments reference the retired v1 architecture. The `IAlarmHistorianSink` summary says ingestion "routes through Galaxy.Host's pipe" and `IAlarmHistorianWriter` says "Stream G wires this to the Galaxy.Host IPC client", but `doc… |
| Core.ScriptedAlarms-003 | Low | Documentation & comments | `ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107` | `docs/ScriptedAlarms.md` (Composition step 3) and the `OnUpstreamChange` comment ("Fire-and-forget so driver-side dispatch isn't blocked", line 225-226) describe the `OnEvent` emission path as non-blocking / fire-and-forget. In the code, `… |
| Core.ScriptedAlarms-006 | Low | Concurrency & thread safety | `ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369` | `OnUpstreamChange` and `RunShelvingCheck` both launch fire-and-forget tasks (`_ = ReevaluateAsync(...)`, `_ = ShelvingCheckAsync(...)`) with `CancellationToken.None`. There is no tracking of these in-flight tasks, so `Dispose` cannot await… |
| Core.ScriptedAlarms-008 | Low | Performance & resource management | `Part9StateMachine.cs:261-268` | `AppendComment` copies the entire existing comment list into a new `List` on every audit-producing transition (ack, confirm, shelve, unshelve, enable, disable, add-comment, auto-unshelve). The `Comments` list is append-only and unbounded —… |
| Core.ScriptedAlarms-009 | Low | Performance & resource management | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` | `BuildReadCache` allocates a fresh `Dictionary<string, DataValueSnapshot>` on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently,… |
| Core.ScriptedAlarms-010 | Low | Design-document adherence | `ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47` | Quality handling is inconsistent across the three places that inspect a `DataValueSnapshot.StatusCode`. `AreInputsReady` (engine, line 333) treats only outright Bad (bit 31) as not-ready, so an Uncertain-quality input is fed to the predica… |
| Core.ScriptedAlarms-011 | Low | Code organization & conventions | `Part9StateMachine.cs:275` | `TransitionResult.NoOp(state, reason)` takes a `reason` string parameter that is documented in the calling code as a diagnostic ("disabled — predicate result ignored", "already acknowledged", etc.) but the factory method silently discards… |
| Core.Scripting-005 | Low | Correctness & logic bugs | `DependencyExtractor.cs:97` | A raw string literal token passed as the tag path (a raw triple-quote literal) tokenizes as `SingleLineRawStringLiteralToken` / `MultiLineRawStringLiteralToken`, not `StringLiteralToken`. The check `literal.Token.IsKind(SyntaxKind.StringLi… |
| Core.Scripting-006 | Low | Concurrency & thread safety | `CompiledScriptCache.cs:55` | On a failed compile the `catch` block calls `_cache.TryRemove(key, out _)` without a value comparison. If two threads race a miss for the same bad source, both observe the same faulted `Lazy` and throw, and both call `TryRemove(key)`. If a… |
| Core.Scripting-008 | Low | Performance & resource management | `CompiledScriptCache.cs:34`, `ScriptEvaluator.cs:34` | `CompiledScriptCache` has no capacity bound (acknowledged in the class remarks) and no eviction. Each cached `ScriptEvaluator` holds a Roslyn `ScriptRunner<T>` delegate, which keeps the dynamically emitted script assembly loaded for the pr… |
| Core.Scripting-009 | Low | Design-document adherence | `ForbiddenTypeAnalyzer.cs:45` | The Phase 7 plan decision #6 (`docs/v2/implementation/phase-7-scripting-and-alarming.md`) enumerates the forbidden surface as "No HttpClient / File / Process / reflection". `ForbiddenTypeAnalyzer` actually denies a broader set — `System.Th… |
| Core.Scripting-011 | Low | Testing coverage | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/` | Two source files have no direct test coverage: `ScriptContext` (`Deadband` static helper is exercised only indirectly through `ScriptSandboxTests`, and not for its boundary `tolerance` behaviour) and `ScriptSandbox.Build` itself (the `Argu… |
| Core.VirtualTags-004 | Low | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:349` | `CoerceResult`'s switch has a default arm (`_ => raw`) that returns the script's raw return value uncoerced for any `DriverDataType` not in the explicit list (e.g. an array type, Byte, or a future enum member). The resulting `DataValueSnap… |
| Core.VirtualTags-006 | Low | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:177-182`, `:395-401` | `Subscribe` does `_observers.GetOrAdd(path, _ => [])` then `lock (list) { list.Add(observer); }`. When `Unsub.Dispose` removes the last observer, the now-empty List is left in `_observers` and the dictionary entry is never removed. For a l… |
| Core.VirtualTags-007 | Low | Error handling & resilience | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/TimerTriggerScheduler.cs:58` | `Tick` calls `_engine.EvaluateOneAsync(p, _cts.Token).GetAwaiter().GetResult()`, blocking the `System.Threading.Timer` callback thread (a thread-pool thread) for the full duration of the evaluation. Because `EvaluateInternalAsync` serialis… |
| Core.VirtualTags-009 | Low | Performance & resource management | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/DependencyGraph.cs:64-65`, `:72-73` | `DirectDependencies` and `DirectDependents` allocate a fresh empty `HashSet<string>` on every call for an unregistered node. `DirectDependents` is called inside the `TopologicalSort` Kahn loop and the `CascadeAsync` DFS, so for a graph wit… |
| Core.VirtualTags-010 | Low | Documentation & comments | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/ITagUpstreamSource.cs:18`, `VirtualTagContext.cs:30`, `VirtualTagDefinition.cs:28` | Several XML docs reference component names that do not exist in the codebase. `ITagUpstreamSource` XML doc says the subscription path "feeds the engine's ChangeTriggerDispatcher" -- there is no ChangeTriggerDispatcher; the actual path is `… |
| Core.VirtualTags-011 | Low | Code organization & conventions | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:404-409` | `VirtualTagState` records a Writes set (the `ctx.SetVirtualTag` targets extracted by `DependencyExtractor`), but nothing in the engine reads it -- it is captured at `Load` and never used. Declared write targets are not validated against th… |
| Core.VirtualTags-013 | Low | Documentation & comments | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/DependencyGraph.cs:266-270` | `DependencyCycleException.BuildMessage` renders each cycle as `string.Join(" -> ", c) + " -> " + c[0]`, presenting the SCC member list as a traversable edge path that loops back to its first element. Tarjan's algorithm returns the members… |
| Driver.AbCip-007 | Low | OtOpcUa conventions | `AbCipDriver.cs` (whole file), `AbCipAlarmProjection.cs`, `LibplctagTagRuntime.cs` | `CLAUDE.md` Library Preferences mandate Serilog with a rolling daily file sink. The driver has no logging at all: no `ILogger`/Serilog dependency is injected or used. Failure paths instead swallow exceptions into the `_health` string (`Rea… |
| Driver.AbCip-011 | Low | Error handling & resilience | `AbCipDriver.cs:144-152`, `AbCipDriverOptions.cs:131-143` | `InitializeAsync` only starts probe loops when `_options.Probe.Enabled` is true AND `Probe.ProbeTagPath` is non-blank. When `Probe.Enabled` is true (the default) but `ProbeTagPath` is null (also the default; the doc comment says "PR 8 wire… |
| Driver.AbCip-012 | Low | Performance & resource management | `LibplctagTemplateReader.cs:15-35`, `AbCipDriver.cs:88-92` | `LibplctagTemplateReader` is created per `FetchUdtShapeAsync` call, and each call constructs a fresh libplctag `Tag` for the @udt pseudo-tag, initializes it (a CIP connection handshake), reads, and disposes it. There is no reuse of the `Ta… |
| Driver.AbCip-013 | Low | Design-document adherence | `AbCipDriverOptions.cs:70-73`, `PlcFamilies/AbCipPlcFamilyProfile.cs:13-19`, `LibplctagTagRuntime.cs:16-27` | `driver-specs.md` specifies the AB CIP per-device connection settings as discrete fields: Host, Path, PlcType, TimeoutMs, AllowPacking, ConnectionSize. The implementation instead collapses host + path into a single opaque ab:// URL string… |
| Driver.AbCip-015 | Low | Documentation & comments | `AbCipDriver.cs:9-11`, `PlcTagHandle.cs:23-27,53-58`, `AbCipTemplateCache.cs:12-15`, `IAbCipTagEnumerator.cs:6-11`, `AbCipDriverOptions.cs:21` | Numerous comments are stale relative to the commit under review. `AbCipDriver.cs:9-11` says the driver "Implements IDriver only for now" with capabilities shipping "in subsequent PRs (3-8)" while the class already implements all of them. `… |
| Driver.AbCip.Cli-003 | Low | Concurrency & thread safety | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli/Commands/SubscribeCommand.cs:50-56,60-61` | The `OnDataChange` handler writes change lines to `console.Output` (a `TextWriter`) from the driver's poll-engine callback thread, while the command's main flow concurrently writes the "Subscribed to ... Ctrl+C to stop." line on the CLI th… |
| Driver.AbCip.Cli-004 | Low | Error handling & resilience | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli/Commands/SubscribeCommand.cs:28,58`; `AbCipCommandBase.cs:26-34` | `--interval-ms` (`IntervalMs`) is taken verbatim and passed as `TimeSpan.FromMilliseconds(IntervalMs)` to `SubscribeAsync` with no validation. A zero or negative value produces a non-positive `TimeSpan`; the option description claims "Poll… |
| Driver.AbCip.Cli-005 | Low | Performance & resource management | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/DriverCommandBase.cs:51-59` | `ConfigureLogging` assigns a freshly created Serilog logger to the process-global `Log.Logger` but never calls `Log.CloseAndFlush()`. For a short-lived one-shot command (`probe`, `read`, `write`) the process exit flushes the console sink,… |
| Driver.AbCip.Cli-006 | Low | Design-document adherence | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli/AbCipCommandBase.cs:29-34` | `AbCipCommandBase` overrides the abstract `DriverCommandBase.Timeout` property with a getter derived from `TimeoutMs` and an empty `init` body (`init { /* driven by TimeoutMs */ }`). Because the override has no `[CommandOption]` attribute,… |
| Driver.AbCip.Cli-007 | Low | Testing coverage | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli.Tests/WriteCommandParseValueTests.cs` | The only test file covers `WriteCommand.ParseValue` and `ReadCommand.SynthesiseTagName` — both pure static helpers. There is no coverage for `AbCipCommandBase.BuildOptions` (the flag-to-`AbCipDriverOptions` mapping that all four commands d… |
| Driver.AbCip.Cli-008 | Low | Documentation & comments | `docs/Driver.AbCip.Cli.md:8-9` | `docs/Driver.AbCip.Cli.md` opens with "Second of four driver test-client CLIs (Modbus -> AB CIP -> AB Legacy -> S7 -> TwinCAT)." The count "four" contradicts the chain that follows it (five names) and contradicts `docs/DriverClis.md`, whic… |
| Driver.AbLegacy-005 | Low | OtOpcUa conventions | `AbLegacyDriver.cs` (whole file) | The driver uses no `ILogger`/Serilog at all. Probe-loop failures, runtime initialisation failures, libplctag non-zero statuses, and read/write exceptions are folded into `DriverHealth.Detail` strings but never logged. CLAUDE.md names Seril… |
| Driver.AbLegacy-011 | Low | Performance & resource management | `AbLegacyDriver.cs:440` | `Dispose()` is implemented as `DisposeAsync().AsTask().GetAwaiter().GetResult()` - sync-over-async. `ShutdownAsync` awaits `_poll.DisposeAsync()` (which completes synchronously) and does no other real async work, so a deadlock is unlikely… |
| Driver.AbLegacy-013 | Low | Code organization & conventions | `AbLegacyDriver.cs:340-345`, `AbLegacyDriver.cs:238-264` | Two minor organisational issues: 1. `ResolveHost` returns `_options.Devices.FirstOrDefault()?.HostAddress ?? DriverInstanceId` when the reference is unknown and no devices are configured. `DriverInstanceId` is not a host address (ab://...)… |
| Driver.AbLegacy.Cli-002 | Low | Correctness & logic bugs | `Commands/WriteCommand.cs:27-29`, `Program.cs:6-9` | The `--value` option help text states "booleans accept true/false/1/0", but `ParseBool` (`WriteCommand.cs:74-80`) and the error message also accept `on/off` and `yes/no`, and `DriverClis.md` documents the full `true/false/1/0/yes/no/on/off… |
| Driver.AbLegacy.Cli-003 | Low | Concurrency & thread safety | `Commands/SubscribeCommand.cs:47-53` | The `OnDataChange` handler calls `console.Output.WriteLine(line)` (the synchronous overload) directly from the `PollGroupEngine` poll thread. The poll engine raises change events from a background timer/loop thread, so two ticks that fire… |
| Driver.AbLegacy.Cli-004 | Low | Error handling & resilience | `Commands/ProbeCommand.cs:37-56`, `Commands/ReadCommand.cs:39-50`, `Commands/WriteCommand.cs:48-59`, `Commands/SubscribeCommand.cs:41-76` | Every command does `await using var driver = new AbLegacyDriver(...)` *and* an explicit `await driver.ShutdownAsync(...)` in the `finally`. `AbLegacyDriver` `DisposeAsync` itself calls `ShutdownAsync`, so the driver is shut down twice on t… |
| Driver.AbLegacy.Cli-005 | Low | Design-document adherence | `Commands/SubscribeCommand.cs:23-25`, `docs/Driver.AbLegacy.Cli.md:94-96` | The subscribe command interval option is `--interval-ms` (default 1000). `docs/Driver.AbLegacy.Cli.md` shows the subscribe example as `otopcua-ablegacy-cli subscribe ... -i 500`, which works because of the short alias `'i'`, but the doc ne… |
| Driver.AbLegacy.Cli-006 | Low | Code organization & conventions | `Commands/ProbeCommand.cs:20-22` | `ProbeCommand` declares its `--type` option with no short alias, while `ReadCommand`, `WriteCommand`, and `SubscribeCommand` all declare `--type` with the short alias `'t'`. `ProbeCommand` also gives `--address` the alias `'a'`, matching t… |
| Driver.AbLegacy.Cli-007 | Low | Testing coverage | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Cli.Tests/WriteCommandParseValueTests.cs` | The only test file in the CLI test project covers `WriteCommand.ParseValue` and `ReadCommand.SynthesiseTagName`. Two behaviours that are pure logic (testable without a device) are uncovered: (1) `AbLegacyCommandBase.BuildOptions` — that it… |
| Driver.Cli.Common-004 | Low | Error handling & resilience | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:68-70` | `FormatTable` calls `rows.Max(r => r.Tag.Length)` (and the same for the value and status columns) without guarding against empty input. When `tagNames` and `snapshots` are both empty (equal length, so the mismatch check at line 56 passes),… |
| Driver.Cli.Common-006 | Low | Documentation & comments | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:71`, `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/DriverCommandBase.cs:9` | Two minor doc inaccuracies. (1) The comment at `SnapshotFormatter.cs:71` states the "source-time column is fixed-width (ISO-8601 to ms) so no max-measurement needed" — true only when every snapshot has a non-null `SourceTimestampUtc`. `For… |
| Driver.FOCAS-007 | Low | Error handling & resilience | `FocasDriver.cs:140-148`, `FocasDriver.cs:478-484`, `FocasDriver.cs:529-533`, `FocasAlarmProjection.cs:61-63` | Numerous `try { ... } catch {}` blocks swallow every exception with no logging - `ShutdownAsync` (CTS cancel/dispose), `RecycleLoopAsync` (`DisposeClient`), `FixedTreeLoopAsync` transient catches, `ProbeLoopAsync`, and the alarm projection… |
| Driver.FOCAS-008 | Low | Performance & resource management | `FocasDriver.cs:201`, `FocasDriver.cs:253` | `ReadAsync` and `WriteAsync` call `FocasAddress.TryParse(def.Address)` on every operation, even though `InitializeAsync` already parsed and validated every tag address. On a subscription hot path (each poll tick re-enters `ReadAsync`) this… |
| Driver.FOCAS-009 | Low | Design-document adherence | `FocasDriverOptions.cs:110-115`, `FocasDriver.cs:468-486`, `FocasDriverFactoryExtensions.cs:75-80` | `FocasProbeOptions.Timeout` is parsed by the factory (`FocasProbeDto.TimeoutMs` to `FocasProbeOptions.Timeout`) but never consumed. `ProbeLoopAsync` calls `client.ProbeAsync(ct)` with only the probe-loop cancellation token; no per-probe ti… |
| Driver.FOCAS-010 | Low | Code organization & conventions | `IFocasClient.cs:210-227` (`FocasOpMode`), `FocasConstants.cs:42-78` (`FocasOperationMode`) | There are two parallel operation-mode-to-text mappings with divergent labels. `FocasOpMode.ToText` (used by the driver fixed-tree `OperationMode/ModeText` node) yields `"TJOG"`, `"TEACH_IN_HANDLE"`; `FocasOperationModeExtensions.ToText` (i… |
| Driver.FOCAS-011 | Low | Code organization & conventions | `IFocasClient.cs:275-287` (`FocasAlarmType`), `FocasAlarmProjection.cs:149-175` | `FocasAlarmType` declares its constants as `public const int`, but the only consumers - `FocasAlarmProjection.MapAlarmType(short type)` and `MapSeverity(short type)` - take a `short` and `switch` against these `int` constants. It compiles… |
| Driver.FOCAS.Cli-001 | Low | Error handling & resilience | `Commands/WriteCommand.cs:58-68` | `WriteCommand.ParseValue` parses the numeric `--value` types (`Byte`/`Int16`/`Int32`/`Float32`/`Float64`) with `sbyte.Parse` / `short.Parse` / etc. These throw raw `FormatException` or `OverflowException` for malformed or out-of-range inpu… |
| Driver.FOCAS.Cli-002 | Low | Concurrency & thread safety | `Commands/SubscribeCommand.cs:45-51` | The `subscribe` command attaches an `OnDataChange` handler that calls the synchronous `console.Output.WriteLine`. `OnDataChange` is raised from the driver's `PollGroupEngine` tick thread, while the command's main flow writes the "Subscribe… |
| Driver.FOCAS.Cli-003 | Low | Error handling & resilience | `FocasCommandBase.cs:19` (`CncPort`), `FocasCommandBase.cs:27` (`TimeoutMs`), `Commands/SubscribeCommand.cs:23` (`IntervalMs`) | The numeric command options `--cnc-port`, `--timeout-ms`, and `--interval-ms` are accepted without range validation. A zero or negative `--cnc-port` produces an invalid `focas://host:<n>` string; `--timeout-ms 0` yields a zero `TimeSpan` o… |
| Driver.FOCAS.Cli-004 | Low | Performance & resource management | `Commands/ProbeCommand.cs:37,54`; `Commands/ReadCommand.cs:37,46`; `Commands/WriteCommand.cs:45,54`; `Commands/SubscribeCommand.cs:39,73` | Every command declares `await using var driver = new FocasDriver(...)` |
| Driver.FOCAS.Cli-005 | Low | Design-document adherence | `Commands/WriteCommand.cs:50`, `Commands/ProbeCommand.cs:50` (via `SnapshotFormatter.FormatStatus`) | `docs/Driver.FOCAS.Cli.md` documents `BadDeviceFailure` and `BadCommunicationError` as the key diagnostic signals an operator reads off `probe` / `write` output ("A `BadCommunicationError` means ... `BadDeviceFailure` after a successful co… |
| Driver.Galaxy-005 | Low | OtOpcUa conventions | `Runtime/EventPump.cs:81-88` | The `BoundedChannelOptions` comment states "Newest-dropped policy: when full, the producer's TryWrite returns false ... We do this manually rather than relying on `BoundedChannelFullMode.DropWrite`" — but the option is then set to `FullMod… |
| Driver.Galaxy-010 | Low | Security | `GalaxyDriver.cs:311-341` | `ResolveApiKey` supports an `env:`/`file:` indirection and otherwise treats the config string as the literal API key ("Anything else — used as the literal API key. Convenient for dev"). `GalaxyGatewayOptions`' own XML doc claims "the API k… |
| Driver.Galaxy-012 | Low | Performance & resource management | `Runtime/SubscriptionRegistry.cs:65-67`, `GalaxyDriver.cs:538`, `GalaxyDriver.cs:675` | Several hot paths are O(n^2) per call. `SubscriptionRegistry.ResolveSubscribers` does `entry.Bindings.FirstOrDefault(b => b.ItemHandle == itemHandle)` — a linear scan of the whole binding list for every event dispatch; at 50k tags this is… |
| Driver.Galaxy-013 | Low | Design-document adherence | `GalaxyDriver.cs:14-27`, `GalaxyDriver.cs:374-382`, `Config/GalaxyDriverOptions.cs:84-86` | Multiple doc comments are stale relative to the shipped code. `GalaxyDriver`'s class summary still describes the file as "the project skeleton with `IDriver` bodies that wire to a future `IGalaxyGatewayClient` abstraction. Capability inter… |
| Driver.Historian.Wonderware-004 | Low | Correctness & logic bugs | `Backend/SdkAlarmHistorianWriteBackend.cs:198-201` | `ToHistorianEvent` only assigns `historianEvent.Id` when `Guid.TryParse(dto.EventId, ...)` succeeds. If `EventId` is not a parseable GUID (or is empty), `Id` stays `Guid.Empty` and the event is written to the historian with an all-zeros id… |
| Driver.Historian.Wonderware-005 | Low | Concurrency & thread safety | `Backend/HistorianDataSource.cs:124`, `:126-127` | `GetHealthSnapshot` reads `_activeProcessNode` and `_activeEventNode` inside `_healthLock`, but those two fields are written under `_connectionLock` / `_eventConnectionLock` (lines 183, 243, 209-210, 266-269) - a different lock. The health… |
| Driver.Historian.Wonderware-007 | Low | Error handling & resilience | `Ipc/PipeServer.cs:70-75` | When `VerifyCaller` rejects the peer SID, the server logs the reason and calls `_current.Disconnect()` with no `HelloAck` frame sent. The shared-secret-mismatch and major-version-mismatch paths below it both send a rejecting `HelloAck` so… |
| Driver.Historian.Wonderware-008 | Low | Error handling & resilience | `Backend/HistorianDataSource.cs:301-307`, `:374-380` | When `query.StartQuery` returns `false`, `ReadRawAsync` and `ReadAggregateAsync` call `HandleConnectionError()` and return an empty result list. A failed `StartQuery` is not necessarily a connection failure - it can be a bad tag name, an i… |
| Driver.Historian.Wonderware-010 | Low | Performance & resource management | `Backend/HistorianConfiguration.cs:32-36`, `Backend/HistorianDataSource.cs` (all read methods) | `HistorianConfiguration.RequestTimeoutSeconds` is documented as the "outer safety timeout applied to sync-over-async Historian operations" and is copied around (`SdkAlarmHistorianWriteBackend.CloneConfigWithServerName:346`), but it is neve… |
| Driver.Historian.Wonderware-011 | Low | Design-document adherence | `Backend/HistorianDataSource.cs:9-12`, `Backend/IHistorianDataSource.cs:9-11`, `Backend/HistorianSample.cs:7-9`, `Backend/HistorianConfiguration.cs:7-9` | Several XML doc comments reference the retired v1 architecture as if it were current: "inside Galaxy.Host", "the Proxy maps returned samples", "the Host returns these across the IPC boundary as `GalaxyDataValue`", "Populated from ... the P… |
| Driver.Historian.Wonderware-012 | Low | Testing coverage | `Backend/HistorianDataSource.cs`, `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests/` | The unit-test suite covers `HistorianQualityMapper`, `HistorianClusterEndpointPicker`, `SdkAlarmHistorianWriteBackend`, `AahClientManagedAlarmEventWriter`, the IPC round trip, and `Program` alarm-writer wiring. `HistorianDataSource` itself… |
| Driver.Historian.Wonderware.Client-003 | Low | Concurrency & thread safety | `WonderwareHistorianClient.cs:207`, `WonderwareHistorianClient.cs:132-150` | `_totalQueries` is mutated with `Interlocked.Increment` in `Invoke`, but read inside `GetHealthSnapshot` under `_healthLock`, and every other counter (`_totalSuccesses`, `_totalFailures`, `_consecutiveFailures`) is mutated only under `_hea… |
| Driver.Historian.Wonderware.Client-004 | Low | Concurrency & thread safety | `WonderwareHistorianClient.cs:203-267` | A sidecar-reported failure is recorded in two non-atomic steps under separate lock acquisitions: `Invoke` calls `RecordSuccess()` (line 211) and then the caller calls `ThrowIfFailed` which calls `ReclassifySuccessAsFailure()` (line 256), d… |
| Driver.Historian.Wonderware.Client-006 | Low | Error handling & resilience | `Internal/PipeChannel.cs:96-107`, `WonderwareHistorianClientOptions.cs:11-12` | `PipeChannel.InvokeAsync` retries exactly once on transport failure and otherwise propagates. The options expose `ReconnectInitialBackoff` and `ReconnectMaxBackoff` and `WonderwareHistorianClientOptions` documents them as exponential backo… |
| Driver.Historian.Wonderware.Client-008 | Low | Security | `ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.csproj:29-32` | The csproj suppresses two NuGet audit advisories (`GHSA-37gx-xxp4-5rgx`, `GHSA-w3x6-4m5h-cxqf`) for the `MessagePack` 2.5.187 dependency with no inline comment recording why the suppression is safe, who reviewed it, or when it should be re… |
| Driver.Historian.Wonderware.Client-010 | Low | Documentation & comments | `WonderwareHistorianClient.cs:355-361`, `WonderwareHistorianClient.cs:132-150` | Two doc/behaviour mismatches. (1) The `Dispose()` XML comment asserts the underlying channel async cleanup is non-blocking so the `GetAwaiter()/GetResult()` bridge is safe. `PipeChannel.DisposeAsync` calls `ResetTransport()`, which invokes… |
| Driver.Modbus-003 | Low | Concurrency & thread safety | `ModbusDriver.cs:59,188,241,259,266,726,745,759` | `_health` is a non-`volatile` reference field written from multiple threads (concurrent `ReadAsync` callers, the coalesced-read path, `WriteAsync` indirectly, and `ProbeLoopAsync`) and read by `GetHealth()`. Reference assignment is atomic… |
| Driver.Modbus-007 | Low | Design-document adherence | `ModbusDriver.cs:1392`, `ModbusDriverOptions.cs:74-80` | Two design-vs-code drifts. (1) `MapDataType` maps `Int64`/`UInt64` to `DriverDataType.Int32` with the inline comment "widening to Int32 loses precision; PR 25 adds Int64 to DriverDataType". The address-space node for a 64-bit Modbus tag is… |
| Driver.Modbus-008 | Low | Documentation & comments | `ModbusDriver.cs:411-417,700-703,737-744` | Stale/misleading comments. (1) The `<summary>` block at `ModbusDriver.cs:411-417` says auto-prohibited ranges are "Cleared by ReinitializeAsync ... or by an explicit re-probe API (not yet shipped)" — the re-probe loop has shipped (#151, `R… |
| Driver.Modbus-009 | Low | Correctness & logic bugs | `ModbusDriver.cs:1160-1167`, `ModbusTcpTransport.cs:94-95` | Two edge cases. (1) `RegisterCount` for `ModbusDataType.String` computes `(tag.StringLength + 1) / 2`; a tag configured with `StringLength = 0` yields a register count of 0, flowing into `ReadOneAsync` as `totalRegs = 0` and producing an F… |
| Driver.Modbus-010 | Low | Error handling & resilience | `ModbusDriver.cs:864-868`, `ModbusDriverOptions.cs:116-125` | When `WriteOnChangeOnly` is enabled and `IsRedundantWrite` returns true, `WriteAsync` returns `WriteResult(0u)` (Good) without touching the wire. The suppression baseline (`_lastWrittenByRef`) is only invalidated by a *read* that returns a… |
| Driver.Modbus-011 | Low | Code organization & conventions | `ModbusDriver.cs:23-43,89-97,408-432` | Field and member declarations are interleaved with methods throughout `ModbusDriver`. `ResolveHost` (a public method) is the first member of the class, followed by `BuildSlaveHostName`, then a block of fields; `_lastPublishedByRef`/`_lastW… |
| Driver.Modbus-012 | Low | Testing coverage | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Tests/` | The unit suite is broad (coalescing, bisection, auto-recovery, byte order, arrays, BCD, RMW, caps, multi-unit, probe, reconnect, subscription). Gaps relative to the findings above: (1) no test exercises concurrent multi-subscription publis… |
| Driver.Modbus.Addressing-006 | Low | Error handling & resilience | `ModbusAddressParser.cs:297-301` | `TryParseFamilyNative` catches only `ArgumentException` and `OverflowException`. The current helpers throw only those (including `ArgumentOutOfRangeException`, which derives from `ArgumentException`), so today it is correct. But the parser… |
| Driver.Modbus.Addressing-007 | Low | Design-document adherence | `ModbusDataType.cs:91-95`, `docs/v2/dl205.md` section Strings | `ModbusStringByteOrder` (HighByteFirst / LowByteFirst) is defined in this assembly and documented as the DL205 low-byte-first string-packing knob, but `ParsedModbusAddress` has no field for it and `ModbusAddressParser` never produces or co… |
| Driver.Modbus.Addressing-009 | Low | Documentation & comments | `ModbusModiconAddress.cs:55-64`, `ModbusModiconAddress.cs:104-110` | The comments on `ModbusModiconAddress.TryParse` are slightly inaccurate. The remark that 5-digit Modicon is always exactly 5 chars (40001..49999) and 6-digit is exactly 6 (400001..465536-shaped) implies the leading digit is always 4, but t… |
| Driver.Modbus.Cli-003 | Low | Correctness & logic bugs | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/ModbusCommandBase.cs:14-24` | `Port` (`int`) and `TimeoutMs` (`int`) accept any 32-bit value, including negatives and ports above 65535. `UnitId` is a `byte`, so it accepts 0-255 even though the option description and `docs/Driver.Modbus.Cli.md` both say the valid rang… |
| Driver.Modbus.Cli-004 | Low | Concurrency & thread safety | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/SubscribeCommand.cs:61-67` | The `OnDataChange` handler is invoked from the driver's `PollGroupEngine` background thread and calls `console.Output.WriteLine` synchronously. An exception thrown inside this handler (e.g. an `IOException` on a redirected or closed stdout… |
| Driver.Modbus.Cli-005 | Low | Error handling & resilience | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/ProbeCommand.cs:21-54`; `Commands/ReadCommand.cs:46-75`; `Commands/WriteCommand.cs:54-89` | All three commands call `ConfigureLogging()` then `console.RegisterCancellationHandler()`, but if the operator presses Ctrl+C before `InitializeAsync` completes, the resulting `OperationCanceledException` propagates out of `ExecuteAsync`… |
| Driver.Modbus.Cli-006 | Low | Error handling & resilience | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/ProbeCommand.cs:35-53` | `probe` reports `Health: {health.State}` from `GetHealth()`. After a successful `InitializeAsync` the driver sets state to `Healthy` regardless of whether the subsequent probe register read returns Good or a Bad status code. `ReadAsync` do… |
| Driver.Modbus.Cli-007 | Low | Design-document adherence | `docs/Driver.Modbus.Cli.md:124-156`; `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/ReadCommand.cs` | `docs/Driver.Modbus.Cli.md` devotes a whole "v2 addressing grammar" section to the industry-standard tag-address strings (`40001:F:CDAB`, `HR1:I`, `C100`, `V2000:F:CDAB`, etc.) and says "set the per-tag `addressString` field instead of the… |
| Driver.Modbus.Cli-008 | Low | Testing coverage | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli.Tests/` | The test project covers only the two pure-function seams: `ReadCommand.SynthesiseTagName` and `WriteCommand.ParseValue`. There is no coverage for `WriteCommand`'s read-only-region rejection (`Region is not (Coils or HoldingRegisters)`), no… |
| Driver.OpcUaClient-011 | Low | Documentation & comments | `OpcUaClientDriver.cs:783-784` | The comment on the isArray computation states "-1 = scalar; 1+ = array dimensions; 0 = one-dimensional array". This is inaccurate against OPC UA ValueRank semantics: -3 is ScalarOrOneDimension, -2 is Any, -1 is Scalar, and 0 is OneOrMoreDi… |
| Driver.OpcUaClient-014 | Low | Performance & resource management | `OpcUaClientDriver.cs:904`, `:1035` | `MonitoredItem.Notification += (mi, args) => ...` (and the alarm-event equivalent) attaches a closure-capturing lambda to each monitored item's event. The lambda is never detached. When UnsubscribeAsync removes a subscription it calls Subs… |
| Driver.S7-003 | Low | Correctness & logic bugs | `S7Driver.cs:172`, `S7Driver.cs:255` | ReadAsync and WriteAsync dereference fullReferences.Count / writes.Count with no null guard. A null argument throws NullReferenceException rather than ArgumentNullException, and the NRE escapes before the _gate is taken so it is not wrappe… |
| Driver.S7-005 | Low | OtOpcUa conventions | `S7Driver.cs:33`, `S7Driver.cs:433` | System.Collections.Concurrent.ConcurrentDictionary is written out with a fully-qualified namespace at the field declarations instead of a using System.Collections.Concurrent directive. ImplicitUsings is enabled and the rest of the codebase… |
| Driver.S7-009 | Low | Error handling & resilience | `S7Driver.cs:392` | The subscription poll loop never reflects sustained polling failure anywhere an operator can see it. PollLoopAsync swallows every non-cancellation exception with an empty catch and the comment claims "the health surface reflects it" - but… |
| Driver.S7-010 | Low | Performance & resource management | `S7Driver.cs:504` | Dispose() is implemented as DisposeAsync().AsTask().GetAwaiter().GetResult() - sync-over-async. Inside the generic host this is currently safe (no captured SynchronizationContext), but it is a known deadlock pattern. The only async work be… |
| Driver.S7-013 | Low | Code organization & conventions | `S7DriverOptions.cs:90`, `S7Driver.cs:300` | S7TagDefinition.StringLength is a public configured/JSON-bound parameter (default 254) but is dead: S7DataType.String reads and writes both throw NotSupportedException ("...land in a follow-up PR"), so StringLength is never consumed. Likew… |
| Driver.S7.Cli-004 | Low | Performance & resource management | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli/Commands/ProbeCommand.cs:36,53`, `Commands/ReadCommand.cs:45,54`, `Commands/WriteCommand.cs:51,60`, `Commands/SubscribeCommand.cs:39,73` | Every command declares the driver with `await using var driver = new S7Driver(...)` and *also* calls `await driver.ShutdownAsync(...)` in a `finally` block. `S7Driver.DisposeAsync` itself calls `ShutdownAsync`, so shutdown runs twice per c… |
| Driver.S7.Cli-005 | Low | Code organization & conventions | `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli.Tests/` | A stale directory `tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli.Tests/` exists containing only an `obj/` folder — no `.csproj`, no source. The real test project lives at `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli.Tests/`. The empty direct… |
| Driver.S7.Cli-006 | Low | Testing coverage | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli.Tests/WriteCommandParseValueTests.cs` | The only test file covers `WriteCommand.ParseValue` and `ReadCommand.SynthesiseTagName`. `S7CommandBase.BuildOptions` — which maps the host / port / CPU / rack / slot / timeout flags onto an `S7DriverOptions` and forces `Probe.Enabled = fa… |
| Driver.S7.Cli-007 | Low | Documentation & comments | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli/Commands/SubscribeCommand.cs:45-51` | The Modbus CLI `SubscribeCommand` carries an explanatory comment on the `OnDataChange` handler ("Route every data-change event to the CliFx console (not System.Console — the analyzer flags it + IConsole is the testable abstraction)"). The… |
| Driver.TwinCAT-004 | Low | Correctness & logic bugs | `TwinCATDataType.cs:24-27` | The inline comments for the IEC time types are inaccurate. TwinCAT `TIME` is a duration (32-bit, milliseconds) — not "ms since epoch of day". `DATE` is stored as seconds since 1970-01-01 (truncated to a day boundary), not "days since 1970-… |
| Driver.TwinCAT-006 | Low | OtOpcUa conventions | `TwinCATDriver.cs:406-411` | `ResolveHost` falls back to `DriverInstanceId` when there are no configured devices and the reference is unknown. `DriverInstanceId` is a logical config-DB identifier, not a host address; `IPerCallHostResolver` consumers expect a host key… |
| Driver.TwinCAT-014 | Low | Design-document adherence | `TwinCATDriverOptions.cs:41-43`, `TwinCATDriverOptions.cs:57-62`, `AdsTwinCATClient.cs:145` | Several drifts between the implemented config surface and `docs/v2/driver-specs.md` section 6. The spec connection-settings list has separate `Host` (IP), `AmsNetId`, and `AmsPort` fields; the implementation collapses these into a single `… |
| Driver.TwinCAT-015 | Low | Code organization & conventions | `TwinCATDriver.cs:431-432` | `Dispose()` runs `DisposeAsync().AsTask().GetAwaiter().GetResult()` — sync-over-async. `docs/v2/driver-stability.md` section Galaxy explicitly lists "sync-over-async on the OPC UA stack thread" among the four 2026-04-13 stability findings… |
| Driver.TwinCAT-016 | Low | Testing coverage | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Tests/` | Unit coverage exists for AMS-address parsing, symbol-path parsing, read/write, native notifications, symbol browse, and the capability surface. Gaps tied to the findings above: no test exercises `ReinitializeAsync` with a changed config (D… |
| Driver.TwinCAT.Cli-001 | Low | Correctness & logic bugs | `TwinCATCommandBase.cs:23-24`, `Commands/SubscribeCommand.cs:23-24`, `Commands/BrowseCommand.cs:21-24` | Numeric command options are accepted without range validation. `--timeout-ms` feeds `Timeout => TimeSpan.FromMilliseconds(TimeoutMs)`; passing `--timeout-ms 0` or a negative value yields `TimeSpan.Zero`/a negative `TimeSpan`, which is then… |
| Driver.TwinCAT.Cli-002 | Low | Concurrency & thread safety | `Commands/SubscribeCommand.cs:46-58` | The `OnDataChange` handler calls `console.Output.WriteLine(line)` synchronously. In native ADS-notification mode the event is raised from the `Beckhoff.TwinCAT.Ads` notification callback thread (see `TwinCATDriver.SubscribeAsync`, which in… |
| Driver.TwinCAT.Cli-003 | Low | Error handling & resilience | `Commands/SubscribeCommand.cs:56-58` | The subscribe banner reports the mechanism purely from the `--poll-only` flag (`var mode = PollOnly ? "polling" : "ADS notification"`). The doc (`docs/Driver.TwinCAT.Cli.md`) states the banner "announces which mechanism is in play". The CL… |
| Driver.TwinCAT.Cli-004 | Low | Design-document adherence | `TwinCATCommandBase.cs:26-29`, `Commands/BrowseCommand.cs` | `--poll-only` is declared on `TwinCATCommandBase`, so it is inherited by `browse`. `BrowseCommand` only ever calls `DiscoverAsync` — it never subscribes — so `UseNativeNotifications = !PollOnly` has no observable effect on a browse run. Th… |
| Driver.TwinCAT.Cli-005 | Low | Code organization & conventions | `Commands/ProbeCommand.cs:23`, `Commands/ReadCommand.cs:20`, `Commands/WriteCommand.cs:20`, `Commands/SubscribeCommand.cs:18` | The `--type` option is declared with the short alias `-t` on `read`, `write`, and `subscribe`, but `ProbeCommand` declares `[CommandOption("type", ...)]` with no short alias. An operator who has internalised `-t` from the other three verbs… |
| Driver.TwinCAT.Cli-006 | Low | Testing coverage | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Cli.Tests/WriteCommandParseValueTests.cs` | The only test file covers `WriteCommand.ParseValue` and `ReadCommand.SynthesiseTagName`. Other deterministic, router-independent logic is untested: `TwinCATCommandBase.Gateway` (the `ads://{netId}:{port}` string the driver's `TwinCATAmsAdd… |
| Driver.TwinCAT.Cli-007 | Low | Documentation & comments | `TwinCATCommandBase.cs:31-36` | The `Timeout` override has an empty `init` accessor with the comment `/* driven by TimeoutMs */`. Because the base `DriverCommandBase.Timeout` is declared `abstract { get; init; }`, the override must supply an `init`, but here it silently… |
| Server-004 | Low | OtOpcUa conventions | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:187-200` | `RoleBasedIdentity` declares its own `Display` property, but the base `UserIdentity` already has a settable `DisplayName`. `DriverNodeManager.ResolveCallUser`/`RouteScriptedAlarmMethodCalls` read the base `DisplayName`, never `Display`. Si… |
| Server-006 | Low | Concurrency & thread safety | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:478-482, 1342-1348` | `OnReadValue`/`OnWriteValue` are synchronous stack hooks that block on async driver calls via `.GetAwaiter().GetResult()` with `CancellationToken.None`. With `MaxRequestThreadCount = 100`, a burst of reads/writes into a stalled driver pins… |
| Server-008 | Low | Error handling & resilience | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:736` | `RouteScriptedAlarmMethodCalls` marks a handled slot by setting `errors[i] = ServiceResult.Good`, assuming `base.Call` skips non-null *Good* error slots. The stack and `GateCallMethodRequests` only ever pre-populate *Bad* slots; the skip-o… |
| Server-012 | Low | Performance & resource management | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Hosting/PeerHttpProbeLoop.cs:78-79` | `ProbeAsync` creates an `IHttpClientFactory` client and mutates `client.Timeout` on every 2-second probe tick. The timeout belongs on the request or on the named-client registration, not set per call on a factory-vended instance. |
| Server-014 | Low | Code organization & conventions | `src/Server/ZB.MOM.WW.OtOpcUa.Server/SealedBootstrap.cs` | `SealedBootstrap` claims in its xml-doc to "close release blocker #2" by consuming the generation-sealed cache + resilient reader + stale-config flag, but `Program.cs` registers and uses `NodeBootstrap` instead. `SealedBootstrap` is never… |
| Server-015 | Low | Documentation & comments | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:16-21`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:21-26` | `OtOpcUaServer`'s class doc still says "PR 16 minimum-viable scope ... no security ... LDAP + security profiles are deferred." `OpcUaServerOptions`'s says "PR 17 minimum-viable scope: no LDAP, no security profiles beyond None." Both are st… |
## Closed findings
Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| ID | Severity | Status | Category | Location |
|---|---|---|---|---|
| Admin-001 | Critical | Resolved | Security | `Components/Routes.razor:4-11`, `Program.cs:150` |
| Admin-002 | Critical | Resolved | Security | `Components/Pages/Clusters/NewCluster.razor:1-7`, `Home.razor`, `Fleet.razor`, `Hosts.razor`, `AlarmsHistorian.razor`, `Clusters/ClustersList.razor`, `Clusters/Generations.razor`, `Drivers/FocasDetail.razor` |
| Core.AlarmHistorian-001 | Critical | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:255-278` |
| Core.Scripting-001 | Critical | Resolved | Security | `ForbiddenTypeAnalyzer.cs:45`, `ScriptSandbox.cs:54` |
| Driver.Galaxy-001 | Critical | Resolved | Error handling & resilience | `Runtime/EventPump.cs:128`, `GalaxyDriver.cs:222` |
| Server-001 | Critical | Resolved | Correctness & logic bugs | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:1791` |
| Admin-003 | High | Resolved | Security | `Program.cs:137-139`, `Hubs/FleetStatusHub.cs:11`, `Hubs/AlertHub.cs:10`, `Hubs/ScriptLogHub.cs:30` |
| Admin-004 | High | Resolved | Security | `appsettings.json:3,13-14` |
| Admin-005 | High | Resolved | Correctness & logic bugs | `Components/Pages/Login.razor:15,107-110` |
| Admin-013 | High | Resolved | Error handling & resilience | `Components/Pages/Clusters/ClusterDetail.razor:180-197`, `Components/Pages/Clusters/AclsTab.razor`, `Components/Pages/Clusters/RedundancyTab.razor`, `Components/Pages/RoleGrants.razor`, `Components/Pages/Hosts.razor`, `Components/Pages/ScriptLog.razor`, `Program.cs:157-159` |
| Client.Shared-005 | High | Resolved | Concurrency & thread safety | `OpcUaClientService.cs:19`, `OpcUaClientService.cs:226-249`, `OpcUaClientService.cs:499-521` |
| Client.Shared-006 | High | Resolved | Concurrency & thread safety | `OpcUaClientService.cs:97-100`, `OpcUaClientService.cs:432-497` |
| Configuration-001 | High | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/20260417215224_StoredProcedures.cs:282` |
| Configuration-008 | High | Resolved | Security | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/20260417215224_StoredProcedures.cs:150`, `:373`, `:468` |
| Core-001 | High | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/UserAuthorizationState.cs:50-68` |
| Core-002 | High | Resolved | Security | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs:24-50` |
| Core.AlarmHistorian-002 | High | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:99-105,386-388` |
| Core.AlarmHistorian-004 | High | Resolved | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:90,112,176,259` |
| Core.AlarmHistorian-006 | High | Resolved | Error handling & resilience | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:103,135-216` |
| Core.ScriptedAlarms-001 | High | Resolved | Concurrency & thread safety | `ScriptedAlarmEngine.cs:175`, `ScriptedAlarmEngine.cs:178`, `ScriptedAlarmEngine.cs:73`, `ScriptedAlarmEngine.cs:368` |
| Core.Scripting-002 | High | Resolved | Security | `ForbiddenTypeAnalyzer.cs:70` |
| Core.VirtualTags-001 | High | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:306` |
| Driver.AbCip-001 | High | Resolved | Correctness & logic bugs | `AbCipDriver.cs:111`, `AbCipDriver.cs:163-167` |
| Driver.AbCip-002 | High | Resolved | Correctness & logic bugs | `AbCipStatusMapper.cs:65-78` |
| Driver.AbCip-003 | High | Resolved | Correctness & logic bugs | `AbCipUdtMemberLayout.cs:32-54`, `AbCipDriver.cs:426-430`, `AbCipUdtReadPlanner.cs:48` |
| Driver.AbCip-008 | High | Resolved | Concurrency & thread safety | `AbCipDriver.cs:144-152`, `AbCipDriver.cs:169-183`, `AbCipDriver.cs:235-281` |
| Driver.AbLegacy-001 | High | Resolved | Correctness & logic bugs | `AbLegacyAddress.cs:54`, `AbLegacyDriver.cs:368-374` |
| Driver.AbLegacy-006 | High | Resolved | Concurrency & thread safety | `AbLegacyDriver.cs:107-158`, `AbLegacyDriver.cs:162-234`, `LibplctagLegacyTagRuntime.cs` |
| Driver.Cli.Common-001 | High | Resolved | Correctness & logic bugs | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:106-119` |
| Driver.FOCAS-001 | High | Resolved | Correctness & logic bugs | `FocasDriverFactoryExtensions.cs:54-86`, `FocasDriverFactoryExtensions.cs:132-140` |
| Driver.FOCAS-002 | High | Resolved | Correctness & logic bugs | `WireFocasClient.cs:164-179`, `FocasDriver.cs:513`, `FocasDriver.cs:593` |
| Driver.Galaxy-002 | High | Resolved | Correctness & logic bugs | `Browse/DataTypeMap.cs:13`, `Runtime/MxValueDecoder.cs:9` |
| Driver.Galaxy-008 | High | Resolved | Error handling & resilience | `GalaxyDriver.cs:264-276`, `Runtime/EventPump.cs:97-103` |
| Driver.Historian.Wonderware-001 | High | Resolved | Correctness & logic bugs | `Backend/SdkAlarmHistorianWriteBackend.cs:68`, `Backend/AahClientManagedAlarmEventWriter.cs:82-103` |
| Driver.Historian.Wonderware.Client-001 | High | Resolved | Correctness & logic bugs | `WonderwareHistorianClient.cs:98-113` |
| Driver.Modbus-001 | High | Resolved | Concurrency & thread safety | `ModbusDriver.cs:92,99-122` |
| Driver.Modbus.Addressing-001 | High | Resolved | Correctness & logic bugs | `ModbusAddressParser.cs:230-235`, `DirectLogicAddress.cs:66-73` |
| Driver.OpcUaClient-001 | High | Resolved | Correctness & logic bugs | `OpcUaClientDriver.cs:444`, `:466`, `:517`, `:540`, `:599`, `:610` |
| Driver.OpcUaClient-002 | High | Resolved | Error handling & resilience | `OpcUaClientDriver.cs:1330-1359` |
| Driver.OpcUaClient-003 | High | Resolved | Correctness & logic bugs | `OpcUaClientDriver.cs:644-711` |
| Driver.OpcUaClient-004 | High | Resolved | Design-document adherence | `OpcUaClientDriver.cs:596-632`, `:789`, `OpcUaClientDriverOptions.cs` |
| Driver.OpcUaClient-005 | High | Resolved | Concurrency & thread safety | `OpcUaClientDriver.cs:1297-1319` |
| Driver.S7-001 | High | Resolved | Correctness & logic bugs | `S7AddressParser.cs:93`, `S7Driver.cs:231` |
| Driver.S7-006 | High | Resolved | Concurrency & thread safety | `S7Driver.cs:140`, `S7Driver.cs:457`, `S7Driver.cs:506` |
| Driver.S7-007 | High | Resolved | Error handling & resilience | `S7Driver.cs:200`, `S7DriverOptions.cs:13`, `docs/v2/driver-specs.md:434` |
| Driver.S7-011 | High | Resolved | Design-document adherence | `S7Driver.cs:82`, `S7Driver.cs:134`, `IDriver.cs:24` |
| Driver.TwinCAT-001 | High | Resolved | Correctness & logic bugs | `TwinCATDriver.cs:41-78` |
| Driver.TwinCAT-002 | High | Resolved | Correctness & logic bugs | `TwinCATDataType.cs:34-48`, `AdsTwinCATClient.cs:264-281` |
| Driver.TwinCAT-007 | High | Resolved | Concurrency & thread safety | `TwinCATDriver.cs:413-429` |
| Driver.TwinCAT-008 | High | Resolved | Concurrency & thread safety | `AdsTwinCATClient.cs:162-169`, `TwinCATDriver.cs:319-324` |
| Driver.TwinCAT-013 | High | Resolved | Design-document adherence | `TwinCATDriver.cs:11-12` (capability list), whole file |
| Server-002 | High | Resolved | Correctness & logic bugs | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs:60-63` |
| Server-009 | High | Resolved | Security | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/LdapOptions.cs:44`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:74` |
| Admin-006 | Medium | Resolved | Security | `Components/Layout/MainLayout.razor:47-49`, `Program.cs:129,131-135` |
| Admin-007 | Medium | Resolved | Design-document adherence | `Components/Pages/Clusters/NewCluster.razor:91,95-96` |
| Admin-008 | Medium | Resolved | Error handling & resilience | `Services/ReservationService.cs:28-37` |
| Admin-009 | Medium | Resolved | Testing coverage | `src/Server/ZB.MOM.WW.OtOpcUa.Admin` (whole module) |
| Analyzers-001 | Medium | Resolved | Correctness & logic bugs | `src/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs:135-139` |
| Analyzers-006 | Medium | Resolved | Testing coverage | `tests/Tooling/ZB.MOM.WW.OtOpcUa.Analyzers.Tests/UnwrappedCapabilityCallAnalyzerTests.cs` |
| Client.CLI-001 | Medium | Resolved | Correctness & logic bugs | `Commands/HistoryReadCommand.cs:73`, `Commands/HistoryReadCommand.cs:76` |
| Client.CLI-005 | Medium | Resolved | Concurrency & thread safety | `Commands/SubscribeCommand.cs:66-78`, `Commands/AlarmsCommand.cs:52-64` |
| Client.Shared-001 | Medium | Resolved | Correctness & logic bugs | `OpcUaClientService.cs:552` |
| Client.Shared-002 | Medium | Resolved | Correctness & logic bugs | `OpcUaClientService.cs:351-355`, `OpcUaClientService.cs:373` |
| Client.Shared-007 | Medium | Resolved | Concurrency & thread safety | `OpcUaClientService.cs:581-622` |
| Client.Shared-008 | Medium | Resolved | Error handling & resilience | `OpcUaClientService.cs:170-180`, `Helpers/ValueConverter.cs:15-31` |
| Client.UI-001 | Medium | Resolved | Correctness & logic bugs | `ViewModels/HistoryViewModel.cs:76`, `ViewModels/HistoryViewModel.cs:77` |
| Client.UI-002 | Medium | Resolved | Correctness & logic bugs | `ViewModels/MainWindowViewModel.cs:255`, `ViewModels/MainWindowViewModel.cs:333` |
| Client.UI-005 | Medium | Resolved | Concurrency & thread safety | `ViewModels/MainWindowViewModel.cs:286-304`, `ViewModels/MainWindowViewModel.cs:155-189` |
| Client.UI-007 | Medium | Resolved | Security | `Services/UserSettings.cs:22-23`, `Services/JsonSettingsService.cs:38-50`, `ViewModels/MainWindowViewModel.cs:393-408` |
| Client.UI-008 | Medium | Resolved | Performance & resource management | `ViewModels/MainWindowViewModel.cs:18`, `ViewModels/MainWindowViewModel.cs:125-148`, `App.axaml.cs:18-32` |
| Configuration-002 | Medium | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/20260417215224_StoredProcedures.cs:325` |
| Configuration-003 | Medium | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs:73` |
| Configuration-006 | Medium | Resolved | Error handling & resilience | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/ResilientConfigReader.cs:79` |
| Configuration-009 | Medium | Resolved | Security | `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/DesignTimeDbContextFactory.cs:14` |
| Core-003 | Medium | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrie.cs:80-98` |
| Core-005 | Medium | Resolved | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs:59-70` |
| Core-006 | Medium | Resolved | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs:42-64` |
| Core-007 | Medium | Resolved | Error handling & resilience | `src/Core/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs:75-83` |
| Core.Abstractions-001 | Medium | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/PollGroupEngine.cs:112` |
| Core.Abstractions-002 | Medium | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/PollGroupEngine.cs:105-109` |
| Core.Abstractions-003 | Medium | Resolved | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/PollGroupEngine.cs:64,121-130` |
| Core.AlarmHistorian-003 | Medium | Resolved | OtOpcUa conventions | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:107-127,218-243,246-253` |
| Core.AlarmHistorian-005 | Medium | Resolved | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:66-71,141-143,199,386-388` |
| Core.AlarmHistorian-007 | Medium | Resolved | Error handling & resilience | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:172-174` |
| Core.AlarmHistorian-009 | Medium | Resolved | Design-document adherence | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:317-347` |
| Core.AlarmHistorian-010 | Medium | Resolved | Testing coverage | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian.Tests/SqliteStoreAndForwardSinkTests.cs` |
| Core.ScriptedAlarms-002 | Medium | Resolved | Correctness & logic bugs | `ScriptedAlarmEngine.cs:162`, `ScriptedAlarmEngine.cs:90` |
| Core.ScriptedAlarms-004 | Medium | Resolved | Concurrency & thread safety | `ScriptedAlarmEngine.cs:138-143`, `ScriptedAlarmEngine.cs:227-234` |
| Core.ScriptedAlarms-005 | Medium | Resolved | Concurrency & thread safety | `ScriptedAlarmEngine.cs:365-369`, `ScriptedAlarmEngine.cs:416-424` |
| Core.ScriptedAlarms-007 | Medium | Resolved | Error handling & resilience | `ScriptedAlarmEngine.cs:216`, `ScriptedAlarmEngine.cs:251`, `ScriptedAlarmEngine.cs:154`, `ScriptedAlarmEngine.cs:387` |
| Core.ScriptedAlarms-012 | Medium | Resolved | Testing coverage | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms.Tests/ScriptedAlarmEngineTests.cs` |
| Core.Scripting-003 | Medium | Resolved | Security | `TimedScriptEvaluator.cs:9`, `ScriptSandbox.cs:30` |
| Core.Scripting-004 | Medium | Resolved | Correctness & logic bugs | `DependencyExtractor.cs:73` |
| Core.Scripting-007 | Medium | Resolved | Error handling & resilience | `TimedScriptEvaluator.cs:60` |
| Core.Scripting-010 | Medium | Resolved | Testing coverage | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/ScriptSandboxTests.cs:54` |
| Core.VirtualTags-002 | Medium | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:237` |
| Core.VirtualTags-003 | Medium | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:117-120` |
| Core.VirtualTags-005 | Medium | Resolved | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagSource.cs:50-64` |
| Core.VirtualTags-008 | Medium | Resolved | Performance & resource management | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/DependencyGraph.cs:81-115` |
| Core.VirtualTags-012 | Medium | Resolved | Testing coverage | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags.Tests/` |
| Driver.AbCip-004 | Medium | Resolved | Correctness & logic bugs | `AbCipDataType.cs:51-58`, `LibplctagTagRuntime.cs:47-49,53` |
| Driver.AbCip-005 | Medium | Resolved | Correctness & logic bugs | `AbCipDriver.cs:124-141` |
| Driver.AbCip-006 | Medium | Resolved | OtOpcUa conventions | `PlcTagHandle.cs:28-59`, `AbCipDriver.cs:806-807,832-833`, `LibplctagTagRuntime.cs:117` |
| Driver.AbCip-009 | Medium | Resolved | Concurrency & thread safety | `AbCipDriver.cs:621-648`, `AbCipDriver.cs:591-614` |
| Driver.AbCip-010 | Medium | Resolved | Error handling & resilience | `AbCipDriver.cs:621-648`, `AbCipDriver.cs:346-391` |
| Driver.AbCip-014 | Medium | Resolved | Testing coverage | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipStatusMapperTests.cs:28-40` |
| Driver.AbCip.Cli-001 | Medium | Resolved | Error handling & resilience | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli/Commands/WriteCommand.cs:70-85` |
| Driver.AbCip.Cli-002 | Medium | Resolved | Correctness & logic bugs | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Cli/Commands/ProbeCommand.cs:21-23`; `Commands/ReadCommand.cs:24-25`; `Commands/SubscribeCommand.cs:20-22` |
| Driver.AbLegacy-002 | Medium | Resolved | Correctness & logic bugs | `AbLegacyDriver.cs:368` |
| Driver.AbLegacy-003 | Medium | Resolved | Correctness & logic bugs | `AbLegacyAddress.cs:62-95` |
| Driver.AbLegacy-004 | Medium | Resolved | Correctness & logic bugs | `LibplctagLegacyTagRuntime.cs:36-37` |
| Driver.AbLegacy-007 | Medium | Resolved | Concurrency & thread safety | `AbLegacyDriver.cs:411-438`, `AbLegacyDriver.cs:386-409` |
| Driver.AbLegacy-008 | Medium | Resolved | Concurrency & thread safety | `AbLegacyDriver.cs:21`, `AbLegacyDriver.cs:138-146`, `AbLegacyDriver.cs:216-229` |
| Driver.AbLegacy-009 | Medium | Resolved | Error handling & resilience | `AbLegacyDriver.cs:41-74` |
| Driver.AbLegacy-010 | Medium | Resolved | Error handling & resilience | `AbLegacyStatusMapper.cs:26-56` |
| Driver.AbLegacy-012 | Medium | Resolved | Design-document adherence | `PlcFamilies/AbLegacyPlcFamilyProfile.cs:7-54`, `AbLegacyDriver.cs:48-52` |
| Driver.AbLegacy.Cli-001 | Medium | Resolved | Error handling & resilience | `Commands/WriteCommand.cs:46`, `Commands/WriteCommand.cs:62-72` |
| Driver.Cli.Common-002 | Medium | Resolved | Correctness & logic bugs | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:101-122` |
| Driver.Cli.Common-003 | Medium | Resolved | Concurrency & thread safety | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/DriverCommandBase.cs:51-59` |
| Driver.Cli.Common-005 | Medium | Resolved | Testing coverage | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common.Tests/SnapshotFormatterTests.cs:27-37` |
| Driver.FOCAS-003 | Medium | Resolved | Correctness & logic bugs | `FocasDriver.cs:71-79` |
| Driver.FOCAS-004 | Medium | Resolved | OtOpcUa conventions | `FocasDriver.cs:374-379`, `WireFocasClient.cs:48-50` |
| Driver.FOCAS-005 | Medium | Resolved | Concurrency & thread safety | `FocasDriver.cs:28`, `FocasDriver.cs:206-215`, `FocasDriver.cs:261`, `FocasDriver.cs:274` |
| Driver.FOCAS-006 | Medium | Resolved | Error handling & resilience | `FocasDriver.cs:859-874`, `WireFocasClient.cs:22-31` |
| Driver.FOCAS-012 | Medium | Resolved | Testing coverage | `FocasDriverFactoryExtensions.cs`, `FocasDriver.cs:495-629` (`FixedTreeLoopAsync`) |
| Driver.Galaxy-003 | Medium | Resolved | Correctness & logic bugs | `Runtime/StatusCodeMap.cs:86` |
| Driver.Galaxy-004 | Medium | Resolved | Correctness & logic bugs | `GalaxyDriver.cs:901` |
| Driver.Galaxy-006 | Medium | Resolved | Concurrency & thread safety | `GalaxyDriver.cs:848-861` |
| Driver.Galaxy-007 | Medium | Resolved | Concurrency & thread safety | `GalaxyDriver.cs:937-968` |
| Driver.Galaxy-009 | Medium | Resolved | Error handling & resilience | `GalaxyDriver.cs:354-371` |
| Driver.Galaxy-011 | Medium | Resolved | Performance & resource management | `GalaxyDriver.cs:411` |
| Driver.Galaxy-014 | Medium | Resolved | Testing coverage | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` (module-wide) |
| Driver.Historian.Wonderware-002 | Medium | Resolved | Correctness & logic bugs | `Ipc/HistorianFrameHandler.cs:162`, `:181` |
| Driver.Historian.Wonderware-003 | Medium | Resolved | Correctness & logic bugs | `Backend/HistorianDataSource.cs:320-323`, `:457-460` |
| Driver.Historian.Wonderware-006 | Medium | Resolved | Error handling & resilience | `Ipc/PipeServer.cs:120-128` |
| Driver.Historian.Wonderware-009 | Medium | Resolved | Performance & resource management | `Backend/HistorianDataSource.cs:382-395`, `Ipc/Contracts.cs:85-99` |
| Driver.Historian.Wonderware.Client-002 | Medium | Resolved | Correctness & logic bugs | `WonderwareHistorianClient.cs:154-199`, `IAlarmHistorianSink.cs:66-74` |
| Driver.Historian.Wonderware.Client-005 | Medium | Resolved | Error handling & resilience | `Ipc/FrameReader.cs:31-32` |
| Driver.Historian.Wonderware.Client-007 | Medium | Resolved | Security | `WonderwareHistorianClient.cs:276` |
| Driver.Historian.Wonderware.Client-009 | Medium | Resolved | Testing coverage | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.Tests/WonderwareHistorianClientTests.cs` |
| Driver.Modbus-002 | Medium | Resolved | Correctness & logic bugs | `ModbusDriver.cs:127-186` |
| Driver.Modbus-004 | Medium | Resolved | Performance & resource management | `ModbusDriver.cs:1468-1473` |
| Driver.Modbus-005 | Medium | Resolved | Correctness & logic bugs | `ModbusDriver.cs:777-798,323-330` |
| Driver.Modbus-006 | Medium | Resolved | Error handling & resilience | `ModbusDriver.cs:514-524,532-550` |
| Driver.Modbus.Addressing-002 | Medium | Resolved | Correctness & logic bugs | `ModbusAddressParser.cs:86-94` |
| Driver.Modbus.Addressing-003 | Medium | Resolved | Correctness & logic bugs | `ModbusAddressParser.cs:405-406`, `ModbusAddressParser.cs:128` |
| Driver.Modbus.Addressing-004 | Medium | Resolved | Correctness & logic bugs | `ModbusAddressParser.cs:182-194` |
| Driver.Modbus.Addressing-005 | Medium | Resolved | Error handling & resilience | `ModbusAddressParser.cs:200-213` |
| Driver.Modbus.Addressing-008 | Medium | Resolved | Testing coverage | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Addressing.Tests/` |
| Driver.Modbus.Cli-001 | Medium | Resolved | Correctness & logic bugs | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/SubscribeCommand.cs:43-51` |
| Driver.Modbus.Cli-002 | Medium | Resolved | Correctness & logic bugs | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Cli/Commands/WriteCommand.cs:54-89` |
| Driver.OpcUaClient-006 | Medium | Resolved | Concurrency & thread safety | `OpcUaClientDriver.cs:1330-1359` |
| Driver.OpcUaClient-007 | Medium | Resolved | Concurrency & thread safety | `OpcUaClientDriver.cs:1374`, `:1376-1383`, `:508` |
| Driver.OpcUaClient-008 | Medium | Resolved | Error handling & resilience | `OpcUaClientDriver.cs:1092-1099` |
| Driver.OpcUaClient-009 | Medium | Resolved | Error handling & resilience | `OpcUaClientDriver.cs:560-564` |
| Driver.OpcUaClient-010 | Medium | Resolved | Correctness & logic bugs | `OpcUaClientDriver.cs:823-824` |
| Driver.OpcUaClient-012 | Medium | Resolved | Security | `OpcUaClientDriver.cs:210-217` |
| Driver.OpcUaClient-013 | Medium | Resolved | Performance & resource management | `OpcUaClientDriver.cs:436-437` |
| Driver.OpcUaClient-015 | Medium | Resolved | Testing coverage | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/*`, `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests/OpcUaClientSmokeTests.cs` |
| Driver.S7-002 | Medium | Resolved | Correctness & logic bugs | `S7Driver.cs:350` |
| Driver.S7-004 | Medium | Resolved | OtOpcUa conventions | `S7Driver.cs` (whole file) |
| Driver.S7-008 | Medium | Resolved | Error handling & resilience | `S7Driver.cs:286` |
| Driver.S7-012 | Medium | Resolved | Design-document adherence | `S7DriverOptions.cs:59`, `S7Driver.cs:457` |
| Driver.S7-014 | Medium | Resolved | Testing coverage | `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/` |
| Driver.S7.Cli-001 | Medium | Resolved | Error handling & resilience | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli/Commands/WriteCommand.cs:65-80` |
| Driver.S7.Cli-002 | Medium | Resolved | Design-document adherence | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli/Commands/ReadCommand.cs:22-29`, `Commands/WriteCommand.cs:21-33`, `Commands/SubscribeCommand.cs:18-21`; `docs/Driver.S7.Cli.md:70-73,80-81` |
| Driver.S7.Cli-003 | Medium | Resolved | Error handling & resilience | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.S7.Cli/Commands/ProbeCommand.cs:38-50` |
| Driver.TwinCAT-003 | Medium | Resolved | Correctness & logic bugs | `AdsTwinCATClient.cs:264-281`, `283-300` |
| Driver.TwinCAT-005 | Medium | Resolved | OtOpcUa conventions | `TwinCATDriver.cs` (whole file), `AdsTwinCATClient.cs` (whole file) |
| Driver.TwinCAT-009 | Medium | Resolved | Concurrency & thread safety | `TwinCATDriver.cs:80-99`, `41-72`, `366-388` |
| Driver.TwinCAT-010 | Medium | Resolved | Error handling & resilience | `AdsTwinCATClient.cs:178-195` |
| Driver.TwinCAT-011 | Medium | Resolved | Error handling & resilience | `TwinCATStatusMapper.cs:29-42` |
| Driver.TwinCAT-012 | Medium | Resolved | Performance & resource management | `TwinCATDriver.cs:102`, `AdsTwinCATClient.cs:178-195` |
| Server-003 | Medium | Resolved | Correctness & logic bugs | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Phase7/RingBufferHistoryWriter.cs:96-119` |
| Server-005 | Medium | Resolved | Concurrency & thread safety | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Alarms/AlarmConditionService.cs:166`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:303-311` |
| Server-007 | Medium | Resolved | Error handling & resilience | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:179-183` |
| Server-010 | Medium | Resolved | Security | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:59`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:284-291` |
| Server-011 | Medium | Resolved | Security | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:322-346` |
| Server-013 | Medium | Resolved | Design-document adherence | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:9-19`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:296-346`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:89` |
@@ -0,0 +1,237 @@
# Code Review — Server
| Field | Value |
|---|---|
| Module | `src/Server/ZB.MOM.WW.OtOpcUa.Server` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 6 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Server-001, Server-002, Server-003 |
| 2 | OtOpcUa conventions | Server-004 |
| 3 | Concurrency & thread safety | Server-005, Server-006 |
| 4 | Error handling & resilience | Server-007, Server-008 |
| 5 | Security | Server-009, Server-010, Server-011 |
| 6 | Performance & resource management | Server-012 |
| 7 | Design-document adherence | Server-013 |
| 8 | Code organization & conventions | Server-014 |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | Server-015 |
## Findings
### Server-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:1791` |
| Status | Resolved |
**Description:** `WriteNodeIdUnknown` calls itself unconditionally as its first statement, then sets `errors[i]`. Unbounded recursion with no base case overflows the stack. Called from all four `HistoryRead*` overrides whenever a HistoryRead targets a node whose `NodeId` cannot be resolved to a driver full reference. Any client issuing such a HistoryRead triggers an uncatchable `StackOverflowException` that terminates the process — a remotely-triggerable DoS.
**Recommendation:** Replace the self-call with the result-slot assignment mirroring `WriteUnsupported`/`WriteInternalError`: `results[i] = new OpcHistoryReadResult { StatusCode = StatusCodes.BadNodeIdUnknown };` then `errors[i] = StatusCodes.BadNodeIdUnknown;`.
**Resolution:** Resolved 2026-05-22 — replaced the unconditional self-call in `WriteNodeIdUnknown` with the result-slot assignment (`results[i] = new OpcHistoryReadResult { StatusCode = StatusCodes.BadNodeIdUnknown }`), mirroring `WriteUnsupported`/`WriteInternalError`; the helper is now `internal` for testability. Regression test `DriverNodeManagerHistoryMappingTests.WriteNodeIdUnknown_returns_BadNodeIdUnknown_without_unbounded_recursion` runs the helper on a small-stack worker thread and asserts it returns promptly with `BadNodeIdUnknown`.
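The shape of the fix can be sketched as follows. This is a minimal illustration and not the repository's code: `HistoryResultSlot`, `StatusCodesSketch`, and `HistoryHelpersSketch` are hypothetical stand-ins for the stack's result and status types, and only the two assignments mirror what the resolution describes.

```csharp
using System;

// Hypothetical stand-ins; the real OPC UA stack types and signatures differ.
public static class StatusCodesSketch
{
    // Numeric value of Bad_NodeIdUnknown in the OPC UA status code table.
    public const uint BadNodeIdUnknown = 0x80340000;
}

public sealed class HistoryResultSlot
{
    public uint StatusCode { get; set; }
}

public static class HistoryHelpersSketch
{
    // Assigns both output slots and returns. There is no self-call,
    // so the helper terminates after two writes.
    public static void WriteNodeIdUnknown(HistoryResultSlot[] results, uint[] errors, int i)
    {
        results[i] = new HistoryResultSlot { StatusCode = StatusCodesSketch.BadNodeIdUnknown };
        errors[i] = StatusCodesSketch.BadNodeIdUnknown;
    }
}
```

Because a `StackOverflowException` cannot be caught, a regression test for this kind of bug has to assert that the helper returns promptly rather than assert that an exception is thrown.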
### Server-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs:60-63` |
| Status | Resolved |
**Description:** `IsAllowed` does `if (decision.IsAllowed) return true; return !_strictMode;`. When a session carries resolved LDAP groups and the evaluator returns an explicit deny, lax mode (default) overrides it to `true`. The lax fallback is intended only for sessions lacking LDAP groups / missing tries, but here it also nullifies authored `NodeAcl` deny rules for fully-resolved sessions. Per-tag deny ACLs do nothing until `StrictMode` is on.
**Recommendation:** Distinguish "indeterminate / no grant" from "explicit deny." Fall through to `!_strictMode` only when indeterminate; an explicit deny returns `false` regardless of mode. Extend `AuthorizeDecision` with an `IsExplicitDeny` flag if needed.
**Resolution:** Resolved 2026-05-22 — `AuthorizationGate.IsAllowed` now switches on the evaluator's `AuthorizationVerdict`: `Allow` returns true, `Denied` (explicit deny rule matched) returns false in both strict and lax mode, and only the indeterminate `NotGranted` case falls through to `!_strictMode`. The existing `AuthorizationVerdict.Denied` tri-state member is now honoured rather than collapsed into the lax fallback. Regression tests `ExplicitDeny_LaxMode_Denies` / `ExplicitDeny_StrictMode_Denies` / `NotGranted_LaxMode_Allows` / `NotGranted_StrictMode_Denies` in `AuthorizationGateTests` cover all four verdict×mode combinations via a fixed-verdict evaluator stub.
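The verdict-by-mode behaviour described in the resolution can be sketched as below. The enum member names come from the resolution text; the class name `AuthorizationGateSketch` and the single-argument `IsAllowed` signature are assumptions made for illustration, since the real gate takes session and node context.

```csharp
using System;

// Tri-state verdict as named in the resolution text.
public enum AuthorizationVerdict { Allow, NotGranted, Denied }

// Hypothetical, simplified gate: only the verdict and the mode are modelled.
public sealed class AuthorizationGateSketch
{
    private readonly bool _strictMode;

    public AuthorizationGateSketch(bool strictMode) => _strictMode = strictMode;

    public bool IsAllowed(AuthorizationVerdict verdict) => verdict switch
    {
        AuthorizationVerdict.Allow => true,
        // An explicit deny wins in both strict and lax mode.
        AuthorizationVerdict.Denied => false,
        // Only the indeterminate case uses the lax fallback.
        AuthorizationVerdict.NotGranted => !_strictMode,
        // Unknown values fail closed (an assumption of this sketch).
        _ => false,
    };
}
```

The key design point is that the lax fallback is reachable only from `NotGranted`, so turning strict mode off can never widen access past an authored deny rule.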
### Server-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Phase7/RingBufferHistoryWriter.cs:96-119` |
| Status | Resolved |
**Description:** `ReadRawAsync`'s XML doc claims "newest-first," but `TagRingBuffer.Snapshot()` returns oldest-to-newest and the loop preserves that order — so results are oldest-first. Also `maxValuesPerNode` is capped against total buffer size *before* the `[startUtc, endUtc)` filter, so a paged read returns the oldest in-window samples, contradicting the doc and usual HistoryRead expectations.
**Recommendation:** Make code and doc agree on ordering (raw HistoryRead is normally ascending source-timestamp). Apply `maxValuesPerNode` to the in-window count, not the whole buffer.
**Resolution:** Resolved 2026-05-22 — corrected XML doc from "newest-first" to "oldest-first (ascending source timestamp, matching OPC UA Part 11 §6.4 raw-values default)"; moved `maxValuesPerNode` cap inside the time-window loop so the limit applies only to in-window results, not the whole buffer snapshot.
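A minimal sketch of the corrected read loop, assuming a snapshot that is already ordered oldest-first. `Sample` and `RingBufferReadSketch` are hypothetical names, and treating `maxValuesPerNode == 0` as "no limit" is an assumption of this sketch rather than something the finding states.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sample type for illustration.
public readonly record struct Sample(DateTime SourceTimestampUtc, double Value);

public static class RingBufferReadSketch
{
    // Returns in-window samples in ascending source-timestamp order.
    // The cap is checked against the in-window count, not the snapshot size.
    public static List<Sample> ReadRaw(
        IReadOnlyList<Sample> snapshotOldestFirst,
        DateTime startUtc,
        DateTime endUtc,
        int maxValuesPerNode)
    {
        var result = new List<Sample>();
        foreach (var sample in snapshotOldestFirst)
        {
            // Half-open window [startUtc, endUtc), as in the finding.
            if (sample.SourceTimestampUtc < startUtc || sample.SourceTimestampUtc >= endUtc)
                continue;

            // Assumption: 0 means "no limit".
            if (maxValuesPerNode > 0 && result.Count >= maxValuesPerNode)
                break;

            result.Add(sample);
        }
        return result;
    }
}
```

With the cap inside the loop, samples that fall outside the window no longer consume any of the `maxValuesPerNode` budget.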
### Server-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:187-200` |
| Status | Open |
**Description:** `RoleBasedIdentity` declares its own `Display` property, but the base `UserIdentity` already has a settable `DisplayName`. `DriverNodeManager.ResolveCallUser`/`RouteScriptedAlarmMethodCalls` read the base `DisplayName`, never `Display`. Since the ctor passes only `userName` to base, `DisplayName` resolves to the username — so scripted-alarm Ack/Confirm/Shelve audit entries record the raw username, not the LDAP-resolved display name the comment promises. `Display` is dead code.
**Recommendation:** Drop `Display`; set the base `DisplayName = displayName ?? userName;`. Verify `ResolveCallUser` yields the resolved display name.
**Resolution:** _(open)_
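The recommendation could look roughly like the sketch below. It is not a recorded fix: `UserIdentityStandIn` is a hypothetical stand-in for the stack's `UserIdentity` base class, and the constructor shape of `RoleBasedIdentitySketch` is assumed.

```csharp
using System;

// Hypothetical stand-in for the base identity type, which per the finding
// exposes a settable DisplayName that defaults to the username.
public class UserIdentityStandIn
{
    public UserIdentityStandIn(string userName) => DisplayName = userName;

    public string DisplayName { get; set; }
}

public sealed class RoleBasedIdentitySketch : UserIdentityStandIn
{
    // No separate Display property: the resolved name is written to the
    // base DisplayName, which is what the audit path reads.
    public RoleBasedIdentitySketch(string userName, string? displayName)
        : base(userName)
    {
        DisplayName = displayName ?? userName;
    }
}
```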
### Server-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Alarms/AlarmConditionService.cs:166`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:303-311` |
| Status | Resolved |
**Description:** `OnValueChanged` raises `TransitionRaised` on the value-change thread; the subscriber `OnAlarmServiceTransition` drives `ConditionSink.OnTransition` → `alarm.ReportEvent`. `DriverNodeManager.Dispose` detaches the handler but does not synchronise against an in-flight `Invoke`. The service is process-shared across drivers, so a transition can dispatch to a `ConditionSink` whose `DriverNodeManager` is concurrently being disposed → `ReportEvent` on a torn-down node manager.
**Recommendation:** Guard `OnAlarmServiceTransition` with a `_disposed` check under `Lock` before `sink.OnTransition`. Document that handlers must tolerate invocation during their owner's disposal.
**Resolution:** Resolved 2026-05-22 — added `_nodeManagerDisposed` field; `Dispose(bool)` now sets it under `Lock` before detaching the handler; `OnAlarmServiceTransition` checks the flag under the same `Lock` and exits early, preventing forwarding to a sink after the node manager has begun disposal.
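The guard pattern described in this resolution might look roughly like the following Python sketch. It is a language-neutral illustration, not the C# `DriverNodeManager` code: the class and method names are hypothetical, and keeping the sink call inside the lock (so that disposal waits for an in-flight forward) is an assumption about the fix rather than something the finding states.

```python
import threading


class NodeManagerSketch:
    """Hypothetical stand-in for a node manager that forwards alarm transitions."""

    def __init__(self, sink):
        self._lock = threading.Lock()
        self._disposed = False
        self._sink = sink

    def on_alarm_transition(self, event):
        """Forward the event unless disposal has begun; return True if forwarded."""
        with self._lock:
            if self._disposed:
                return False  # exit early: never touch a torn-down sink
            self._sink(event)
            return True

    def dispose(self):
        # Set the flag under the same lock the handler takes, so a handler
        # either completes before disposal or observes the flag afterwards.
        with self._lock:
            self._disposed = True


received = []
manager = NodeManagerSketch(received.append)
forwarded_before = manager.on_alarm_transition("alarm-1")
manager.dispose()
forwarded_after = manager.on_alarm_transition("alarm-2")
print(received, forwarded_before, forwarded_after)
```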
### Server-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:478-482, 1342-1348` |
| Status | Open |
**Description:** `OnReadValue`/`OnWriteValue` are synchronous stack hooks that block on async driver calls via `.GetAwaiter().GetResult()` with `CancellationToken.None`. With `MaxRequestThreadCount = 100`, a burst of reads/writes into a stalled driver pins request threads for the full pipeline timeout, exhausting the pool and stalling unrelated sessions. The call cannot be cancelled by a client timeout.
**Recommendation:** Derive a `CancellationToken` from the `OperationContext` / `TransportQuotas.OperationTimeout` so a stuck driver call is abandoned. Longer term, use the stack's async service overrides if available.
**Resolution:** _(open)_
### Server-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:179-183` |
| Status | Resolved |
**Description:** `HealthEndpointsHost` is built without a `configDbHealthy` delegate, so the default `() => true` is used — `/healthz` always reports `configDbReachable = true` and never 503s on a DB outage. `_staleConfigFlag` is also never supplied by `Program.cs`, so the stale-config signal is inert too. `/healthz` degenerates to a pure liveness probe; operators get a false-healthy during a DB outage.
**Recommendation:** Wire a real config-DB probe (cheap cached `SELECT 1`) into `HealthEndpointsHost`, and register `StaleConfigFlag` in `Program.cs`. Or move DB health to `/readyz` and drop the misleading `configDbReachable` field.
**Resolution:** Resolved 2026-05-22 — added `Func<bool>? configDbHealthy` parameter to `OpcUaApplicationHost` (defaults null, backward-compatible); `Program.cs` constructs a `DbHealthCache` that calls `CanConnectAsync` every 10 s and caches the result, then passes `() => dbHealthCache.IsHealthy`; `/healthz` now reflects real DB reachability and returns 503 on a DB outage (unless stale-config cache is warm).
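As a rough illustration of the cached probe, the Python sketch below caches a reachability check for a fixed interval and maps the result to a health status. It is only a sketch under stated assumptions: the real `DbHealthCache` is C# and refreshes on a timer, whereas this version refreshes lazily on read, and `healthz_status` and the injected `clock` are hypothetical names.

```python
import time


class DbHealthCacheSketch:
    """Cache the result of a cheap reachability probe for ttl_seconds."""

    def __init__(self, probe, ttl_seconds=10.0, clock=time.monotonic):
        self._probe = probe
        self._ttl = ttl_seconds
        self._clock = clock
        self._checked_at = None
        self._healthy = False

    def is_healthy(self):
        now = self._clock()
        if self._checked_at is None or now - self._checked_at >= self._ttl:
            try:
                self._healthy = bool(self._probe())
            except Exception:
                # A probe that throws counts as "unreachable".
                self._healthy = False
            self._checked_at = now
        return self._healthy


def healthz_status(db_healthy, stale_config_cache_warm):
    """Return 503 on a DB outage unless the stale-config cache is warm."""
    return 200 if (db_healthy or stale_config_cache_warm) else 503


fake_now = [0.0]
probe_calls = []


def fake_probe():
    probe_calls.append(1)
    return len(probe_calls) < 2  # first probe succeeds, later ones fail


cache = DbHealthCacheSketch(fake_probe, ttl_seconds=10.0, clock=lambda: fake_now[0])
first = cache.is_healthy()
second = cache.is_healthy()      # served from cache, no new probe
fake_now[0] = 10.0
third = cache.is_healthy()       # TTL elapsed, probe runs again
print(first, second, third, len(probe_calls))
```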
### Server-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:736` |
| Status | Open |
**Description:** `RouteScriptedAlarmMethodCalls` marks a handled slot by setting `errors[i] = ServiceResult.Good`, assuming `base.Call` skips non-null *Good* error slots. The stack and `GateCallMethodRequests` only ever pre-populate *Bad* slots; the skip-on-Good assumption is not a guaranteed SDK contract. If `base.Call` re-dispatches, the engine method and the stack's built-in Part 9 handler both fire — double transition.
**Recommendation:** Verify against the pinned SDK whether `base.Call` skips Good-pre-populated slots. If not, exclude routed slots from `methodsToCall` before `base.Call`. Add a test asserting exactly-once engine transition for a routed Acknowledge.
**Resolution:** _(open)_
### Server-009
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/LdapOptions.cs:44`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:74` |
| Status | Resolved |
**Description:** `AllowInsecureLdap` defaults to `true` (and `Program.cs` reads `?? true`); `UseTls` defaults to `false`. Out of the box, usernames and plaintext passwords are bound to LDAP over an unencrypted socket. A production deployment enabling LDAP without explicitly setting `AllowInsecureLdap=false` ships credentials in clear text on the server→LDAP hop.
**Recommendation:** Default `AllowInsecureLdap` to `false` in both the property initializer and the `Program.cs` fallback. Log a startup warning when LDAP is enabled with `UseTls=false && AllowInsecureLdap=true`.
**Resolution:** Resolved 2026-05-22 — `LdapOptions.AllowInsecureLdap` now defaults to `false` (secure-by-default) and `Program.cs`'s config fallback reads `?? false`. `Program.cs` logs a startup `Log.Warning` when LDAP is enabled with `UseTls=false && AllowInsecureLdap=true`, flagging the clear-text credential hop. Regression tests in `LdapOptionsTests` assert the new secure defaults.
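The secure-by-default shape of this fix can be sketched in Python as follows. This is an illustration only: the field names mirror the C# options, but the dataclass, the `enabled` default, and the warning text are assumptions, not the project's code.

```python
from dataclasses import dataclass


@dataclass
class LdapOptionsSketch:
    enabled: bool = False              # assumed default, not stated in the finding
    use_tls: bool = False              # default per the finding
    allow_insecure_ldap: bool = False  # secure default after the fix


def ldap_startup_warnings(options):
    """Return warnings to log at startup for risky LDAP settings."""
    warnings = []
    if options.enabled and not options.use_tls and options.allow_insecure_ldap:
        warnings.append(
            "LDAP is enabled with UseTls=false and AllowInsecureLdap=true; "
            "credentials cross the server-to-LDAP hop in clear text."
        )
    return warnings


defaults = LdapOptionsSketch()
risky = LdapOptionsSketch(enabled=True, use_tls=False, allow_insecure_ldap=True)
print(ldap_startup_warnings(defaults), len(ldap_startup_warnings(risky)))
```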
### Server-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:59`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:284-291` |
| Status | Resolved |
**Description:** `AutoAcceptUntrustedClientCertificates` defaults to `true` (`Program.cs` reads `?? true`). `BuildConfiguration` wires a handler that accepts any client cert failing with `BadCertificateUntrusted`. A deployment that forgets to flip the flag accepts every untrusted client cert, defeating the PKI trust list. With the always-present `None` policy, the default posture is fully open.
**Recommendation:** Default `AutoAcceptUntrustedClientCertificates` to `false`; keep auto-accept as opt-in dev convenience. `docs/security.md` already shows `false` — align code to doc.
**Resolution:** Resolved 2026-05-22 — `OpcUaServerOptions.AutoAcceptUntrustedClientCertificates` property initialiser changed from `true` to `false` (secure by default, aligning with `docs/security.md`); `Program.cs` config fallback changed from `?? true` to `?? false`.
### Server-011
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:322-346` |
| Status | Resolved |
**Description:** `BuildUserTokenPolicies` advertises a `UserName` token policy only when `SecurityProfile == Basic256Sha256SignAndEncrypt && Ldap.Enabled`. With the default `SecurityProfile = None` and `Ldap.Enabled = true`, the LDAP authenticator is wired but no UserName policy is advertised — clients cannot present credentials; the only path in is Anonymous. The operator's intent is silently not honoured, with no diagnostic.
**Recommendation:** Validate config at startup and warn/fail when `Ldap.Enabled = true` but no UserName policy is advertised. Allow UserName tokens on any non-None profile (they are stack-encrypted regardless, per `docs/security.md`).
**Resolution:** Resolved 2026-05-22 — `BuildUserTokenPolicies` now advertises a `UserName` token policy whenever `Ldap.Enabled && SecurityProfile != None` (previously required `== Basic256Sha256SignAndEncrypt`); `StartAsync` logs a `LogWarning` at startup when `Ldap.Enabled = true` but `SecurityProfile = None`, surfacing the misconfiguration before clients connect.
### Server-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Hosting/PeerHttpProbeLoop.cs:78-79` |
| Status | Open |
**Description:** `ProbeAsync` creates an `IHttpClientFactory` client and mutates `client.Timeout` on every 2-second probe tick. The timeout belongs on the request or on the named-client registration, not set per call on a factory-vended instance.
**Recommendation:** Configure the timeout once via `AddHttpClient(HttpClientName).ConfigureHttpClient(...)`, or use a per-request linked `CancellationTokenSource(_options.HttpProbeTimeout)`; drop the per-call `client.Timeout` mutation.
**Resolution:** _(open)_
### Server-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:9-19`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:296-346`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:89` |
| Status | Resolved |
**Description:** `docs/security.md` documents 7 transport security profiles and `CLAUDE.md` references a `SecurityProfileResolver`. The code's `OpcUaSecurityProfile` enum has only `None` and `Basic256Sha256SignAndEncrypt`; `BuildSecurityPolicies` adds a policy only for the latter; `SecurityProfileResolver` does not exist in the repo (grep finds it only in docs). `Basic256Sha256-Sign` and all Aes profiles are unimplemented, and `Program.cs:89`'s `Enum.TryParse` silently selects `None` for an unrecognised profile string.
**Recommendation:** Reconcile code and docs — implement the missing profiles + `SecurityProfileResolver`, or trim `docs/security.md` / `CLAUDE.md` to the two supported profiles. At minimum, log a warning when a configured `SecurityProfile` fails to parse instead of silently using `None`.
**Resolution:** Resolved 2026-05-22 — replaced the silent `Enum.TryParse ?? None` fallback in `Program.cs` with a `ParseSecurityProfile` helper that produces a warning string listing supported profiles when the configured value is unrecognised; the warning is emitted via `Log.Warning` at startup before the host builds, making the misconfiguration immediately visible. Implementing the missing 5 profiles is tracked as a doc-to-code gap rather than a single finding fix.
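A parse helper of the kind described here could be sketched in Python as below. The function name, the case-insensitive match, and the exact warning text are assumptions for illustration; the actual `ParseSecurityProfile` is C# and may differ. The point is that the fallback to `None` is accompanied by a warning the caller can log, instead of happening silently.

```python
from enum import Enum


class SecurityProfileSketch(Enum):
    NONE = "None"
    BASIC256SHA256_SIGN_AND_ENCRYPT = "Basic256Sha256SignAndEncrypt"


def parse_security_profile(raw):
    """Return (profile, warning); warning is None when raw parsed cleanly."""
    if raw is None or not raw.strip():
        return SecurityProfileSketch.NONE, None  # nothing configured
    wanted = raw.strip().lower()
    for profile in SecurityProfileSketch:
        if profile.value.lower() == wanted:  # assumption: case-insensitive match
            return profile, None
    supported = ", ".join(p.value for p in SecurityProfileSketch)
    warning = (
        f"Unrecognised SecurityProfile '{raw}'; falling back to None. "
        f"Supported profiles: {supported}"
    )
    return SecurityProfileSketch.NONE, warning


profile, warning = parse_security_profile("Aes256_Sha256_RsaPss")
if warning:
    print(warning)
```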
### Server-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/SealedBootstrap.cs` |
| Status | Open |
**Description:** `SealedBootstrap` claims in its xml-doc to "close release blocker #2" by consuming the generation-sealed cache + resilient reader + stale-config flag, but `Program.cs` registers and uses `NodeBootstrap` instead. `SealedBootstrap` is never registered in DI nor referenced by `OpcUaServerService` — it and its `StaleConfigFlag` plumbing are dead in the production wire-up; the release blocker remains open in practice.
**Recommendation:** Either register `SealedBootstrap` (with `GenerationSealedCache`/`ResilientConfigReader`/`StaleConfigFlag`) and wire `StaleConfigFlag` into the health host, or delete `SealedBootstrap` and correct the release-readiness doc.
**Resolution:** _(open)_
### Server-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:16-21`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:21-26` |
| Status | Open |
**Description:** `OtOpcUaServer`'s class doc still says "PR 16 minimum-viable scope ... no security ... LDAP + security profiles are deferred." `OpcUaServerOptions`'s class doc says "PR 17 minimum-viable scope: no LDAP, no security profiles beyond None." Both are stale: the class now does LDAP UserName auth, anonymous-role mapping, and a configurable security profile. A reader would wrongly conclude the server has no authentication.
**Recommendation:** Update both class summaries to describe current behaviour and drop the "deferred to a future PR" language.
**Resolution:** _(open)_
+53
@@ -0,0 +1,53 @@
# Code Review — &lt;Module&gt;
<!-- Template for a per-module findings file. Copy to code-reviews/<Module>/findings.md.
See ../../REVIEW-PROCESS.md for the full process. The base README.md is generated
from these files by regen-readme.py — do not edit README.md by hand. -->
| Field | Value |
|---|---|
| Module | `src/<area>/ZB.MOM.WW.OtOpcUa.<Module>` |
| Reviewer | <name> |
| Review date | <YYYY-MM-DD> |
| Commit reviewed | `<short-sha>` |
| Status | Not started |
| Open findings | 0 |
## Checklist coverage
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | _pending_ |
| 2 | OtOpcUa conventions | _pending_ |
| 3 | Concurrency & thread safety | _pending_ |
| 4 | Error handling & resilience | _pending_ |
| 5 | Security | _pending_ |
| 6 | Performance & resource management | _pending_ |
| 7 | Design-document adherence | _pending_ |
| 8 | Code organization & conventions | _pending_ |
| 9 | Testing coverage | _pending_ |
| 10 | Documentation & comments | _pending_ |
## Findings
<!-- One ### entry per finding. IDs are <Module>-NNN, sequential within the module,
never reused. Findings are never deleted — close them by changing Status and
completing Resolution. -->
### <Module>-001
| Field | Value |
|---|---|
| Severity | Critical / High / Medium / Low |
| Category | one of the 10 checklist categories |
| Location | `path/to/File.cs:NN` |
| Status | Open / In Progress / Resolved / Won't Fix / Deferred |
**Description:** What is wrong and why it matters.
**Recommendation:** Concrete suggested fix.
**Resolution:** _(empty until closed; on close, record the fixing commit SHA, the date, and a one-line description of the fix)_
+78
@@ -0,0 +1,78 @@
# Prompt — resolve open code-review findings
Reusable orchestration prompt for clearing the `code-reviews/` backlog. Paste it
to a fresh agent when you want the remaining findings worked through.
---
Resolve all open code-review findings (every severity), following the workflow
in `REVIEW-PROCESS.md`.
## Setup
- Read `code-reviews/README.md` for the open findings and `REVIEW-PROCESS.md`
for the workflow. Group the open findings by module.
- A module is one folder under `code-reviews/` — one `src/` project or one
`tests/` project, named with the `ZB.MOM.WW.OtOpcUa.` prefix stripped. The
module→project mapping is in `REVIEW-PROCESS.md` section 1; the build/test
commands are in `CLAUDE.md` ("Build Commands").
## Dispatch — one general-purpose subagent per module, in batches of ~5 modules
Each subagent, for every open finding in its assigned module, must:
- Verify the finding's root cause against the actual source. Do NOT trust the
finding text — if it is wrong or misclassified, re-triage it (correct the
severity/description in that module's `findings.md`) instead of forcing a fix.
- Use real TDD: write the regression test FIRST and run it to confirm it fails,
THEN implement the root-cause fix, THEN confirm it passes. (Do not use
`git stash` — parallel agents would race on the shared stash stack.)
- The regression test belongs in the reviewed project's own test project — a
finding in `src/.../ZB.MOM.WW.OtOpcUa.Driver.Galaxy` gets its test in
`tests/.../ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests`.
- Run that module's build and test suite and confirm it is green:
  - Build + unit-test the affected project, e.g.
    `dotnet build src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/...` and
    `dotnet test tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/...`.
  - A single test: `dotnet test --filter "FullyQualifiedName~MyClass.MyMethod"`.
  - `*.IntegrationTests` need their Docker fixture up; bring it up with
    `lmxopcua-fix up <driver> <profile>` (see `CLAUDE.md` "Docker Workflow").
    DB-backed `*.Configuration.Tests`, `*.Admin.Tests`, and `*.Server.Tests`
    need the central SQL Server. If a fixture/service is unavailable, document
    why the suite was skipped rather than reporting it green.
  - For a change that crosses project boundaries, build each affected project;
    a whole-solution check is `dotnet build ZB.MOM.WW.OtOpcUa.slnx`.
- Update only that module's `code-reviews/<Module>/findings.md`: set each
resolved finding's Status to `Resolved` with a Resolution note describing the
fix (the orchestrator appends the fixing commit SHA), and update the header
"Open findings" count.
- CONSTRAINTS: edit only the source and test files needed for the assigned
module's findings, plus that module's own `findings.md`. Do NOT edit
`code-reviews/README.md`. Do NOT commit. Do NOT touch another module's
`findings.md`.
- Report a summary: each finding — root-cause confirmation, the fix, test names,
and any re-triage.
Batch so that no two subagents in the same batch write to the same project. In
particular do not review a `src/` project and its matching `*.Tests` project in
the same batch — a finding in the source project adds its regression test to
that test project.
## After each batch returns (orchestrator does this — keep your own context lean)
- Build and test every project the batch touched, using the `CLAUDE.md`
commands; confirm clean. For a wide change, `dotnet build ZB.MOM.WW.OtOpcUa.slnx`.
- Commit per module — one commit per module, message referencing the finding
IDs. Record the fixing commit SHA in each finding's Resolution.
- Regenerate the index: `python code-reviews/regen-readme.py`, then
`python code-reviews/regen-readme.py --check` to confirm it is consistent;
stage `code-reviews/README.md`. (Use `python` — the bare `python3` alias on
this box resolves to the Windows Store stub and fails.) You may stage
`README.md` with each module's commit, or commit it once per batch after the
script runs.
- Push.
## Continue
Continue batch by batch until all findings are Resolved or re-triaged. If a
finding needs a design decision, skip it and surface it rather than guessing.
+241
@@ -0,0 +1,241 @@
#!/usr/bin/env python3
"""Regenerate code-reviews/README.md from the per-module findings.md files.

The per-module findings.md files are the source of truth. This script aggregates
them into the single cross-module README.md (module status + pending/closed
finding tables).

Usage:
    python code-reviews/regen-readme.py          # rewrite README.md
    python code-reviews/regen-readme.py --check  # exit 1 if stale or inconsistent

`--check` fails when README.md is out of date OR when a module's header
`Open findings` count disagrees with its finding statuses, or a finding
carries an unrecognised Status value.
"""

from __future__ import annotations

import re
import sys
from pathlib import Path

ROOT = Path(__file__).resolve().parent
README = ROOT / "README.md"

PENDING_STATUSES = {"Open", "In Progress"}
KNOWN_STATUSES = {"Open", "In Progress", "Resolved", "Won't Fix", "Deferred"}
SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

GENERATED_NOTE = (
    "<!-- GENERATED FILE - do not edit by hand. "
    "Regenerate with: python code-reviews/regen-readme.py -->"
)

def cell(value: str) -> str:
    """Escape a value for safe inclusion in a markdown table cell."""
    return value.replace("|", "\\|").strip()


def summarize(value: str, limit: int = 240) -> str:
    """Trim a long description to a single-cell-friendly summary."""
    value = value.strip()
    if len(value) <= limit:
        return value
    # Reserve one character for the trailing ellipsis so the result fits limit.
    return value[: limit - 1].rstrip() + "…"

def first_table(text: str) -> dict[str, str]:
    """Parse the first contiguous block of '| key | value |' rows into a dict."""
    rows: dict[str, str] = {}
    started = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("|"):
            started = True
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            if len(cells) >= 2:
                key, value = cells[0], cells[1]
                if key and not set(key) <= {"-", ":"} and key != "Field":
                    rows[key] = value
        elif started:
            break
    return rows

def parse_module(findings_path: Path) -> dict:
    """Parse one module's findings.md into its header and finding list."""
    text = findings_path.read_text(encoding="utf-8")
    module = findings_path.parent.name

    parts = re.split(r"^##\s+Findings\s*$", text, maxsplit=1, flags=re.M)
    header = first_table(parts[0])

    findings: list[dict] = []
    if len(parts) > 1:
        for chunk in re.split(r"^###\s+", parts[1], flags=re.M)[1:]:
            fid = chunk.splitlines()[0].strip()
            tbl = first_table(chunk)
            desc_m = re.search(
                r"\*\*Description:\*\*\s*(.*?)(?=\n\*\*|\Z)", chunk, re.S
            )
            desc = re.sub(r"\s+", " ", desc_m.group(1)).strip() if desc_m else ""
            findings.append(
                {
                    "id": fid,
                    "severity": tbl.get("Severity", ""),
                    "category": tbl.get("Category", ""),
                    "location": tbl.get("Location", ""),
                    "status": tbl.get("Status", ""),
                    "description": desc,
                }
            )
    return {"module": module, "header": header, "findings": findings}

def build_readme(modules: list[dict]) -> str:
    modules = sorted(modules, key=lambda m: m["module"])
    all_findings = [
        dict(f, module=m["module"]) for m in modules for f in m["findings"]
    ]
    pending = [f for f in all_findings if f["status"] in PENDING_STATUSES]
    closed = [
        f
        for f in all_findings
        if f["status"] and f["status"] not in PENDING_STATUSES
    ]

    def sev_key(f: dict) -> tuple:
        return (SEVERITY_ORDER.get(f["severity"], 9), f["id"])

    pending.sort(key=sev_key)
    closed.sort(key=sev_key)

    out: list[str] = [
        "# Code Reviews",
        "",
        GENERATED_NOTE,
        "",
        "Cross-module code review index for the OtOpcUa server codebase "
        "(`lmxopcua`). The review process is defined in "
        "[../REVIEW-PROCESS.md](../REVIEW-PROCESS.md).",
        "",
        "Each module's `findings.md` is the source of truth; this file is generated "
        "from them by `regen-readme.py` and must not be edited by hand.",
        "",
        "## Module status",
        "",
        "| Module | Reviewer | Date | Commit | Status | Open | Total |",
        "|---|---|---|---|---|---|---|",
    ]
    if not modules:
        out.append(
            "| _no modules reviewed yet_ | | | | | | |"
        )
    for m in modules:
        h = m["header"]
        open_n = sum(
            1 for f in m["findings"] if f["status"] in PENDING_STATUSES
        )
        out.append(
            f"| [{m['module']}]({m['module']}/findings.md) "
            f"| {cell(h.get('Reviewer', ''))} "
            f"| {cell(h.get('Review date', ''))} "
            f"| {cell(h.get('Commit reviewed', ''))} "
            f"| {cell(h.get('Status', ''))} "
            f"| {open_n} | {len(m['findings'])} |"
        )

    out += ["", "## Pending findings", ""]
    out.append(
        "Findings with status `Open` or `In Progress`, ordered by severity."
    )
    out.append("")
    if pending:
        out.append("| ID | Severity | Category | Location | Description |")
        out.append("|---|---|---|---|---|")
        for f in pending:
            out.append(
                f"| {cell(f['id'])} | {cell(f['severity'])} "
                f"| {cell(f['category'])} | {cell(f['location'])} "
                f"| {cell(summarize(f['description']))} |"
            )
    else:
        out.append("_No pending findings._")

    out += ["", "## Closed findings", ""]
    out.append("Findings with status `Resolved`, `Won't Fix`, or `Deferred`.")
    out.append("")
    if closed:
        out.append("| ID | Severity | Status | Category | Location |")
        out.append("|---|---|---|---|---|")
        for f in closed:
            out.append(
                f"| {cell(f['id'])} | {cell(f['severity'])} "
                f"| {cell(f['status'])} | {cell(f['category'])} "
                f"| {cell(f['location'])} |"
            )
    else:
        out.append("_No closed findings._")

    return "\n".join(out) + "\n"

def find_inconsistencies(modules: list[dict]) -> list[str]:
    """Return human-readable problems in the per-module findings.md files.

    Checks that each module header's `Open findings` count agrees with its
    finding statuses, and that every finding carries a known Status value.
    """
    issues: list[str] = []
    for m in modules:
        open_n = sum(
            1 for f in m["findings"] if f["status"] in PENDING_STATUSES
        )
        declared = m["header"].get("Open findings", "").strip()
        if declared != str(open_n):
            issues.append(
                f"{m['module']}: header 'Open findings' = '{declared}' but "
                f"{open_n} finding(s) are Open/In Progress"
            )
        for f in m["findings"]:
            if f["status"] not in KNOWN_STATUSES:
                issues.append(
                    f"{m['module']}: finding {f['id']} has unrecognised "
                    f"Status '{f['status']}'"
                )
    return issues

def main(argv: list[str]) -> int:
    check = "--check" in argv[1:]
    module_dirs = sorted(
        d
        for d in ROOT.iterdir()
        if d.is_dir() and d.name != "_template" and (d / "findings.md").is_file()
    )
    modules = [parse_module(d / "findings.md") for d in module_dirs]
    content = build_readme(modules)
    issues = find_inconsistencies(modules)

    if check:
        stale = (
            README.read_text(encoding="utf-8") if README.exists() else ""
        ) != content
        for issue in issues:
            print(f"inconsistent: {issue}", file=sys.stderr)
        if stale:
            print(
                "code-reviews/README.md is stale - run regen-readme.py",
                file=sys.stderr,
            )
        if stale or issues:
            return 1
        print("code-reviews/README.md is up to date and consistent.")
        return 0

    for issue in issues:
        print(f"warning: {issue}", file=sys.stderr)
    README.write_text(content, encoding="utf-8", newline="\n")
    print(f"Wrote {README} ({len(modules)} modules).")
    return 0


if __name__ == "__main__":
    raise SystemExit(main(sys.argv))
+165
@@ -0,0 +1,165 @@
#!/usr/bin/env python3
"""Tests for regen-readme.py.

Dependency-free: run with `python code-reviews/test_regen_readme.py`.
Exits 0 if all tests pass, 1 otherwise.
"""

from __future__ import annotations

import importlib.util
import tempfile
import traceback
from pathlib import Path

HERE = Path(__file__).resolve().parent

# regen-readme.py is not an importable module name (hyphen), so load it by path.
_spec = importlib.util.spec_from_file_location("regen_readme", HERE / "regen-readme.py")
regen = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(regen)

FIXTURE = """# Code Review — Demo
| Field | Value |
|---|---|
| Module | `src/Demo` |
| Reviewer | Tester |
| Review date | 2026-05-18 |
| Commit reviewed | `abc1234` |
| Status | Reviewed |
| Open findings | 1 |
## Findings
### Demo-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/Demo/File.cs:10` |
| Status | Open |
**Description:** A first problem that matters.
**Recommendation:** Fix it.
**Resolution:** _(open)_
### Demo-002
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Demo/File.cs:20` |
| Status | Resolved |
**Description:** A second, minor problem.
**Recommendation:** Tidy it.
**Resolution:** Fixed in def5678 on 2026-05-18.
"""
def _parse_fixture() -> dict:
    """Write FIXTURE to a temp Demo/findings.md and parse it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Demo" / "findings.md"
        path.parent.mkdir()
        path.write_text(FIXTURE, encoding="utf-8")
        return regen.parse_module(path)

def test_first_table_skips_separator_and_field_header():
    table = regen.first_table("| Field | Value |\n|---|---|\n| Severity | High |\n")
    assert table == {"Severity": "High"}, table


def test_parse_module_header():
    m = _parse_fixture()
    assert m["module"] == "Demo", m["module"]
    assert m["header"]["Reviewer"] == "Tester"
    assert m["header"]["Status"] == "Reviewed"
    assert m["header"]["Open findings"] == "1"


def test_parse_module_findings():
    m = _parse_fixture()
    assert len(m["findings"]) == 2, len(m["findings"])
    first = m["findings"][0]
    assert first["id"] == "Demo-001"
    assert first["severity"] == "High"
    assert first["category"] == "Security"
    assert first["location"] == "`src/Demo/File.cs:10`"
    assert first["status"] == "Open"
    assert first["description"] == "A first problem that matters."
    assert m["findings"][1]["status"] == "Resolved"


def test_build_readme_splits_pending_and_closed():
    readme = regen.build_readme([_parse_fixture()])
    assert "## Pending findings" in readme
    assert "## Closed findings" in readme
    pending, closed = readme.split("## Closed findings", 1)
    assert "Demo-001" in pending  # Open -> pending
    assert "Demo-001" not in closed
    assert "Demo-002" in closed  # Resolved -> closed
    assert "_No pending findings._" not in pending


def test_build_readme_handles_no_modules():
    readme = regen.build_readme([])
    assert "no modules reviewed yet" in readme
    assert "_No pending findings._" in readme
    assert "_No closed findings._" in readme


def test_find_inconsistencies_clean_fixture():
    assert regen.find_inconsistencies([_parse_fixture()]) == []


def test_find_inconsistencies_detects_wrong_open_count():
    m = _parse_fixture()
    m["header"]["Open findings"] = "7"
    issues = regen.find_inconsistencies([m])
    assert len(issues) == 1 and "Open findings" in issues[0], issues


def test_find_inconsistencies_detects_unknown_status():
    m = _parse_fixture()
    m["findings"][0]["status"] = "Bogus"
    issues = regen.find_inconsistencies([m])
    # Wrong status also shifts the open count, so expect the status issue present.
    assert any("unrecognised Status" in i for i in issues), issues

def test_summarize_truncates_long_text():
    long = "x" * 500
    out = regen.summarize(long)
    # The summary must fit the default limit and end with an ellipsis.
    assert len(out) <= 240 and out.endswith("…"), len(out)
    assert regen.summarize("short") == "short"

def main() -> int:
    tests = sorted(
        (name, fn)
        for name, fn in globals().items()
        if name.startswith("test_") and callable(fn)
    )
    failed = 0
    for name, fn in tests:
        try:
            fn()
            print(f"PASS {name}")
        except Exception:  # noqa: BLE001 - test runner reports all failures
            failed += 1
            print(f"FAIL {name}")
            traceback.print_exc()
    print(f"\n{len(tests) - failed}/{len(tests)} passed.")
    return 1 if failed else 0


if __name__ == "__main__":
    raise SystemExit(main())
+7 -1
@@ -110,7 +110,13 @@ AB CIP ALMD) route to AVEVA Historian via the Wonderware sidecar:
present.
- `SqliteStoreAndForwardSink` queues each transition to a local
SQLite database and drains in the background via the resolved
writer.
writer. **The durability guarantee is bounded**: the queue capacity
defaults to 1,000,000 rows; under a sustained historian outage,
older non-dead-lettered rows are evicted (oldest first) to make
room for new events. The `HistorianSinkStatus.EvictedCount` counter
surfaces lifetime eviction events to the Admin UI
`/alarms/historian` diagnostics page so operators can detect silent
data loss without log scraping.
- Sidecar (PR C.1 + C.2) forwards the events to `aahClientManaged`'s
alarm-event write API; the live SDK call site is pinned during
PR D.1's deploy-rig validation.
+1 -1
@@ -35,7 +35,7 @@ new ScriptedAlarmDefinition(
## Predicate evaluation
Alarm predicates reuse the same Roslyn sandbox as virtual tags — `ScriptEvaluator<AlarmPredicateContext, bool>` compiles the source, `TimedScriptEvaluator` wraps it with the configured timeout (default from `TimedScriptEvaluator.DefaultTimeout`), and `DependencyExtractor` statically harvests the tag paths the script reads. The sandbox rules (forbidden types, cancellation, logging sinks) are documented in [VirtualTags.md](VirtualTags.md); ScriptedAlarms does not redefine them.
Alarm predicates reuse the same Roslyn sandbox as virtual tags — `ScriptEvaluator<AlarmPredicateContext, bool>` compiles the source, `TimedScriptEvaluator` wraps it with the configured timeout (default from `TimedScriptEvaluator.DefaultTimeout`), and `DependencyExtractor` statically harvests the tag paths the script reads. The sandbox rules (forbidden types, cancellation, logging sinks) are documented in [VirtualTags.md](VirtualTags.md); ScriptedAlarms does not redefine them. The known memory / CPU resource limits are documented there as well.
`AlarmPredicateContext` (`AlarmPredicateContext.cs`) is the script's `ScriptContext` subclass:
+7 -1
@@ -18,7 +18,13 @@ User scripts are compiled via `Microsoft.CodeAnalysis.CSharp.Scripting` against
`ScriptSandbox.Build` allow-lists exactly: `System.Private.CoreLib` (primitives + `Math` + `Convert`), `System.Linq`, `Core.Abstractions` (for `DataValueSnapshot` / `DriverDataType`), `Core.Scripting` (for `ScriptContext` + `Deadband`), `Serilog` (for `ILogger`), and the concrete context type's assembly. Pre-imported namespaces: `System`, `System.Linq`, `ZB.MOM.WW.OtOpcUa.Core.Abstractions`, `ZB.MOM.WW.OtOpcUa.Core.Scripting`.
`ForbiddenTypeAnalyzer.ForbiddenNamespacePrefixes` currently denies `System.IO`, `System.Net`, `System.Diagnostics`, `System.Reflection`, `System.Threading.Thread`, `System.Runtime.InteropServices`, `Microsoft.Win32`. Matching is by prefix against the resolved symbol's containing namespace, so `System.Net` catches `System.Net.Http.HttpClient` and every subnamespace. `System.Environment` is explicitly allowed.
`ForbiddenTypeAnalyzer.ForbiddenNamespacePrefixes` currently denies `System.IO`, `System.Net`, `System.Diagnostics`, `System.Reflection`, `System.Threading.Thread`, `System.Threading.Tasks`, `System.Runtime.InteropServices`, `Microsoft.Win32`. Matching is by prefix against the resolved symbol's containing namespace, so `System.Net` catches `System.Net.Http.HttpClient` and every subnamespace. `System.Threading.Tasks` is denied because scripts are synchronous predicates with no legitimate need to start background tasks — a `Task.Run` fan-out would outlive the per-evaluation timeout entirely (Core.Scripting-003). `System.Environment`, `System.AppDomain`, `System.GC`, and `System.Activator` are denied type-granularly via `ForbiddenFullTypeNames` because they live directly in the `System` namespace (which is otherwise allowed for primitives) — `Environment.Exit` / `FailFast` terminate the host process outright (Core.Scripting-001).
#### Known resource limits (accepted trade-offs)
The sandbox cannot prevent a script from **allocating unbounded memory**. A script calling `new byte[int.MaxValue]` repeatedly, or accumulating a large LINQ enumeration, can drive the server process to `OutOfMemoryException` before the 250 ms timeout fires. Script authoring is gated behind the Admin permission as the primary control; the test-harness preview (Stream F.4) allows operators to exercise a script before publishing. Out-of-process script execution is a v3 concern.
Similarly, **`System.Threading.Tasks` is now denied** (Core.Scripting-003), which prevents `Task.Run` / `Parallel` fan-out that would spawn background work outliving the timeout. However, a tight CPU-bound loop still runs on its thread-pool thread after `WaitAsync` returns — see the `TimedScriptEvaluator` remarks for detail. The orphaned thread is reclaimed when the Roslyn runtime eventually returns; in practice the operator fixes the script once the structured timeout warning appears in `scripts-*.log`.
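The orphaned-thread behaviour can be illustrated with a small sketch. It assumes the evaluator follows the general shape "run the delegate on the thread pool, then `WaitAsync` with a timeout"; `TimeoutSketch` and `EvaluateAsync` are hypothetical names, not the `TimedScriptEvaluator` API. The point is that `Task.WaitAsync` only abandons the wait: the delegate keeps running until it returns on its own.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a timeout wrapper around a synchronous predicate.
static class TimeoutSketch
{
    // Returns whether the wait timed out, plus the underlying work task so the
    // caller can observe that the work itself is still running after a timeout.
    public static async Task<(bool TimedOut, Task<bool> Work)> EvaluateAsync(
        Func<bool> predicate, TimeSpan timeout)
    {
        var work = Task.Run(predicate);
        try
        {
            await work.WaitAsync(timeout); // throws TimeoutException on expiry
            return (false, work);
        }
        catch (TimeoutException)
        {
            // The wait is over, but nothing has interrupted the delegate.
            return (true, work);
        }
    }
}
```

A caller that passes a long-running delegate gets `TimedOut == true` while `Work.IsCompleted` is still `false`, which is the trade-off the paragraph above accepts.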
### Compile cache (`CompiledScriptCache<TContext, TResult>`)
+29 -25
@@ -212,36 +212,40 @@ x64, which is not bitness-constrained like the worker). C.1 is independently
unblockable from A.2 if the goal is to wire up the scripted-alarm historian
path.
**Current state**:
**Current state (DONE — code)**:
`SdkAlarmHistorianWriteBackend` in `src\MxGateway.Worker\MxAccess\` is a
placeholder returning `RetryPlease`. The lmxopcua sidecar's `WriteAlarmEvents`
IPC slot is defined in `Ipc\Contracts.cs` but `Program.cs` constructs
`HistorianFrameHandler` without an `alarmWriter` (line 57 per the alarms plan).
The `IAlarmEventWriter` interface exists; only the production implementation
and the consumer wiring are missing.
C.1 shipped. `SdkAlarmHistorianWriteBackend.WriteBatchAsync` writes through the
real SDK entry point — **`HistorianAccess.AddStreamedValue(HistorianEvent, out
HistorianAccessError)`** in `aahClientManaged` — pinned 2026-05-18 by
decompiling the installed SDK. `Program.cs` and `Install-Services.ps1` were
already wired in the PR C.1 scaffolding. Two corrections to the assumptions
this doc was written under:
**What it needs**:
- **There is no `ArchestrAAlarmsAndEvents.SDK` writer.** That assembly
(`ArchestrAAlarmsAndEvents.SDK.Common.dll`, the only one installed) is a WCF
query-proxy base — no `AlarmHistorianWriter` type. The write path is the
`aahClientManaged` `HistorianAccess` surface.
- **The write path needs its own connection.** The query-side
`HistorianDataSource` opens `ReadOnly` sessions; `AddStreamedValue` on a
read-only session fails with `WriteToReadOnlyFile`.
`SdkAlarmHistorianWriteBackend` opens a dedicated `ReadOnly=false` connection
and shares only `HistorianClusterEndpointPicker` (not the connection object).
1. New `AahClientManagedAlarmEventWriter.cs` implementing `IAlarmEventWriter`
(defined in `Ipc\HistorianFrameHandler.cs`). Calls `aahClientManaged`'s
alarm-event write API — same path v1's `GalaxyHistorianWriter` used.
Uses `HistorianClusterEndpointPicker` for multi-node routing.
Maps `MxStatus` write outcomes to `HistorianWriteOutcome` enum
(Ack / PermanentFail / RetryPlease).
**What it needed** (all done):
2. `Program.cs` — build `AahClientManagedAlarmEventWriter` next to the
existing `BuildHistorian()` call; pass it to `HistorianFrameHandler`.
Gate behind `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED` env var (default `true`
when `OTOPCUA_HISTORIAN_ENABLED=true`).
1. `SdkAlarmHistorianWriteBackend` builds a `HistorianEvent` per
`AlarmHistorianEventDto`, calls `AddStreamedValue`, and maps
`HistorianAccessError.ErrorValue` codes through
`AahClientManagedAlarmEventWriter.MapOutcome` (Ack / PermanentFail /
RetryPlease). `HistorianClusterEndpointPicker` drives multi-node failover.
2. `Program.cs`: `BuildAlarmWriter()` constructs the backend gated behind
`OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED`.
3. `Install-Services.ps1` — env var present in the install-time block.
3. `Install-Services.ps1` — add the new env var to the install-time block.
**What blocks C.1**: access to the `aahClientManaged` SDK on the dev box
(confirmed available per `project_aveva_platform_installed.md` — AVEVA
Historian SDK is present). C.1 can proceed without A.2 since the sidecar's
`aahClientManaged` is x64 and does not share the worker's x86 bitness
constraint.
**What remains for C.1**: only the live-rig write smoke — the `Live_*` tests
in `SdkAlarmHistorianWriteBackendTests` stay `Skip`-gated until D.1 confirms a
round-trip against a real AVEVA Historian, including the exact mandatory
`HistorianEvent` field set.
**Tests to write**:
+2 -2
@@ -138,9 +138,9 @@ All three are verified closed in the 2026-04-23 exit-gate audit:
These are real open items, not issues with the plan reconciliation.
### Gap 1 — OPC UA method-call dispatch for scripted alarm Ack/Confirm/Shelve (Stream G / C.6)
### Gap 1 — OPC UA method-call dispatch for scripted alarm methods (Stream G / C.6) — CLOSED
`DriverNodeManager.MethodCall` does not route OPC UA `Acknowledge` / `Confirm` / `OneShotShelve` / `TimedShelve` / `Unshelve` / `AddComment` method invocations to the `ScriptedAlarmEngine`. Operators can acknowledge scripted alarms through the Admin UI today; OPC UA HMI clients expecting to use Part 9 method nodes directly cannot. Explicit in `phase-7-e2e-smoke.md` §"Known limitations".
All Part 9 alarm methods now route to the `ScriptedAlarmEngine`. `Acknowledge` / `Confirm` / `AddComment` route via `DriverNodeManager.RouteScriptedAlarmMethodCalls` (task #24 + follow-up); `AddComment` gates at the `AlarmAcknowledge` tier. `OneShotShelve` / `TimedShelve` / `Unshelve` route via the native `AlarmConditionState.OnShelve` / `OnTimedUnshelve` hooks wired in `MarkAsAlarmCondition`, with the per-instance shelve method NodeIds indexed so the Call gate resolves them to `OpcUaOperation.AlarmShelve`.
### Gap 2 — Admin UI: no `/virtual-tags` tab or form (Stream F.2)
+54
@@ -0,0 +1,54 @@
# Loose ends
State as of 2026-05-18, after the #9–#29 task-list run. Everything on the
formal task list is shipped except #20; the items below are what genuinely
remains, plus follow-ups surfaced during the run.
## Open task
- **#20 — D.1 dev-rig rollout smoke.** A full 3-service deployment
(gateway + worker + server + Wonderware historian sidecar): deploy the
refreshed binaries, run `scripts/install/Refresh-Services.ps1`, exercise
alarms end-to-end, and capture the rollout artifact. The code blockers
were cleared by #18; the act itself needs the physical AVEVA dev rig and
cannot be produced from a dev box. Runbook context in
`docs/plans/alarms-worker-wiring-plan.md`.
## Follow-ups surfaced during the run
- **~~C.1 live SDK binding.~~** DONE (code). `SdkAlarmHistorianWriteBackend`
(`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware/Backend/`) now
writes via the real entry point `HistorianAccess.AddStreamedValue(HistorianEvent,
out error)` in `aahClientManaged`. Two plan corrections found while pinning it:
(a) `ArchestrAAlarmsAndEvents.SDK` has no writer — it's a WCF query proxy;
(b) writes need their own `ReadOnly=false` connection, not the shared read
pool. Remaining: the live-rig write smoke (the `Live_*` tests are still
`Skip`-gated) — folds into #20 / D.1.
- **~~#24 Shelve-method routing.~~** DONE. Acknowledge / Confirm already
routed; OneShotShelve / TimedShelve / Unshelve now route via the native
`AlarmConditionState.OnShelve` / `OnTimedUnshelve` hooks wired in
`DriverNodeManager.MarkAsAlarmCondition` (scripted alarms get a shelvable
`ShelvedStateMachine` subtree created before `alarm.Create`). The three
per-instance shelve method NodeIds are indexed so the Call gate resolves
them to `OpcUaOperation.AlarmShelve`. `AddComment` also now routes to the
engine (gated at the `AlarmAcknowledge` tier) — `phase-7-status.md` Gap 1
is fully closed. Remaining: address-space materialisation of the shelve
method nodes is best confirmed by a live OPC UA browse (pairs with the
G6 / D.1 rig steps).
- **mxaccessgw alarm epic branch.** The alarm subsystem work (A.2/A.3/A.4
+ the two production-gap fixes from #18) lives on the mxaccessgw branch
`docs/alarm-client-wm-app-finding`. It is NOT merged to mxaccessgw's main.
Whether/when to merge the alarm epic to main is an open release decision.
- **#15 operator/lab GA gates.** Two v2 GA gates are manual lab steps, not
automatable here: the OPC UA CTT (Compliance Test Tool) pass and the
deployment-checklist signoff. Documented in
`docs/plans/v2-ga-lab-gates-plan.md`.
## Done — for reference
The 5 Phase 7 gaps discovered mid-run (#24–#28) were all completed and
merged; no Phase 7 gaps remain open. Add any new follow-ups above as they
are spun out.
+20
@@ -0,0 +1,20 @@
# Verifies code-reviews/README.md is regenerated from, and consistent with, the
# per-module findings.md files. Intended as a CI / pre-commit gate.
#
# Exits non-zero when README.md is stale, when a module header's "Open findings"
# count disagrees with its finding statuses, or when a finding carries an
# unrecognised Status value. See REVIEW-PROCESS.md section 5.
[CmdletBinding()]
param()
Set-StrictMode -Version Latest
$ErrorActionPreference = "Stop"
$repoRoot = Resolve-Path (Join-Path $PSScriptRoot "..")
$script = Join-Path $repoRoot "code-reviews/regen-readme.py"
# The bare `python3` alias on this platform resolves to the Windows Store stub;
# `python` is the real interpreter.
& python $script --check
exit $LASTEXITCODE
@@ -1,7 +1,9 @@
using System.Threading.Channels;
using CliFx.Attributes;
using CliFx.Infrastructure;
using ZB.MOM.WW.OtOpcUa.Client.CLI.Helpers;
using ZB.MOM.WW.OtOpcUa.Client.Shared;
using ZB.MOM.WW.OtOpcUa.Client.Shared.Models;
namespace ZB.MOM.WW.OtOpcUa.Client.CLI.Commands;
@@ -49,19 +51,33 @@ public class AlarmsCommand : CommandBase
var sourceNodeId = NodeIdParser.Parse(NodeId);
service.AlarmEvent += (_, e) =>
// Channel serialises SDK notification-thread writes to the main async loop so
// that concurrent alarm callbacks never interleave on the shared TextWriter.
var outputChannel = Channel.CreateUnbounded<string>(
new UnboundedChannelOptions { SingleReader = true });
void AlarmEventHandler(object? sender, AlarmEventArgs e)
{
console.Output.WriteLine($"[{e.Time:O}] ALARM {e.SourceName}");
console.Output.WriteLine($" Condition: {e.ConditionName}");
var activeStr = e.ActiveState ? "Active" : "Inactive";
var ackedStr = e.AckedState ? "Acknowledged" : "Unacknowledged";
console.Output.WriteLine($" State: {activeStr}, {ackedStr}");
console.Output.WriteLine($" Severity: {e.Severity}");
if (!string.IsNullOrEmpty(e.Message))
console.Output.WriteLine($" Message: {e.Message}");
console.Output.WriteLine($" Retain: {e.Retain}");
console.Output.WriteLine();
};
try
{
var activeStr = e.ActiveState ? "Active" : "Inactive";
var ackedStr = e.AckedState ? "Acknowledged" : "Unacknowledged";
outputChannel.Writer.TryWrite($"[{e.Time:O}] ALARM {e.SourceName}");
outputChannel.Writer.TryWrite($" Condition: {e.ConditionName}");
outputChannel.Writer.TryWrite($" State: {activeStr}, {ackedStr}");
outputChannel.Writer.TryWrite($" Severity: {e.Severity}");
if (!string.IsNullOrEmpty(e.Message))
outputChannel.Writer.TryWrite($" Message: {e.Message}");
outputChannel.Writer.TryWrite($" Retain: {e.Retain}");
outputChannel.Writer.TryWrite(string.Empty);
}
catch
{
// Never let handler exceptions escape into the SDK callback.
}
}
service.AlarmEvent += AlarmEventHandler;
await service.SubscribeAlarmsAsync(sourceNodeId, Interval, ct);
await console.Output.WriteLineAsync(
@@ -78,6 +94,14 @@ public class AlarmsCommand : CommandBase
await console.Output.WriteLineAsync($"Condition refresh not supported: {ex.Message}");
}
// Drain the output channel on a single background reader task until cancellation fires.
using var drainCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
var drainTask = Task.Run(async () =>
{
await foreach (var line in outputChannel.Reader.ReadAllAsync(drainCts.Token))
await console.Output.WriteLineAsync(line);
}, CancellationToken.None);
// Wait until cancellation
try
{
@@ -88,6 +112,12 @@ public class AlarmsCommand : CommandBase
// Expected on Ctrl+C
}
// Stop accepting new notifications before writing final output.
service.AlarmEvent -= AlarmEventHandler;
outputChannel.Writer.Complete();
await drainCts.CancelAsync();
try { await drainTask; } catch (OperationCanceledException) { }
await service.UnsubscribeAlarmsAsync();
await console.Output.WriteLineAsync("Unsubscribed.");
}
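The channel-based output serialisation used in this command can be sketched in isolation. This is a minimal illustration under stated assumptions: `ChannelDrainSketch` is a hypothetical name, a `List<string>` stands in for the console writer, and thread-pool tasks stand in for SDK notification threads. Unlike the hunk above, this variant ends the reader by completing the writer rather than cancelling a token, so every queued line is drained before the method returns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch: many producers, one reader that owns the output sink.
static class ChannelDrainSketch
{
    public static async Task<List<string>> RunAsync(int producers, int linesPerProducer)
    {
        var channel = Channel.CreateUnbounded<string>(
            new UnboundedChannelOptions { SingleReader = true });
        var written = new List<string>(); // stand-in for the shared TextWriter

        // Single reader: the only code that ever touches `written`.
        var drain = Task.Run(async () =>
        {
            await foreach (var line in channel.Reader.ReadAllAsync())
                written.Add(line);
        });

        // Concurrent producers only enqueue; TryWrite on an unbounded channel
        // does not block the calling thread.
        await Task.WhenAll(Enumerable.Range(0, producers).Select(p => Task.Run(() =>
        {
            for (var i = 0; i < linesPerProducer; i++)
                channel.Writer.TryWrite($"producer {p} line {i}");
        })));

        // Completing the writer lets ReadAllAsync finish once the queue is empty.
        channel.Writer.Complete();
        await drain;
        return written;
    }
}
```

Because only the reader writes to the sink, lines from different producers can arrive in any order but are never interleaved within a line.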
@@ -1,3 +1,4 @@
using System.Globalization;
using CliFx.Attributes;
using CliFx.Infrastructure;
using Opc.Ua;
@@ -27,13 +28,13 @@ public class HistoryReadCommand : CommandBase
/// <summary>
/// Gets the optional history start time string supplied by the operator.
/// </summary>
[CommandOption("start", Description = "Start time (ISO 8601 or date string, default: 24 hours ago)")]
[CommandOption("start", Description = "Start time in ISO 8601 UTC format, e.g. 2026-01-15T08:00:00Z (default: 24 hours ago)")]
public string? StartTime { get; init; }
/// <summary>
/// Gets the optional history end time string supplied by the operator.
/// </summary>
[CommandOption("end", Description = "End time (ISO 8601 or date string, default: now)")]
[CommandOption("end", Description = "End time in ISO 8601 UTC format, e.g. 2026-01-15T09:00:00Z (default: now)")]
public string? EndTime { get; init; }
/// <summary>
@@ -70,10 +71,12 @@ public class HistoryReadCommand : CommandBase
var nodeId = NodeIdParser.ParseRequired(NodeId);
var start = string.IsNullOrEmpty(StartTime)
? DateTime.UtcNow.AddHours(-24)
: DateTime.Parse(StartTime).ToUniversalTime();
: DateTime.Parse(StartTime, CultureInfo.InvariantCulture,
DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);
var end = string.IsNullOrEmpty(EndTime)
? DateTime.UtcNow
: DateTime.Parse(EndTime).ToUniversalTime();
: DateTime.Parse(EndTime, CultureInfo.InvariantCulture,
DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);
IReadOnlyList<DataValue> values;
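The effect of the new `DateTime.Parse` arguments can be shown with a small standalone sketch. `UtcParseSketch.ParseUtc` is a hypothetical helper that wraps the same call; it is not part of the command.

```csharp
using System;
using System.Globalization;

// Hypothetical helper wrapping the parse call used for --start / --end.
static class UtcParseSketch
{
    public static DateTime ParseUtc(string text) =>
        DateTime.Parse(
            text,
            CultureInfo.InvariantCulture,   // parsing no longer depends on the machine locale
            DateTimeStyles.AssumeUniversal  // input without an offset is treated as UTC
            | DateTimeStyles.AdjustToUniversal); // input with an offset is converted to UTC
}
```

With these styles, `2026-01-15T08:00:00Z`, `2026-01-15T08:00:00` and `2026-01-15T10:00:00+02:00` all parse to 08:00 on 2026-01-15 with `DateTimeKind.Utc`. The previous `DateTime.Parse(text).ToUniversalTime()` call would have treated the offset-less form as local time.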
@@ -1,9 +1,11 @@
using System.Collections.Concurrent;
using System.Threading.Channels;
using CliFx.Attributes;
using CliFx.Infrastructure;
using Opc.Ua;
using ZB.MOM.WW.OtOpcUa.Client.CLI.Helpers;
using ZB.MOM.WW.OtOpcUa.Client.Shared;
using ZB.MOM.WW.OtOpcUa.Client.Shared.Models;
namespace ZB.MOM.WW.OtOpcUa.Client.CLI.Commands;
@@ -63,19 +65,35 @@ public class SubscribeCommand : CommandBase
var everBad = new ConcurrentDictionary<string, byte>();
var displayNameByNodeId = targets.ToDictionary(t => t.nodeId.ToString(), t => t.displayPath);
service.DataChanged += (_, e) =>
// Channel serialises notification-thread writes to the main async loop so that
// concurrent SDK callbacks and main-thread summary output never interleave on
// the shared TextWriter.
var outputChannel = Channel.CreateUnbounded<string>(
new UnboundedChannelOptions { SingleReader = true });
void DataChangedHandler(object? sender, DataChangedEventArgs e)
{
var key = e.NodeId.ToString();
lastStatus[key] = (e.Value.StatusCode, DateTime.UtcNow, e.Value.Value);
updateCount.AddOrUpdate(key, 1, (_, v) => v + 1);
if (!StatusCode.IsGood(e.Value.StatusCode))
everBad.TryAdd(key, 0);
if (!Quiet)
try
{
console.Output.WriteLine(
$"[{e.Value.SourceTimestamp:O}] {displayNameByNodeId.GetValueOrDefault(key, key)} = {e.Value.Value} ({e.Value.StatusCode})");
var key = e.NodeId.ToString();
lastStatus[key] = (e.Value.StatusCode, DateTime.UtcNow, e.Value.Value);
updateCount.AddOrUpdate(key, 1, (_, v) => v + 1);
if (!StatusCode.IsGood(e.Value.StatusCode))
everBad.TryAdd(key, 0);
if (!Quiet)
{
var line =
$"[{e.Value.SourceTimestamp:O}] {displayNameByNodeId.GetValueOrDefault(key, key)} = {e.Value.Value} ({e.Value.StatusCode})";
outputChannel.Writer.TryWrite(line);
}
}
};
catch
{
// Never let handler exceptions escape into the SDK callback.
}
}
service.DataChanged += DataChangedHandler;
var subscribed = 0;
foreach (var (nodeId, _) in targets)
@@ -94,6 +112,14 @@ public class SubscribeCommand : CommandBase
await console.Output.WriteLineAsync(
$"Subscribed to {subscribed}/{targets.Count} nodes (interval: {Interval}ms). Press Ctrl+C to stop and print summary.");
// Drain the output channel on a single background reader task until cancellation fires.
using var drainCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
var drainTask = Task.Run(async () =>
{
await foreach (var line in outputChannel.Reader.ReadAllAsync(drainCts.Token))
await console.Output.WriteLineAsync(line);
}, CancellationToken.None);
try
{
if (DurationSeconds > 0)
@@ -105,6 +131,12 @@ public class SubscribeCommand : CommandBase
{
}
// Stop accepting new notifications before writing the summary.
service.DataChanged -= DataChangedHandler;
outputChannel.Writer.Complete();
await drainCts.CancelAsync();
try { await drainTask; } catch (OperationCanceledException) { }
// Summary
var summary = new List<string>();
summary.Add("");
@@ -127,10 +159,10 @@ public class SubscribeCommand : CommandBase
}
var neverWentBad = targets
.Where(t => !everBad.ContainsKey(t.nodeId.ToString()))
.Where(t => lastStatus.ContainsKey(t.nodeId.ToString()) && !everBad.ContainsKey(t.nodeId.ToString()))
.Select(t => t.displayPath)
.ToList();
var didGoBad = targets.Count - neverWentBad.Count;
var didGoBad = targets.Count(t => everBad.ContainsKey(t.nodeId.ToString()));
summary.Add($"Total subscribed: {targets.Count}");
summary.Add($" Ever went BAD during window: {didGoBad}");
@@ -10,23 +10,53 @@ public static class ValueConverter
/// Converts a raw string value into the runtime type expected by the target node.
/// </summary>
/// <param name="rawValue">The raw string supplied by the user.</param>
/// <param name="currentValue">The current node value used to infer the target type. May be null.</param>
/// <param name="currentValue">
/// The current node value used to infer the target type. When <c>null</c> the raw string
/// is returned unchanged; callers should validate this case before writing.
/// </param>
/// <returns>A typed value suitable for an OPC UA write request.</returns>
/// <exception cref="FormatException">
/// Thrown with a descriptive message when <paramref name="rawValue"/> cannot be parsed
/// into the type inferred from <paramref name="currentValue"/>.
/// </exception>
public static object ConvertValue(string rawValue, object? currentValue)
{
return currentValue switch
try
{
bool => bool.Parse(rawValue),
byte => byte.Parse(rawValue),
short => short.Parse(rawValue),
ushort => ushort.Parse(rawValue),
int => int.Parse(rawValue),
uint => uint.Parse(rawValue),
long => long.Parse(rawValue),
ulong => ulong.Parse(rawValue),
float => float.Parse(rawValue),
double => double.Parse(rawValue),
_ => rawValue
return currentValue switch
{
bool => ParseBool(rawValue),
byte => byte.Parse(rawValue),
short => short.Parse(rawValue),
ushort => ushort.Parse(rawValue),
int => int.Parse(rawValue),
uint => uint.Parse(rawValue),
long => long.Parse(rawValue),
ulong => ulong.Parse(rawValue),
float => float.Parse(rawValue),
double => double.Parse(rawValue),
_ => rawValue
};
}
catch (Exception ex) when (ex is FormatException or OverflowException)
{
var targetType = currentValue?.GetType().Name ?? "unknown";
throw new FormatException(
$"Cannot convert value \"{rawValue}\" to target type {targetType}: {ex.Message}", ex);
}
}
/// <summary>
/// Parses a boolean from common string representations including numeric and word forms.
/// Accepts: <c>true</c>/<c>false</c>, <c>1</c>/<c>0</c>, <c>yes</c>/<c>no</c> (case-insensitive).
/// </summary>
private static bool ParseBool(string rawValue)
{
return rawValue.Trim().ToLowerInvariant() switch
{
"true" or "1" or "yes" => true,
"false" or "0" or "no" => false,
_ => throw new FormatException($"String '{rawValue}' was not recognized as a valid Boolean.")
};
}
}
@@ -15,9 +15,20 @@ public sealed class OpcUaClientService : IOpcUaClientService
{
private static readonly ILogger Logger = Log.ForContext<OpcUaClientService>();
// Guards all access to the subscription-bookkeeping state below
// (_activeDataSubscriptions and _activeAlarmSubscription). The dictionary
// and tuple are mutated from the caller thread, the keep-alive failover
// path, and DisconnectAsync, so every read/write must be inside this lock.
private readonly object _subscriptionLock = new();
// Track active data subscriptions for replay after failover
private readonly Dictionary<string, (NodeId NodeId, int IntervalMs, uint Handle)> _activeDataSubscriptions = new();
// Re-entry guard for HandleKeepAliveFailureAsync. The OPC UA stack raises
// KeepAlive repeatedly while a session is down; only one failover loop may
// run at a time. 0 = idle, 1 = failover in progress (Interlocked-managed).
private int _failoverInProgress;
private readonly IApplicationConfigurationFactory _configFactory;
private readonly IEndpointDiscovery _endpointDiscovery;
@@ -146,8 +157,12 @@ public sealed class OpcUaClientService : IOpcUaClientService
}
finally
{
_activeDataSubscriptions.Clear();
_activeAlarmSubscription = null;
lock (_subscriptionLock)
{
_activeDataSubscriptions.Clear();
_activeAlarmSubscription = null;
}
CurrentConnectionInfo = null;
TransitionState(ConnectionState.Disconnected, endpointUrl);
}
@@ -172,6 +187,9 @@ public sealed class OpcUaClientService : IOpcUaClientService
if (value is string rawString)
{
var currentDataValue = await _session!.ReadValueAsync(nodeId, ct);
if (!StatusCode.IsGood(currentDataValue.StatusCode) || currentDataValue.Value == null)
throw new InvalidOperationException(
$"Cannot infer target type for node {nodeId}: read returned status {currentDataValue.StatusCode} with no value. Provide a typed value instead of a string.");
typedValue = ValueConverter.ConvertValue(rawString, currentDataValue.Value);
}
@@ -223,15 +241,22 @@ public sealed class OpcUaClientService : IOpcUaClientService
ThrowIfNotConnected();
var nodeIdStr = nodeId.ToString();
if (_activeDataSubscriptions.ContainsKey(nodeIdStr))
return; // Already subscribed
lock (_subscriptionLock)
{
if (_activeDataSubscriptions.ContainsKey(nodeIdStr))
return; // Already subscribed
}
if (_dataSubscription == null) _dataSubscription = await _session!.CreateSubscriptionAsync(intervalMs, ct);
var handle = await _dataSubscription.AddDataChangeMonitoredItemAsync(
nodeId, intervalMs, OnDataChangeNotification, ct);
_activeDataSubscriptions[nodeIdStr] = (nodeId, intervalMs, handle);
lock (_subscriptionLock)
{
_activeDataSubscriptions[nodeIdStr] = (nodeId, intervalMs, handle);
}
Logger.Debug("Subscribed to data changes on {NodeId}", nodeId);
}
@@ -241,12 +266,20 @@ public sealed class OpcUaClientService : IOpcUaClientService
ThrowIfDisposed();
var nodeIdStr = nodeId.ToString();
if (!_activeDataSubscriptions.TryGetValue(nodeIdStr, out var sub))
return; // Not subscribed, safe to ignore
(NodeId NodeId, int IntervalMs, uint Handle) sub;
lock (_subscriptionLock)
{
if (!_activeDataSubscriptions.TryGetValue(nodeIdStr, out sub))
return; // Not subscribed, safe to ignore
}
if (_dataSubscription != null) await _dataSubscription.RemoveMonitoredItemAsync(sub.Handle, ct);
_activeDataSubscriptions.Remove(nodeIdStr);
lock (_subscriptionLock)
{
_activeDataSubscriptions.Remove(nodeIdStr);
}
Logger.Debug("Unsubscribed from data changes on {NodeId}", nodeId);
}
@@ -267,7 +300,11 @@ public sealed class OpcUaClientService : IOpcUaClientService
await _alarmSubscription.AddEventMonitoredItemAsync(
monitorNode, intervalMs, filter, OnAlarmEventNotification, ct);
_activeAlarmSubscription = (sourceNodeId, intervalMs);
lock (_subscriptionLock)
{
_activeAlarmSubscription = (sourceNodeId, intervalMs);
}
Logger.Debug("Subscribed to alarm events on {NodeId}", monitorNode);
}
@@ -281,7 +318,12 @@ public sealed class OpcUaClientService : IOpcUaClientService
await _alarmSubscription.DeleteAsync(ct);
_alarmSubscription = null;
_activeAlarmSubscription = null;
lock (_subscriptionLock)
{
_activeAlarmSubscription = null;
}
Logger.Debug("Unsubscribed from alarm events");
}
@@ -349,10 +391,17 @@ public sealed class OpcUaClientService : IOpcUaClientService
var redundancySupportValue =
await _session!.ReadValueAsync(VariableIds.Server_ServerRedundancy_RedundancySupport, ct);
var redundancyMode = ((RedundancySupport)(int)redundancySupportValue.Value).ToString();
RedundancySupport redundancySupport;
if (StatusCode.IsGood(redundancySupportValue.StatusCode) && redundancySupportValue.Value != null)
redundancySupport = (RedundancySupport)Convert.ToInt32(redundancySupportValue.Value);
else
redundancySupport = RedundancySupport.None;
var redundancyMode = redundancySupport.ToString();
var serviceLevelValue = await _session.ReadValueAsync(VariableIds.Server_ServiceLevel, ct);
var serviceLevel = (byte)serviceLevelValue.Value;
var serviceLevel = StatusCode.IsGood(serviceLevelValue.StatusCode) && serviceLevelValue.Value != null
? Convert.ToByte(serviceLevelValue.Value)
: (byte)0;
string[] serverUris = [];
try
@@ -393,8 +442,13 @@ public sealed class OpcUaClientService : IOpcUaClientService
_dataSubscription?.Dispose();
_alarmSubscription?.Dispose();
_session?.Dispose();
_activeDataSubscriptions.Clear();
_activeAlarmSubscription = null;
lock (_subscriptionLock)
{
_activeDataSubscriptions.Clear();
_activeAlarmSubscription = null;
}
CurrentConnectionInfo = null;
_state = ConnectionState.Disconnected;
}
@@ -430,6 +484,26 @@ public sealed class OpcUaClientService : IOpcUaClientService
}
private async Task HandleKeepAliveFailureAsync()
{
// Serialize failover: the OPC UA stack raises KeepAlive repeatedly
// while a session is down, so multiple bad keep-alives can fire before
// the first failover loop finishes. CompareExchange atomically claims
// the failover slot; a re-entrant call sees 1 and returns immediately,
// guaranteeing exactly one failover loop runs at a time.
if (Interlocked.CompareExchange(ref _failoverInProgress, 1, 0) != 0)
return;
try
{
await RunFailoverAsync();
}
finally
{
Interlocked.Exchange(ref _failoverInProgress, 0);
}
}
private async Task RunFailoverAsync()
{
if (_state == ConnectionState.Reconnecting || _state == ConnectionState.Disconnected)
return;
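The `Interlocked.CompareExchange` re-entry guard introduced above can be sketched on its own. `FailoverGuardSketch` and `TryRunAsync` are hypothetical names used only for illustration. The sketch returns a flag so the skipped call is observable, whereas the method above simply returns.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a single-flight guard around an async operation.
sealed class FailoverGuardSketch
{
    private int _inProgress; // 0 = idle, 1 = operation running

    // Returns true if this call ran the operation, false if another call
    // already held the slot and this one was skipped.
    public async Task<bool> TryRunAsync(Func<Task> operation)
    {
        // Atomically claim the slot: only the caller that flips 0 -> 1 proceeds.
        if (Interlocked.CompareExchange(ref _inProgress, 1, 0) != 0)
            return false;

        try
        {
            await operation();
            return true;
        }
        finally
        {
            // Always release the slot, even if the operation throws.
            Interlocked.Exchange(ref _inProgress, 0);
        }
    }
}
```

A second call made while the first is still awaiting returns `false` immediately. Once the first call completes, the slot is free and a later call runs normally.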
@@ -498,33 +572,43 @@ public sealed class OpcUaClientService : IOpcUaClientService
private async Task ReplaySubscriptionsAsync()
{
// Replay data subscriptions
if (_activeDataSubscriptions.Count > 0)
// Snapshot the bookkeeping state under the lock, then clear it so the
// replayed handles can be recorded fresh as each monitored item is
// re-created. Awaited calls run outside the lock.
List<KeyValuePair<string, (NodeId NodeId, int IntervalMs, uint Handle)>> subscriptions;
(NodeId? SourceNodeId, int IntervalMs)? alarmSubscription;
lock (_subscriptionLock)
{
var subscriptions = _activeDataSubscriptions.ToList();
subscriptions = _activeDataSubscriptions.ToList();
alarmSubscription = _activeAlarmSubscription;
_activeDataSubscriptions.Clear();
foreach (var (nodeIdStr, (nodeId, intervalMs, _)) in subscriptions)
try
{
if (_dataSubscription == null)
_dataSubscription = await _session!.CreateSubscriptionAsync(intervalMs, CancellationToken.None);
var handle = await _dataSubscription.AddDataChangeMonitoredItemAsync(
nodeId, intervalMs, OnDataChangeNotification, CancellationToken.None);
_activeDataSubscriptions[nodeIdStr] = (nodeId, intervalMs, handle);
}
catch (Exception ex)
{
Logger.Warning(ex, "Failed to replay data subscription for {NodeId}", nodeIdStr);
}
_activeAlarmSubscription = null;
}
// Replay data subscriptions
foreach (var (nodeIdStr, (nodeId, intervalMs, _)) in subscriptions)
try
{
if (_dataSubscription == null)
_dataSubscription = await _session!.CreateSubscriptionAsync(intervalMs, CancellationToken.None);
var handle = await _dataSubscription.AddDataChangeMonitoredItemAsync(
nodeId, intervalMs, OnDataChangeNotification, CancellationToken.None);
lock (_subscriptionLock)
{
_activeDataSubscriptions[nodeIdStr] = (nodeId, intervalMs, handle);
}
}
catch (Exception ex)
{
Logger.Warning(ex, "Failed to replay data subscription for {NodeId}", nodeIdStr);
}
// Replay alarm subscription
if (_activeAlarmSubscription.HasValue)
if (alarmSubscription.HasValue)
{
var (sourceNodeId, intervalMs) = _activeAlarmSubscription.Value;
_activeAlarmSubscription = null;
var (sourceNodeId, intervalMs) = alarmSubscription.Value;
try
{
var monitorNode = sourceNodeId ?? ObjectIds.Server;
@@ -532,7 +616,11 @@ public sealed class OpcUaClientService : IOpcUaClientService
var filter = CreateAlarmEventFilter();
await _alarmSubscription.AddEventMonitoredItemAsync(
monitorNode, intervalMs, filter, OnAlarmEventNotification, CancellationToken.None);
_activeAlarmSubscription = (sourceNodeId, intervalMs);
lock (_subscriptionLock)
{
_activeAlarmSubscription = (sourceNodeId, intervalMs);
}
}
catch (Exception ex)
{
@@ -549,7 +637,7 @@ public sealed class OpcUaClientService : IOpcUaClientService
private void OnAlarmEventNotification(EventFieldList eventFields)
{
var fields = eventFields.EventFields;
if (fields == null || fields.Count < 6)
if (fields == null || fields.Count < 1)
return;
var eventId = fields.Count > 0 ? fields[0].Value as byte[] : null;
@@ -578,6 +666,8 @@ public sealed class OpcUaClientService : IOpcUaClientService
// Fallback: read InAlarm/Acked from condition node Galaxy attributes
// when the server doesn't populate standard event fields.
// Must run on a background thread to avoid deadlocking the notification thread.
// Capture the session reference now; skip supplemental reads if the session has
// been replaced by a concurrent failover before the Task.Run body executes.
if (ackedField == null && activeField == null && conditionNodeId != null && _session != null)
{
var session = _session;
@@ -585,6 +675,11 @@ public sealed class OpcUaClientService : IOpcUaClientService
var capturedMessage = message;
_ = Task.Run(async () =>
{
// If the session was replaced by a failover before we started reading,
// skip the supplemental reads to avoid hitting a disposed session.
if (!ReferenceEquals(session, _session))
return;
try
{
var inAlarmValue = await session.ReadValueAsync(NodeId.Parse($"{capturedConditionNodeId}.InAlarm"));
@@ -609,9 +704,16 @@ public sealed class OpcUaClientService : IOpcUaClientService
}
catch { /* DescAttrName may not exist */ }
}
catch (ObjectDisposedException)
{
// Session was disposed during supplemental reads (concurrent failover);
// skip the event rather than delivering stale/default states.
Logger.Debug("Supplemental alarm read skipped — session disposed during failover for {ConditionNodeId}", capturedConditionNodeId);
return;
}
catch
{
// Supplemental read failed; use defaults
// Other supplemental read failure; deliver event with defaults
}
AlarmEvent?.Invoke(this, new AlarmEventArgs(
@@ -17,11 +17,6 @@ public sealed class UserSettings
/// </summary>
public string? Username { get; set; }
/// <summary>
/// Gets or sets the persisted password for authenticated sessions.
/// </summary>
public string? Password { get; set; }
/// <summary>
/// Gets or sets the transport security mode selected by the user.
/// </summary>
@@ -215,6 +215,16 @@ public partial class AlarmsViewModel : ObservableObject
ActiveAlarmCount = 0;
}
/// <summary>
/// Re-hooks event handlers to the service after a server-side reconnect.
/// Safe to call when already attached: the handler is removed before it is re-added, so it is never registered twice.
/// </summary>
public void Reattach()
{
_service.AlarmEvent -= OnAlarmEvent;
_service.AlarmEvent += OnAlarmEvent;
}
/// <summary>
/// Unhooks event handlers from the service.
/// </summary>
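The remove-then-add pattern in `Reattach` matters because subscribing the same handler twice to a .NET event invokes it twice. A minimal sketch, using hypothetical `Publisher` and `Subscriber` types rather than the view models above:

```csharp
using System;

// Hypothetical event source standing in for the client service.
sealed class Publisher
{
    public event EventHandler? Changed;
    public void Raise() => Changed?.Invoke(this, EventArgs.Empty);
}

// Hypothetical subscriber standing in for a view model.
sealed class Subscriber
{
    private readonly Publisher _publisher;
    public int Calls { get; private set; }

    public Subscriber(Publisher publisher) => _publisher = publisher;

    // Plain += : calling this twice registers the handler twice.
    public void AttachNaively() => _publisher.Changed += OnChanged;

    // Remove first, then add: idempotent however often it is called.
    // Removing a handler that is not attached is a harmless no-op.
    public void Reattach()
    {
        _publisher.Changed -= OnChanged;
        _publisher.Changed += OnChanged;
    }

    private void OnChanged(object? sender, EventArgs e) => Calls++;
}
```

Calling `Reattach()` twice and raising the event once increments `Calls` by one; calling `AttachNaively()` twice and raising once increments it by two.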
@@ -73,7 +73,7 @@ public partial class HistoryViewModel : ObservableObject
{
if (string.IsNullOrEmpty(SelectedNodeId)) return;
IsLoading = true;
_dispatcher.Post(() => IsLoading = true);
_dispatcher.Post(() => Results.Clear());
try
@@ -10,7 +10,7 @@ namespace ZB.MOM.WW.OtOpcUa.Client.UI.ViewModels;
/// <summary>
/// Main window ViewModel coordinating all panels.
/// </summary>
public partial class MainWindowViewModel : ObservableObject
public partial class MainWindowViewModel : ObservableObject, IDisposable
{
private readonly IUiDispatcher _dispatcher;
private readonly IOpcUaClientServiceFactory _factory;
@@ -166,6 +166,8 @@ public partial class MainWindowViewModel : ObservableObject
{
case ConnectionState.Connected:
StatusMessage = $"Connected to {EndpointUrl}";
Subscriptions?.Reattach();
Alarms?.Reattach();
break;
case ConnectionState.Reconnecting:
StatusMessage = "Reconnecting...";
@@ -177,6 +179,8 @@ public partial class MainWindowViewModel : ObservableObject
StatusMessage = "Disconnected";
SessionLabel = string.Empty;
RedundancyInfo = null;
Subscriptions?.Teardown();
Alarms?.Teardown();
BrowseTree?.Clear();
ReadWrite?.Clear();
Subscriptions?.Clear();
@@ -252,7 +256,7 @@ public partial class MainWindowViewModel : ObservableObject
}
// Load root nodes
await BrowseTree.LoadRootsAsync();
if (BrowseTree != null) await BrowseTree.LoadRootsAsync();
// Restore saved subscriptions
if (_savedSubscribedNodes.Count > 0 && Subscriptions != null)
@@ -330,7 +334,7 @@ public partial class MainWindowViewModel : ObservableObject
if (SelectedTreeNodes.Count == 0 || !IsConnected) return;
var node = SelectedTreeNodes[0];
History.SelectedNodeId = node.NodeId;
if (History != null) History.SelectedNodeId = node.NodeId;
SelectedTabIndex = 3; // History tab
}
@@ -376,7 +380,7 @@ public partial class MainWindowViewModel : ObservableObject
var s = _settingsService.Load();
EndpointUrl = s.EndpointUrl;
Username = s.Username;
Password = s.Password;
// Password is intentionally not persisted (security: re-prompt each launch)
SelectedSecurityMode = s.SecurityMode;
FailoverUrls = s.FailoverUrls;
SessionTimeoutSeconds = s.SessionTimeoutSeconds;
@@ -396,7 +400,7 @@ public partial class MainWindowViewModel : ObservableObject
{
EndpointUrl = EndpointUrl,
Username = Username,
Password = Password,
// Password is intentionally not persisted (security: re-prompt each launch)
SecurityMode = SelectedSecurityMode,
FailoverUrls = FailoverUrls,
SessionTimeoutSeconds = SessionTimeoutSeconds,
@@ -407,6 +411,21 @@ public partial class MainWindowViewModel : ObservableObject
});
}
/// <summary>
/// Detaches the connection-state handler and disposes the OPC UA client service, releasing
/// the session, certificate validator, and any background reconnect resources.
/// </summary>
public void Dispose()
{
if (_service != null)
{
_service.ConnectionStateChanged -= OnConnectionStateChanged;
Subscriptions?.Teardown();
Alarms?.Teardown();
_service.Dispose();
}
}
private static string[]? ParseFailoverUrls(string? csv)
{
if (string.IsNullOrWhiteSpace(csv))
@@ -265,6 +265,16 @@ public partial class SubscriptionsViewModel : ObservableObject
SubscriptionCount = 0;
}
/// <summary>
/// Re-hooks event handlers to the service after a server-side reconnect.
/// Safe to call when already attached: the handler is removed before it is re-added, so it is never registered twice.
/// </summary>
public void Reattach()
{
_service.DataChanged -= OnDataChanged;
_service.DataChanged += OnDataChanged;
}
/// <summary>
/// Unhooks event handlers from the service.
/// </summary>
@@ -140,7 +140,10 @@ public partial class MainWindow : Window
protected override void OnClosing(WindowClosingEventArgs e)
{
if (DataContext is MainWindowViewModel vm)
{
vm.SaveSettings();
vm.Dispose();
}
base.OnClosing(e);
}
}
@@ -5,19 +5,27 @@ namespace ZB.MOM.WW.OtOpcUa.Configuration;
/// <summary>
/// Used by <c>dotnet ef</c> at design time (migrations, scaffolding). Reads the connection string
/// from the <c>OTOPCUA_CONFIG_CONNECTION</c> environment variable, falling back to the local dev
/// container on <c>localhost:1433</c>.
/// from the <c>OTOPCUA_CONFIG_CONNECTION</c> environment variable.
/// </summary>
/// <remarks>
/// Set the variable before running migration commands, e.g.:
/// <code>
/// $env:OTOPCUA_CONFIG_CONNECTION = "Server=10.100.0.35,14330;Database=OtOpcUaConfig;Trusted_Connection=True;TrustServerCertificate=True;"
/// dotnet ef migrations add MyMigration --project src/Core/ZB.MOM.WW.OtOpcUa.Configuration
/// </code>
/// No credential is embedded in source. Do not add a plaintext password as a fallback.
/// </remarks>
public sealed class DesignTimeDbContextFactory : IDesignTimeDbContextFactory<OtOpcUaConfigDbContext>
{
// Host-port 14330 avoids collision with the native MSSQL14 instance on 1433 (Galaxy "ZB" DB).
private const string DefaultConnectionString =
"Server=localhost,14330;Database=OtOpcUaConfig;User Id=sa;Password=OtOpcUaDev_2026!;TrustServerCertificate=True;Encrypt=False;";
public OtOpcUaConfigDbContext CreateDbContext(string[] args)
{
var connection = Environment.GetEnvironmentVariable("OTOPCUA_CONFIG_CONNECTION")
?? DefaultConnectionString;
var connection = Environment.GetEnvironmentVariable("OTOPCUA_CONFIG_CONNECTION");
if (string.IsNullOrWhiteSpace(connection))
throw new InvalidOperationException(
"OTOPCUA_CONFIG_CONNECTION is not set. " +
"Export the variable before running 'dotnet ef' commands. Example: " +
"Server=10.100.0.35,14330;Database=OtOpcUaConfig;Trusted_Connection=True;TrustServerCertificate=True;");
var options = new DbContextOptionsBuilder<OtOpcUaConfigDbContext>()
.UseSqlServer(connection, sql => sql.MigrationsAssembly(typeof(OtOpcUaConfigDbContext).Assembly.FullName))
@@ -8,5 +8,11 @@ public enum NodeAclScopeKind
UnsArea,
UnsLine,
Equipment,
/// <summary>
/// A Galaxy (SystemPlatform-kind) folder segment anchored below a namespace.
/// Distinguishes folder grants from UNS <see cref="Equipment"/> grants in the
/// <c>AuthorizationDecision.Provenance</c> audit trail and Admin UI diagnostics.
/// </summary>
FolderSegment,
Tag,
}
@@ -48,7 +48,13 @@ public sealed class ResilientConfigReader
UseJitter = true,
Delay = TimeSpan.FromMilliseconds(100),
MaxDelay = TimeSpan.FromSeconds(1),
ShouldHandle = new PredicateBuilder().Handle<Exception>(ex => ex is not OperationCanceledException),
// Handle ALL exceptions including OperationCanceledException. A SQL command-level
// timeout surfaces as TaskCanceledException (derives from OperationCanceledException)
// when the caller's token is NOT cancelled, and must be retried just like any other
// transient error. Polly itself checks the cancellation token between retries and
// stops with OperationCanceledException on genuine caller cancellation regardless of
// this predicate.
ShouldHandle = new PredicateBuilder().Handle<Exception>(),
});
}
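Both hunks in this file hinge on the same distinction: an `OperationCanceledException` raised while the caller's token is not cancelled (for example a command-level timeout) should be treated as a transient failure, whereas one raised after the caller cancelled should propagate. The sketch below isolates the exception filter used in the fallback path; `FallbackDemo`, `read`, and `fallback` are hypothetical stand-ins, not the real `ResilientConfigReader` API.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FallbackDemo
{
    // 'read' stands in for the central-DB read, 'fallback' for the sealed-cache read.
    public static async Task<string> ReadWithFallbackAsync(
        Func<CancellationToken, Task<string>> read,
        Func<string> fallback,
        CancellationToken cancellationToken)
    {
        try
        {
            return await read(cancellationToken).ConfigureAwait(false);
        }
        // Fall back on every failure except a cancellation the caller actually requested.
        catch (Exception ex) when (ex is not OperationCanceledException
                                   || !cancellationToken.IsCancellationRequested)
        {
            return fallback();
        }
    }
}
```

With a token that was never cancelled, a `TaskCanceledException` thrown by `read` reaches the fallback; with a cancelled token, the same exception type is rethrown to the caller.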
@@ -76,7 +82,11 @@ public sealed class ResilientConfigReader
_staleFlag.MarkFresh();
return result;
}
catch (Exception ex) when (ex is not OperationCanceledException)
// Catch all exceptions that are NOT genuine caller cancellations. A SQL command-level
// timeout surfaces as TaskCanceledException (derives from OperationCanceledException)
// but the caller's token is NOT cancelled — we must fall back to the sealed cache for
// that case, not propagate. Only rethrow if the caller actually requested cancellation.
catch (Exception ex) when (ex is not OperationCanceledException || !cancellationToken.IsCancellationRequested)
{
_logger.LogWarning(ex, "Central-DB read failed after retries; falling back to sealed cache for cluster {ClusterId}", clusterId);
// GenerationCacheUnavailableException surfaces intentionally — fails the caller's
@@ -145,9 +145,12 @@ BEGIN
(NodeId, CurrentGenerationId, LastAppliedAt, LastAppliedStatus, LastAppliedError, LastSeenAt)
VALUES (@NodeId, @GenerationId, SYSUTCDATETIME(), @Status, @Error, SYSUTCDATETIME());
-- Build DetailsJson via STRING_ESCAPE so a @Status containing a double-quote/backslash cannot
-- produce malformed JSON (which would fail CK_ConfigAuditLog_DetailsJson_IsJson and abort the
-- transaction) or inject extra JSON structure into the audit record.
INSERT dbo.ConfigAuditLog (Principal, EventType, NodeId, GenerationId, DetailsJson)
VALUES (@Caller, 'NodeApplied', @NodeId, @GenerationId,
CONCAT('{""status"":""', @Status, '""}'));
CONCAT('{""status"":""', STRING_ESCAPE(@Status, 'json'), '""}'));
END
";
@@ -267,7 +270,22 @@ BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
BEGIN TRANSACTION;
-- Transaction-nesting awareness: if a caller (e.g. sp_RollbackToGeneration) already
-- holds a transaction, we use SAVE TRANSACTION so our failure path rolls back only to
-- the savepoint instead of issuing a bare ROLLBACK that wipes the caller's transaction
-- (which sets @@TRANCOUNT = 0 and causes error 3902 on the caller's subsequent COMMIT).
DECLARE @OwnsTxn bit = 0;
DECLARE @SaveName nvarchar(32) = N'sp_PublishGeneration';
IF @@TRANCOUNT = 0
BEGIN
BEGIN TRANSACTION;
SET @OwnsTxn = 1;
END
ELSE
BEGIN
SAVE TRANSACTION sp_PublishGeneration;
END
DECLARE @Lock nvarchar(255) = N'OtOpcUa_Publish_' + @ClusterId;
DECLARE @LockResult int;
@@ -275,11 +293,24 @@ BEGIN
IF @LockResult < 0
BEGIN
RAISERROR('PublishConflict: another publish is in progress for cluster %s', 16, 1, @ClusterId);
ROLLBACK;
IF @OwnsTxn = 1 ROLLBACK;
ELSE ROLLBACK TRANSACTION sp_PublishGeneration;
RETURN;
END
EXEC dbo.sp_ValidateDraft @DraftGenerationId = @DraftGenerationId;
-- sp_ValidateDraft signals every rejection with RAISERROR(..., 16, 1). A severity-16 error is
-- NOT batch-aborting and SET XACT_ABORT ON does not abort the transaction for it, so without a
-- TRY/CATCH control would return here and the draft would publish despite failed validation.
-- Catch the validation error, roll back the publish transaction (only to our savepoint when a
-- caller owns the outer transaction), and re-raise so the caller sees the real validation failure.
BEGIN TRY
EXEC dbo.sp_ValidateDraft @DraftGenerationId = @DraftGenerationId;
END TRY
BEGIN CATCH
IF @OwnsTxn = 1 ROLLBACK;
ELSE ROLLBACK TRANSACTION sp_PublishGeneration;
THROW;
END CATCH
MERGE dbo.ExternalIdReservation AS tgt
USING (
@@ -310,15 +341,16 @@ BEGIN
IF @@ROWCOUNT = 0
BEGIN
RAISERROR('Draft %I64d for cluster %s not found (was it already published?)', 16, 1, @DraftGenerationId, @ClusterId);
ROLLBACK;
RAISERROR('Draft %I64d for cluster %s not in Draft status (was it already published?)', 16, 1, @DraftGenerationId, @ClusterId);
IF @OwnsTxn = 1 ROLLBACK;
ELSE ROLLBACK TRANSACTION sp_PublishGeneration;
RETURN;
END
INSERT dbo.ConfigAuditLog (Principal, EventType, ClusterId, GenerationId)
VALUES (SUSER_SNAME(), 'Published', @ClusterId, @DraftGenerationId);
COMMIT;
IF @OwnsTxn = 1 COMMIT;
END
";
@@ -369,9 +401,11 @@ BEGIN
EXEC dbo.sp_PublishGeneration @ClusterId = @ClusterId, @DraftGenerationId = @NewGenId, @Notes = @Notes;
-- @TargetGenerationId is a bigint, but build the JSON value via an explicit numeric CONVERT so
-- the emitted token is always a bare JSON number, never reliant on implicit string coercion.
INSERT dbo.ConfigAuditLog (Principal, EventType, ClusterId, GenerationId, DetailsJson)
VALUES (SUSER_SNAME(), 'RolledBack', @ClusterId, @NewGenId,
CONCAT('{""rolledBackTo"":', @TargetGenerationId, '}'));
CONCAT('{""rolledBackTo"":', CONVERT(nvarchar(20), CONVERT(bigint, @TargetGenerationId)), '}'));
COMMIT;
END
@@ -464,9 +498,12 @@ BEGIN
RETURN;
END
-- Escape both caller-supplied values via STRING_ESCAPE so quotes/backslashes cannot break the
-- JSON document or inject additional structure into the audit record.
INSERT dbo.ConfigAuditLog (Principal, EventType, DetailsJson)
VALUES (SUSER_SNAME(), 'ExternalIdReleased',
CONCAT('{""kind"":""', @Kind, '"",""value"":""', @Value, '""}'));
CONCAT('{""kind"":""', STRING_ESCAPE(@Kind, 'json'),
'"",""value"":""', STRING_ESCAPE(@Value, 'json'), '""}'));
END
";
}
@@ -0,0 +1,120 @@
using Microsoft.EntityFrameworkCore.Migrations;
#nullable disable
namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
{
/// <summary>
/// Admin-008: adds <c>@ReleasedBy</c> parameter to
/// <c>dbo.sp_ReleaseExternalIdReservation</c> so the operator principal name (the LDAP
/// sign-in) is recorded in <c>ExternalIdReservation.ReleasedBy</c> and the
/// <c>ConfigAuditLog.Principal</c> column.
///
/// Prior to this migration the proc used <c>SUSER_SNAME()</c> for both columns, which
/// recorded the shared SQL service account rather than the Admin-UI operator who performed
/// the release — making the audit trail useless for attribution. The stored proc now
/// accepts <c>@ReleasedBy nvarchar(128)</c> and uses it for both columns; validation
/// rejects a null/empty value the same way <c>@ReleaseReason</c> is validated.
/// </summary>
/// <inheritdoc />
public partial class AddReleasedByToReleaseExternalIdReservation : Migration
{
/// <inheritdoc />
protected override void Up(MigrationBuilder migrationBuilder)
{
migrationBuilder.Sql(Procs.ReleaseExternalIdReservationV2);
}
/// <inheritdoc />
protected override void Down(MigrationBuilder migrationBuilder)
{
migrationBuilder.Sql(Procs.ReleaseExternalIdReservationV1);
}
private static class Procs
{
/// <summary>V2 — accepts <c>@ReleasedBy</c> for proper operator attribution.</summary>
public const string ReleaseExternalIdReservationV2 = @"
CREATE OR ALTER PROCEDURE dbo.sp_ReleaseExternalIdReservation
@Kind nvarchar(16),
@Value nvarchar(64),
@ReleaseReason nvarchar(512),
@ReleasedBy nvarchar(128)
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
IF @ReleaseReason IS NULL OR LEN(@ReleaseReason) = 0
BEGIN
RAISERROR('ReleaseReason is required', 16, 1);
RETURN;
END
IF @ReleasedBy IS NULL OR LEN(@ReleasedBy) = 0
BEGIN
RAISERROR('ReleasedBy is required', 16, 1);
RETURN;
END
UPDATE dbo.ExternalIdReservation
SET ReleasedAt = SYSUTCDATETIME(),
ReleasedBy = @ReleasedBy,
ReleaseReason = @ReleaseReason
WHERE Kind = @Kind AND Value = @Value AND ReleasedAt IS NULL;
IF @@ROWCOUNT = 0
BEGIN
RAISERROR('No active reservation found for (%s, %s)', 16, 1, @Kind, @Value);
RETURN;
END
-- Escape all caller-supplied values via STRING_ESCAPE so quotes/backslashes cannot break the
-- JSON document or inject additional structure into the audit record.
INSERT dbo.ConfigAuditLog (Principal, EventType, DetailsJson)
VALUES (@ReleasedBy, 'ExternalIdReleased',
CONCAT('{""kind"":""', STRING_ESCAPE(@Kind, 'json'),
'"",""value"":""', STRING_ESCAPE(@Value, 'json'), '""}'));
END
";
/// <summary>V1 — original proc (uses SUSER_SNAME() for attribution). Restored on Down().</summary>
public const string ReleaseExternalIdReservationV1 = @"
CREATE OR ALTER PROCEDURE dbo.sp_ReleaseExternalIdReservation
@Kind nvarchar(16),
@Value nvarchar(64),
@ReleaseReason nvarchar(512)
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
IF @ReleaseReason IS NULL OR LEN(@ReleaseReason) = 0
BEGIN
RAISERROR('ReleaseReason is required', 16, 1);
RETURN;
END
UPDATE dbo.ExternalIdReservation
SET ReleasedAt = SYSUTCDATETIME(),
ReleasedBy = SUSER_SNAME(),
ReleaseReason = @ReleaseReason
WHERE Kind = @Kind AND Value = @Value AND ReleasedAt IS NULL;
IF @@ROWCOUNT = 0
BEGIN
RAISERROR('No active reservation found for (%s, %s)', 16, 1, @Kind, @Value);
RETURN;
END
-- Escape both caller-supplied values via STRING_ESCAPE so quotes/backslashes cannot break the
-- JSON document or inject additional structure into the audit record.
INSERT dbo.ConfigAuditLog (Principal, EventType, DetailsJson)
VALUES (SUSER_SNAME(), 'ExternalIdReleased',
CONCAT('{""kind"":""', STRING_ESCAPE(@Kind, 'json'),
'"",""value"":""', STRING_ESCAPE(@Value, 'json'), '""}'));
END
";
}
}
}
@@ -11,6 +11,18 @@ public sealed class DraftSnapshot
public required long GenerationId { get; init; }
public required string ClusterId { get; init; }
/// <summary>
/// Cluster's Enterprise segment (UNS level 1). When set, <see cref="DraftValidator"/> uses
/// the actual length for path-length checks instead of a conservative 32-char upper bound.
/// </summary>
public string? Enterprise { get; init; }
/// <summary>
/// Cluster's Site segment (UNS level 2). When set, <see cref="DraftValidator"/> uses the
/// actual length for path-length checks instead of a conservative 32-char upper bound.
/// </summary>
public string? Site { get; init; }
public IReadOnlyList<Namespace> Namespaces { get; init; } = [];
public IReadOnlyList<DriverInstance> DriverInstances { get; init; } = [];
public IReadOnlyList<Device> Devices { get; init; } = [];
@@ -59,8 +59,13 @@ public static class DraftValidator
/// <summary>Cluster.Enterprise + Site + area + line + equipment + 4 slashes ≤ 200 chars.</summary>
private static void ValidatePathLength(DraftSnapshot draft, List<ValidationError> errors)
{
// The cluster row isn't in the snapshot — we assume caller pre-validated Enterprise+Site
// length and bound them as constants <= 64 chars each. Here we validate the dynamic portion.
// Use actual Enterprise/Site lengths when the snapshot carries them (populated by
// DraftValidationService from the ServerCluster row). Fall back to a conservative
// 32-char upper bound per segment when not supplied — over-penalises short values
// but never under-penalises long ones, which is acceptable for the fallback case.
var enterpriseLen = draft.Enterprise?.Length ?? 32;
var siteLen = draft.Site?.Length ?? 32;
var areaById = draft.UnsAreas.ToDictionary(a => a.UnsAreaId);
var lineById = draft.UnsLines.ToDictionary(l => l.UnsLineId);
@@ -69,8 +74,7 @@ public static class DraftValidator
if (!lineById.TryGetValue(eq.UnsLineId!, out var line)) continue;
if (!areaById.TryGetValue(line.UnsAreaId, out var area)) continue;
// rough upper bound: Enterprise+Site at most 32+32; add dynamic segments + 4 slashes
var len = 32 + 32 + area.Name.Length + line.Name.Length + eq.Name.Length + 4;
var len = enterpriseLen + siteLen + area.Name.Length + line.Name.Length + eq.Name.Length + 4;
if (len > MaxPathLength)
errors.Add(new("PathTooLong",
$"Equipment path exceeds {MaxPathLength} chars (approx {len})",
@@ -1,3 +1,4 @@
using System.Collections;
using System.Collections.Concurrent;
namespace ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -61,7 +62,7 @@ public sealed class PollGroupEngine : IAsyncDisposable
var handle = new PollSubscriptionHandle(id);
var state = new SubscriptionState(handle, [.. fullReferences], interval, cts);
_subscriptions[id] = state;
_ = Task.Run(() => PollLoopAsync(state, cts.Token), cts.Token);
state.LoopTask = Task.Run(() => PollLoopAsync(state, cts.Token));
return handle;
}
@@ -71,13 +72,27 @@ public sealed class PollGroupEngine : IAsyncDisposable
{
if (handle is PollSubscriptionHandle h && _subscriptions.TryRemove(h.Id, out var state))
{
try { state.Cts.Cancel(); } catch { }
state.Cts.Dispose();
StopState(state);
return true;
}
return false;
}
private static void StopState(SubscriptionState state)
{
try { state.Cts.Cancel(); } catch { }
// Await the loop task (with a generous timeout) before disposing the CTS so:
// (a) no _onChange callback fires after the caller considers the engine torn down, and
// (b) the CTS is not disposed while Task.Delay is still holding a reference to its token,
// which can turn OperationCanceledException into ObjectDisposedException.
var task = state.LoopTask;
if (task is not null)
{
try { task.Wait(TimeSpan.FromSeconds(5)); } catch { }
}
state.Cts.Dispose();
}
/// <summary>Snapshot of active subscription count — exposed for driver diagnostics.</summary>
public int ActiveSubscriptionCount => _subscriptions.Count;
@@ -103,13 +118,22 @@ public sealed class PollGroupEngine : IAsyncDisposable
private async Task PollOnceAsync(SubscriptionState state, bool forceRaise, CancellationToken ct)
{
var snapshots = await _reader(state.TagReferences, ct).ConfigureAwait(false);
// Core.Abstractions-002: validate the reader contract before indexing. A reader that
// returns fewer snapshots than references would silently stall the subscription; surface
// the violation immediately with a descriptive exception instead.
if (snapshots.Count != state.TagReferences.Count)
throw new InvalidOperationException(
$"Reader contract violation: expected {state.TagReferences.Count} snapshots but received {snapshots.Count}. " +
"The reader delegate must return one snapshot per input reference in input order.");
for (var i = 0; i < state.TagReferences.Count; i++)
{
var tagRef = state.TagReferences[i];
var current = snapshots[i];
var lastSeen = state.LastValues.TryGetValue(tagRef, out var prev) ? prev : default;
if (forceRaise || !Equals(lastSeen?.Value, current.Value) || lastSeen?.StatusCode != current.StatusCode)
if (forceRaise || ValuesAreDifferent(lastSeen?.Value, current.Value) || lastSeen?.StatusCode != current.StatusCode)
{
state.LastValues[tagRef] = current;
_onChange(state.Handle, tagRef, current);
@@ -117,16 +141,44 @@ public sealed class PollGroupEngine : IAsyncDisposable
}
}
/// <summary>Cancel every active subscription. Idempotent.</summary>
public ValueTask DisposeAsync()
/// <summary>
/// Returns <c>true</c> when <paramref name="previous"/> and <paramref name="current"/>
/// represent different values. Array values are compared structurally
/// (element-by-element) so that a driver producing a fresh array instance on every poll
/// does not trigger spurious change events when the contents are identical.
/// </summary>
private static bool ValuesAreDifferent(object? previous, object? current)
{
if (previous is Array prevArr && current is Array currArr)
return !StructuralComparisons.StructuralEqualityComparer.Equals(prevArr, currArr);
return !Equals(previous, current);
}
/// <summary>Cancel every active subscription and await all loop tasks. Idempotent.</summary>
public async ValueTask DisposeAsync()
{
// Cancel all loops first so they can all start winding down in parallel.
foreach (var state in _subscriptions.Values)
{
try { state.Cts.Cancel(); } catch { }
}
// Await every loop task before disposing CTSs, ensuring no callback fires after disposal.
var waitTasks = _subscriptions.Values
.Select(s => s.LoopTask ?? Task.CompletedTask)
.ToArray();
if (waitTasks.Length > 0)
{
try { await Task.WhenAll(waitTasks).WaitAsync(TimeSpan.FromSeconds(5)).ConfigureAwait(false); }
catch { }
}
foreach (var state in _subscriptions.Values)
{
state.Cts.Dispose();
}
_subscriptions.Clear();
return ValueTask.CompletedTask;
}
private sealed record SubscriptionState(
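The switch from `Equals` to `ValuesAreDifferent` matters because `object.Equals` on two arrays compares references, so a driver that allocates a fresh array per poll always looks changed. The sketch below mirrors the helper from the hunk above in a standalone class (`ValueComparisonDemo` is a hypothetical name) so the behaviour can be checked in isolation.

```csharp
using System;
using System.Collections;

public static class ValueComparisonDemo
{
    // Same logic as the private helper in PollGroupEngine, made public for illustration.
    public static bool ValuesAreDifferent(object? previous, object? current)
    {
        // Arrays: compare element by element instead of by reference.
        if (previous is Array prevArr && current is Array currArr)
            return !StructuralComparisons.StructuralEqualityComparer.Equals(prevArr, currArr);

        // Scalars and everything else: ordinary equality (handles nulls on either side).
        return !Equals(previous, current);
    }
}
```

Two distinct `int[]` instances holding `1, 2, 3` are reported as unchanged by this helper, while the plain `Equals` call it replaces would report them as different on every poll.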
@@ -137,6 +189,14 @@ public sealed class PollGroupEngine : IAsyncDisposable
{
public ConcurrentDictionary<string, DataValueSnapshot> LastValues { get; }
= new(StringComparer.OrdinalIgnoreCase);
/// <summary>
/// The background poll-loop task. Assigned immediately after creation in
/// <see cref="Subscribe"/>; awaited during <see cref="Unsubscribe"/> /
/// <see cref="DisposeAsync"/> so disposal is deterministic and no
/// <c>_onChange</c> callback can fire after the caller tears down the subscription.
/// </summary>
public Task? LoopTask { get; set; }
}
private sealed record PollSubscriptionHandle(long Id) : ISubscriptionHandle
@@ -17,7 +17,8 @@ namespace ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian;
/// Which state transition this event represents — "Activated" / "Cleared" /
/// "Acknowledged" / "Confirmed" / "Shelved" / "Unshelved" / "Disabled" / "Enabled" /
/// "CommentAdded". Free-form string because different alarm sources use different
/// vocabularies; the Galaxy.Host handler maps to the historian's enum on the wire.
/// vocabularies; the Wonderware historian sidecar (<c>WonderwareHistorianClient</c>)
/// maps to the historian's enum on the wire.
/// </param>
/// <param name="Message">Fully-rendered message text — template tokens already resolved upstream.</param>
/// <param name="User">Operator who triggered the transition. "system" for engine-driven events (shelving expiry, predicate change).</param>
@@ -2,9 +2,9 @@ namespace ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian;
/// <summary>
/// The historian sink contract — where qualifying alarm events land. Phase 7 plan
/// decision #17: ingestion routes through Galaxy.Host's pipe so we reuse the
/// already-loaded <c>aahClientManaged</c> DLLs without loading 32-bit native code
/// in the main .NET 10 server. Tests use an in-memory fake; production uses
/// decision #17: ingestion routes through the Wonderware historian sidecar
/// (<c>WonderwareHistorianClient</c>), which owns the <c>aahClientManaged</c> DLLs
/// and 32-bit constraints. Tests use an in-memory fake; production uses
/// <see cref="SqliteStoreAndForwardSink"/>.
/// </summary>
/// <remarks>
@@ -45,13 +45,25 @@ public sealed class NullAlarmHistorianSink : IAlarmHistorianSink
}
/// <summary>Diagnostic snapshot surfaced to the Admin UI + /healthz endpoints.</summary>
/// <param name="QueueDepth">Non-dead-lettered rows waiting to be drained to the historian.</param>
/// <param name="DeadLetterDepth">Rows that have been permanently failed or have corrupt payloads; retained until the retention window expires.</param>
/// <param name="LastDrainUtc">UTC timestamp of the most recent drain attempt, or <c>null</c> if no drain has run yet.</param>
/// <param name="LastSuccessUtc">UTC timestamp of the most recent drain tick that acknowledged at least one row, or <c>null</c> if none.</param>
/// <param name="LastError">Message from the most recent writer exception or cardinality violation, cleared on the next successful batch.</param>
/// <param name="DrainState">Current state of the drain worker.</param>
/// <param name="EvictedCount">
/// Lifetime count of non-dead-lettered rows discarded because the queue reached
/// its configured capacity. Non-zero values indicate that accepted alarm events
/// were dropped before reaching the historian — operator attention required.
/// </param>
public sealed record HistorianSinkStatus(
long QueueDepth,
long DeadLetterDepth,
DateTime? LastDrainUtc,
DateTime? LastSuccessUtc,
string? LastError,
HistorianDrainState DrainState);
HistorianDrainState DrainState,
long EvictedCount = 0);
/// <summary>Where the drain worker is in its state machine.</summary>
public enum HistorianDrainState
@@ -62,7 +74,7 @@ public enum HistorianDrainState
BackingOff,
}
/// <summary>Signaled by the Galaxy.Host-side handler when it fails a batch — drain worker uses this to decide retry cadence.</summary>
/// <summary>Returned by the Wonderware historian sidecar per event — drain worker uses this to decide retry cadence.</summary>
public enum HistorianWriteOutcome
{
/// <summary>Successfully persisted to the historian. Remove from queue.</summary>
@@ -73,7 +85,7 @@ public enum HistorianWriteOutcome
PermanentFail,
}
/// <summary>What the drain worker delegates writes to — Stream G wires this to the Galaxy.Host IPC client.</summary>
/// <summary>What the drain worker delegates writes to — production is <c>WonderwareHistorianClient</c> (the Wonderware historian sidecar).</summary>
public interface IAlarmHistorianWriter
{
/// <summary>Push a batch of events to the historian. Returns one outcome per event, same order.</summary>
@@ -6,9 +6,10 @@ namespace ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian;
/// <summary>
/// Phase 7 plan decisions #16-#17 implementation: durable SQLite queue on the node
/// absorbs every qualifying alarm event, a drain worker batches rows to Galaxy.Host
/// via <see cref="IAlarmHistorianWriter"/> on an exponential-backoff cadence, and
/// operator acks never block on the historian being reachable.
/// absorbs every qualifying alarm event, a drain worker batches rows to the
/// Wonderware historian sidecar via <see cref="IAlarmHistorianWriter"/> on an
/// exponential-backoff cadence, and operator acks never block on the historian
/// being reachable.
/// </summary>
/// <remarks>
/// <para>
@@ -28,12 +29,18 @@ namespace ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian;
/// Dead-lettered rows stay in place for the configured retention window (default
/// 30 days per Phase 7 plan decision #21) so operators can inspect + manually
/// retry before the sweeper purges them. Regular queue capacity is bounded —
/// overflow evicts the oldest non-dead-lettered rows with a WARN log.
/// overflow evicts the oldest non-dead-lettered rows with a WARN log. The
/// durability guarantee is therefore bounded by <see cref="DefaultCapacity"/>:
/// under a sustained historian outage, accepted events may be evicted before
/// delivery. The <see cref="HistorianSinkStatus.EvictedCount"/> counter makes
/// overflow visible to operators without requiring the WARN log to be scraped.
/// </para>
/// <para>
/// Drain runs on a shared <see cref="System.Threading.Timer"/>. Exponential
/// backoff on <see cref="HistorianWriteOutcome.RetryPlease"/>: 1s → 2s → 5s →
/// 15s → 60s cap. <see cref="HistorianWriteOutcome.PermanentFail"/> rows flip
/// Drain runs on a self-rescheduling one-shot <see cref="System.Threading.Timer"/>.
/// Exponential backoff on <see cref="HistorianWriteOutcome.RetryPlease"/>:
/// 1s → 2s → 5s → 15s → 60s cap — the backoff is applied to the timer's next
/// due-time, so a historian outage genuinely slows the drain cadence.
/// <see cref="HistorianWriteOutcome.PermanentFail"/> rows flip
/// the <c>DeadLettered</c> flag on the individual row; neighbors in the batch
/// still retry on their own cadence.
/// </para>
@@ -63,12 +70,22 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
private readonly SemaphoreSlim _drainGate = new(1, 1);
private Timer? _drainTimer;
private TimeSpan _tickInterval;
private int _backoffIndex;
private bool _disposed;
// Core.AlarmHistorian-005: status fields written by the drain timer thread and
// read concurrently by GetStatus() / health-check threads. Guard all reads and
// writes with this lock so the Admin UI never observes a torn or stale value.
private readonly object _statusLock = new();
private DateTime? _lastDrainUtc;
private DateTime? _lastSuccessUtc;
private string? _lastError;
private HistorianDrainState _drainState = HistorianDrainState.Idle;
private bool _disposed;
// Core.AlarmHistorian-009: lifetime counter of rows evicted due to capacity overflow.
// Surfaces in HistorianSinkStatus so operators can see data-loss events without
// having to scrape the WARN log.
private long _evictedCount;
public SqliteStoreAndForwardSink(
string databasePath,
@@ -87,32 +104,126 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
_capacity = capacity > 0 ? capacity : throw new ArgumentOutOfRangeException(nameof(capacity));
_deadLetterRetention = deadLetterRetention ?? DefaultDeadLetterRetention;
_clock = clock ?? (() => DateTime.UtcNow);
_connectionString = $"Data Source={databasePath}";
// DefaultTimeout gives ADO.NET command-level retry; the PRAGMA busy_timeout
// applied in OpenConnection backs it with SQLite's own busy-handler so an
// enqueue/drain collision waits out the file lock instead of throwing
// SQLITE_BUSY immediately (Core.AlarmHistorian-004).
_connectionString = new SqliteConnectionStringBuilder
{
DataSource = databasePath,
DefaultTimeout = 5,
}.ToString();
InitializeSchema();
}
/// <summary>
/// Open a connection with the busy timeout + WAL journal applied. SQLite
/// serializes writers with a file lock; the busy_timeout lets a writer wait
/// out a competing lock (default is 0 — fail fast), and WAL lets readers and
/// the single writer proceed without blocking each other.
/// </summary>
private SqliteConnection OpenConnection()
{
var conn = new SqliteConnection(_connectionString);
conn.Open();
ApplyPragmas(conn);
return conn;
}
/// <summary>Apply busy_timeout + WAL pragmas to an already-open connection (sync).</summary>
private static void ApplyPragmas(SqliteConnection conn)
{
using var pragma = conn.CreateCommand();
pragma.CommandText = "PRAGMA busy_timeout=5000; PRAGMA journal_mode=WAL;";
pragma.ExecuteNonQuery();
}
/// <summary>Apply busy_timeout + WAL pragmas to an already-open connection (async).</summary>
private static async Task ApplyPragmasAsync(SqliteConnection conn, CancellationToken ct)
{
using var pragma = conn.CreateCommand();
pragma.CommandText = "PRAGMA busy_timeout=5000; PRAGMA journal_mode=WAL;";
await pragma.ExecuteNonQueryAsync(ct).ConfigureAwait(false);
}
/// <summary>
/// Start the background drain worker. Not started automatically so tests can
/// drive <see cref="DrainOnceAsync"/> deterministically.
/// </summary>
/// <remarks>
/// The worker is a self-rescheduling one-shot <see cref="Timer"/>: after each
/// drain it sets its next due-time to <c>max(tickInterval, CurrentBackoff)</c>
/// so a historian outage actually slows the cadence down the backoff ladder
/// (Core.AlarmHistorian-002). The callback body is fully guarded — a fault in
/// <see cref="DrainOnceAsync"/> is logged and recorded into
/// <see cref="GetStatus"/> rather than being lost as an unobserved task
/// exception (Core.AlarmHistorian-006).
/// </remarks>
public void StartDrainLoop(TimeSpan tickInterval)
{
if (_disposed) throw new ObjectDisposedException(nameof(SqliteStoreAndForwardSink));
_tickInterval = tickInterval;
_drainTimer?.Dispose();
_drainTimer = new Timer(_ => _ = DrainOnceAsync(CancellationToken.None),
null, tickInterval, tickInterval);
// One-shot: dueTime = tickInterval, period = Infinite. RescheduleDrain re-arms
// it after every tick once the backoff-aware delay is known.
_drainTimer = new Timer(DrainTimerCallback, null, tickInterval, Timeout.InfiniteTimeSpan);
}
public Task EnqueueAsync(AlarmHistorianEvent evt, CancellationToken cancellationToken)
private async void DrainTimerCallback(object? _)
{
try
{
await DrainOnceAsync(CancellationToken.None).ConfigureAwait(false);
}
catch (Exception ex)
{
// This catch is load-bearing: an exception escaping an async-void timer
// callback is rethrown on the thread pool and would crash the process
// rather than being logged. Record it instead so the Admin UI / health
// check sees the stalled drain.
lock (_statusLock)
{
_lastError = ex.Message;
_drainState = HistorianDrainState.BackingOff;
}
_logger.Error(ex, "Historian drain tick faulted; will retry on next tick");
}
finally
{
RescheduleDrain();
}
}
/// <summary>Re-arm the one-shot drain timer honoring the current backoff window.</summary>
private void RescheduleDrain()
{
if (_disposed) return;
HistorianDrainState state;
lock (_statusLock) { state = _drainState; }
// While backing off, wait out the full ladder delay; otherwise the steady
// tick cadence. Never faster than tickInterval.
var delay = state == HistorianDrainState.BackingOff
? (CurrentBackoff > _tickInterval ? CurrentBackoff : _tickInterval)
: _tickInterval;
try { _drainTimer?.Change(delay, Timeout.InfiniteTimeSpan); }
catch (ObjectDisposedException) { /* raced with Dispose — nothing to re-arm */ }
}
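Review note: the self-rescheduling one-shot timer described in the StartDrainLoop remarks can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the commit's code: `BackoffTicker`, `currentBackoff`, and `work` are hypothetical names, and the real sink awaits DrainOnceAsync rather than invoking a synchronous Action.

```csharp
using System;
using System.Threading;

// Hypothetical helper, for illustration only.
public sealed class BackoffTicker : IDisposable
{
    private readonly TimeSpan _tickInterval;
    private readonly Func<TimeSpan> _currentBackoff; // assumed to return TimeSpan.Zero when healthy
    private readonly Action _work;
    private readonly Timer _timer;
    private volatile bool _disposed;

    public BackoffTicker(TimeSpan tickInterval, Func<TimeSpan> currentBackoff, Action work)
    {
        _tickInterval = tickInterval;
        _currentBackoff = currentBackoff;
        _work = work;
        // Create the timer disarmed, then arm it, so the callback can never
        // observe a not-yet-assigned _timer field.
        _timer = new Timer(OnTick, null, Timeout.InfiniteTimeSpan, Timeout.InfiniteTimeSpan);
        _timer.Change(tickInterval, Timeout.InfiniteTimeSpan);
    }

    private void OnTick(object? state)
    {
        try { _work(); }
        catch (Exception ex) { Console.Error.WriteLine($"tick faulted: {ex.Message}"); }
        finally { Rearm(); }
    }

    private void Rearm()
    {
        if (_disposed) return;
        // Next due-time = max(tickInterval, current backoff); period stays Infinite,
        // so ticks can never overlap.
        var backoff = _currentBackoff();
        var delay = backoff > _tickInterval ? backoff : _tickInterval;
        try { _timer.Change(delay, Timeout.InfiniteTimeSpan); }
        catch (ObjectDisposedException) { /* raced with Dispose */ }
    }

    public void Dispose()
    {
        _disposed = true;
        _timer.Dispose();
    }
}
```

Because the period is always Infinite, a slow tick delays the next one instead of stacking callbacks, which is the property the periodic timer in the removed lines lacked.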
// Core.AlarmHistorian-003: use the async SQLite APIs and honor the
// cancellationToken throughout. Caveat: Microsoft.Data.Sqlite's async surface
// (OpenAsync / ExecuteNonQueryAsync) is a thin wrapper over the synchronous
// path and, per the library's documentation, executes synchronously, so a
// file-lock or disk-write wait still blocks whichever thread runs the call
// unless the caller offloads it (for example via Task.Run).
public async Task EnqueueAsync(AlarmHistorianEvent evt, CancellationToken cancellationToken)
{
if (evt is null) throw new ArgumentNullException(nameof(evt));
if (_disposed) throw new ObjectDisposedException(nameof(SqliteStoreAndForwardSink));
using var conn = new SqliteConnection(_connectionString);
conn.Open();
await conn.OpenAsync(cancellationToken).ConfigureAwait(false);
await ApplyPragmasAsync(conn, cancellationToken).ConfigureAwait(false);
EnforceCapacity(conn);
await EnforceCapacityAsync(conn, cancellationToken).ConfigureAwait(false);
using var cmd = conn.CreateCommand();
cmd.CommandText = """
@@ -122,8 +233,7 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
cmd.Parameters.AddWithValue("$alarmId", evt.AlarmId);
cmd.Parameters.AddWithValue("$enqueued", _clock().ToString("O"));
cmd.Parameters.AddWithValue("$payload", JsonSerializer.Serialize(evt));
cmd.ExecuteNonQuery();
return Task.CompletedTask;
await cmd.ExecuteNonQueryAsync(cancellationToken).ConfigureAwait(false);
}
/// <summary>
@@ -138,14 +248,42 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
if (!await _drainGate.WaitAsync(0, ct).ConfigureAwait(false)) return;
try
{
_drainState = HistorianDrainState.Draining;
_lastDrainUtc = _clock();
lock (_statusLock)
{
_drainState = HistorianDrainState.Draining;
_lastDrainUtc = _clock();
}
PurgeAgedDeadLetters();
var (rowIds, events) = ReadBatch();
if (rowIds.Count == 0)
var batch = ReadBatch();
if (batch.Count == 0)
{
_drainState = HistorianDrainState.Idle;
lock (_statusLock) { _drainState = HistorianDrainState.Idle; }
return;
}
// A null/un-deserializable payload can never succeed — dead-letter it
// immediately for its own RowId so it cannot stall the queue head, and
// exclude it from the batch handed to the writer.
var corruptRowIds = batch.Where(r => r.Event is null).Select(r => r.RowId).ToList();
var liveRows = batch.Where(r => r.Event is not null).ToList();
var events = liveRows.Select(r => r.Event!).ToList();
if (corruptRowIds.Count > 0)
{
using var corruptConn = OpenConnection();
using var corruptTx = corruptConn.BeginTransaction();
foreach (var rowId in corruptRowIds)
DeadLetterRow(corruptConn, corruptTx, rowId, $"corrupt payload at {_clock():O}");
corruptTx.Commit();
_logger.Warning(
"Dead-lettered {Count} historian queue row(s) with un-deserializable payload",
corruptRowIds.Count);
}
if (events.Count == 0)
{
lock (_statusLock) { _drainState = HistorianDrainState.Idle; }
return;
}
@@ -153,7 +291,7 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
try
{
outcomes = await _writer.WriteBatchAsync(events, ct).ConfigureAwait(false);
_lastError = null;
lock (_statusLock) { _lastError = null; }
}
catch (OperationCanceledException)
{
@@ -162,24 +300,42 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
catch (Exception ex)
{
// Writer-side exception — treat entire batch as RetryPlease.
_lastError = ex.Message;
lock (_statusLock)
{
_lastError = ex.Message;
_drainState = HistorianDrainState.BackingOff;
}
_logger.Warning(ex, "Historian writer threw on batch of {Count}; deferring retry", events.Count);
BumpBackoff();
_drainState = HistorianDrainState.BackingOff;
return;
}
// Core.AlarmHistorian-007: a cardinality mismatch is a writer contract
// violation, and the events may already have been persisted. Rather than
// throwing (which, before the -006 fix, was swallowed and left _drainState
// stale), treat it as a transient batch failure so the rows stay queued
// and the backoff state is visible to the operator. A deterministic
// mismatch will stall those rows until an operator intervenes or the writer
// is fixed, which is far safer than re-throwing into a fire-and-forget timer.
if (outcomes.Count != events.Count)
throw new InvalidOperationException(
$"Writer returned {outcomes.Count} outcomes for {events.Count} events — expected 1:1");
{
var msg = $"Writer returned {outcomes.Count} outcomes for {events.Count} events — expected 1:1; treating as batch retry";
lock (_statusLock)
{
_lastError = msg;
_drainState = HistorianDrainState.BackingOff;
}
_logger.Warning("Historian writer contract violation: {Msg}", msg);
BumpBackoff();
return;
}
using var conn = new SqliteConnection(_connectionString);
conn.Open();
using var conn = OpenConnection();
using var tx = conn.BeginTransaction();
for (var i = 0; i < outcomes.Count; i++)
{
var outcome = outcomes[i];
var rowId = rowIds[i];
var rowId = liveRows[i].RowId;
switch (outcome)
{
case HistorianWriteOutcome.Ack:
@@ -196,18 +352,20 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
tx.Commit();
var acks = outcomes.Count(o => o == HistorianWriteOutcome.Ack);
if (acks > 0) _lastSuccessUtc = _clock();
lock (_statusLock)
{
if (acks > 0) _lastSuccessUtc = _clock();
if (outcomes.Any(o => o == HistorianWriteOutcome.RetryPlease))
_drainState = HistorianDrainState.BackingOff;
else
_drainState = HistorianDrainState.Idle;
}
if (outcomes.Any(o => o == HistorianWriteOutcome.RetryPlease))
{
BumpBackoff();
_drainState = HistorianDrainState.BackingOff;
}
else
{
ResetBackoff();
_drainState = HistorianDrainState.Idle;
}
}
finally
{
@@ -217,8 +375,7 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
public HistorianSinkStatus GetStatus()
{
using var conn = new SqliteConnection(_connectionString);
conn.Open();
using var conn = OpenConnection();
long queued;
long deadlettered;
@@ -233,31 +390,52 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
deadlettered = (long)(cmd.ExecuteScalar() ?? 0L);
}
// Core.AlarmHistorian-005: snapshot status fields atomically under the lock
// so the Admin UI never sees a torn DateTime? or stale DrainState.
DateTime? lastDrain, lastSuccess;
string? lastError;
HistorianDrainState drainState;
long evicted;
lock (_statusLock)
{
lastDrain = _lastDrainUtc;
lastSuccess = _lastSuccessUtc;
lastError = _lastError;
drainState = _drainState;
evicted = _evictedCount;
}
return new HistorianSinkStatus(
QueueDepth: queued,
DeadLetterDepth: deadlettered,
LastDrainUtc: _lastDrainUtc,
LastSuccessUtc: _lastSuccessUtc,
LastError: _lastError,
DrainState: _drainState);
LastDrainUtc: lastDrain,
LastSuccessUtc: lastSuccess,
LastError: lastError,
DrainState: drainState,
EvictedCount: evicted);
}
/// <summary>Operator action from Admin UI — retry every dead-lettered row. Non-cascading: they rejoin the regular queue + get a fresh backoff.</summary>
public int RetryDeadLettered()
{
using var conn = new SqliteConnection(_connectionString);
conn.Open();
using var conn = OpenConnection();
using var cmd = conn.CreateCommand();
cmd.CommandText = "UPDATE Queue SET DeadLettered = 0, AttemptCount = 0, LastError = NULL WHERE DeadLettered = 1";
return cmd.ExecuteNonQuery();
}
private (List<long> rowIds, List<AlarmHistorianEvent> events) ReadBatch()
/// <summary>
/// One queued row paired with its deserialized event. <see cref="Event"/> is
/// <c>null</c> when the row's <c>PayloadJson</c> is corrupt or un-deserializable —
/// the <see cref="RowId"/> always stays bound to its own row so outcomes can
/// never be mapped to the wrong row.
/// </summary>
private readonly record struct QueueRow(long RowId, AlarmHistorianEvent? Event);
private List<QueueRow> ReadBatch()
{
var rowIds = new List<long>();
var events = new List<AlarmHistorianEvent>();
using var conn = new SqliteConnection(_connectionString);
conn.Open();
var rows = new List<QueueRow>();
using var conn = OpenConnection();
using var cmd = conn.CreateCommand();
cmd.CommandText = """
SELECT RowId, PayloadJson FROM Queue
@@ -269,12 +447,21 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
using var reader = cmd.ExecuteReader();
while (reader.Read())
{
rowIds.Add(reader.GetInt64(0));
var rowId = reader.GetInt64(0);
var payload = reader.GetString(1);
var evt = JsonSerializer.Deserialize<AlarmHistorianEvent>(payload);
if (evt is not null) events.Add(evt);
AlarmHistorianEvent? evt;
try
{
evt = JsonSerializer.Deserialize<AlarmHistorianEvent>(payload);
}
catch (JsonException)
{
// Malformed JSON — carry a null event so the caller dead-letters this row.
evt = null;
}
rows.Add(new QueueRow(rowId, evt));
}
return (rowIds, events);
return rows;
}
private static void DeleteRow(SqliteConnection conn, SqliteTransaction tx, long rowId)
@@ -341,16 +528,50 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
cmd.Parameters.AddWithValue("$n", toEvict);
cmd.ExecuteNonQuery();
}
// Core.AlarmHistorian-009: increment the lifetime eviction counter so the
// Admin UI / health check can report overflow without requiring log scraping.
lock (_statusLock) { _evictedCount += toEvict; }
_logger.Warning(
"Historian queue at capacity {Cap} — evicted {Count} oldest row(s) to make room",
_capacity, toEvict);
"Historian queue at capacity {Cap} — evicted {Count} oldest row(s) to make room (lifetime evictions: {Total})",
_capacity, toEvict, _evictedCount);
}
// Async variant used by EnqueueAsync (Core.AlarmHistorian-003).
private async Task EnforceCapacityAsync(SqliteConnection conn, CancellationToken ct)
{
long count;
using (var cmd = conn.CreateCommand())
{
cmd.CommandText = "SELECT COUNT(*) FROM Queue WHERE DeadLettered = 0";
count = (long)(await cmd.ExecuteScalarAsync(ct).ConfigureAwait(false) ?? 0L);
}
if (count < _capacity) return;
var toEvict = count - _capacity + 1;
using (var cmd = conn.CreateCommand())
{
cmd.CommandText = """
DELETE FROM Queue
WHERE RowId IN (
SELECT RowId FROM Queue
WHERE DeadLettered = 0
ORDER BY RowId ASC
LIMIT $n
)
""";
cmd.Parameters.AddWithValue("$n", toEvict);
await cmd.ExecuteNonQueryAsync(ct).ConfigureAwait(false);
}
lock (_statusLock) { _evictedCount += toEvict; }
_logger.Warning(
"Historian queue at capacity {Cap} — evicted {Count} oldest row(s) to make room (lifetime evictions: {Total})",
_capacity, toEvict, _evictedCount);
}
private void PurgeAgedDeadLetters()
{
var cutoff = (_clock() - _deadLetterRetention).ToString("O");
using var conn = new SqliteConnection(_connectionString);
conn.Open();
using var conn = OpenConnection();
using var cmd = conn.CreateCommand();
cmd.CommandText = """
DELETE FROM Queue
@@ -364,8 +585,7 @@ public sealed class SqliteStoreAndForwardSink : IAlarmHistorianSink, IDisposable
private void InitializeSchema()
{
using var conn = new SqliteConnection(_connectionString);
conn.Open();
using var conn = OpenConnection();
using var cmd = conn.CreateCommand();
cmd.CommandText = """
CREATE TABLE IF NOT EXISTS Queue (
@@ -39,7 +39,15 @@ public sealed class ScriptedAlarmEngine : IDisposable
private readonly Func<DateTime> _clock;
private readonly TimeSpan _scriptTimeout;
private readonly Dictionary<string, AlarmState> _alarms = new(StringComparer.Ordinal);
// ConcurrentDictionary, not a plain Dictionary: every mutation happens under
// _evalGate, but four read paths (GetState, GetAllStates, LoadedAlarmIds,
// RunShelvingCheck) touch _alarms from arbitrary threads (Admin UI request
// threads, the shelving Timer thread-pool callback) without holding the gate.
// A plain Dictionary read concurrent with a writer's entry reassignment can
// throw or return torn state; ConcurrentDictionary makes entry assignment and
// snapshot enumeration safe. The only write shapes are indexer-set and Clear,
// both of which ConcurrentDictionary supports atomically. (Core.ScriptedAlarms-001)
private readonly ConcurrentDictionary<string, AlarmState> _alarms = new(StringComparer.Ordinal);
private readonly ConcurrentDictionary<string, DataValueSnapshot> _valueCache
= new(StringComparer.Ordinal);
private readonly Dictionary<string, HashSet<string>> _alarmsReferencing
@@ -70,7 +78,7 @@ public sealed class ScriptedAlarmEngine : IDisposable
/// <summary>Raised for every emission the Part9StateMachine produces that the engine should publish.</summary>
public event EventHandler<ScriptedAlarmEvent>? OnEvent;
public IReadOnlyCollection<string> LoadedAlarmIds => _alarms.Keys;
public IReadOnlyCollection<string> LoadedAlarmIds => _alarms.Keys.ToArray();
/// <summary>
/// Load a batch of alarm definitions. Compiles every predicate, aggregates any
@@ -135,12 +143,17 @@ public sealed class ScriptedAlarmEngine : IDisposable
+ string.Join("\n ", compileFailures));
}
// Seed the value cache with current upstream values + subscribe for changes.
// Seed the value cache with current tag values before subscribing. The
// ReadTag calls happen first so that the initial predicate evaluation below
// (startup recovery, decision #14) uses a consistent snapshot.
// Subscriptions are established AFTER _loaded = true so that any synchronous
// initial-push an ITagUpstreamSource delivers from inside SubscribeTag arrives
// when _alarms is fully initialised. Before _loaded = true, a synchronous push
// would race the in-progress state restore and could overwrite the carefully
// seeded cache with a push that has no defined ordering relative to ReadTag.
// (Core.ScriptedAlarms-004)
foreach (var path in _alarmsReferencing.Keys)
{
_valueCache[path] = _upstream.ReadTag(path);
_upstreamSubscriptions.Add(_upstream.SubscribeTag(path, OnUpstreamChange));
}
// Restore persisted state, falling back to Fresh where nothing was saved,
// then re-derive ActiveState from the current predicate per decision #14.
@@ -155,8 +168,21 @@ public sealed class ScriptedAlarmEngine : IDisposable
}
_loaded = true;
// Subscribe after _loaded = true and full state restore. If an upstream
// implementation pushes its initial value synchronously from inside
// SubscribeTag, OnUpstreamChange will queue a ReevaluateAsync that acquires
// _evalGate — it will correctly block until LoadAsync releases the gate, then
// re-evaluate against the fully-populated _alarms dict.
foreach (var path in _alarmsReferencing.Keys)
_upstreamSubscriptions.Add(_upstream.SubscribeTag(path, OnUpstreamChange));
_engineLogger.Information("ScriptedAlarmEngine loaded {Count} alarm(s)", _alarms.Count);
// Dispose any previously-created timer before reassigning; a second LoadAsync
// call without this would leave two timers firing against the same engine.
// (Core.ScriptedAlarms-002)
_shelvingTimer?.Dispose();
// Start the shelving-check timer — ticks every 5s, expires any timed shelves
// that have passed their UnshelveAtUtc.
_shelvingTimer = new Timer(_ => RunShelvingCheck(),
@@ -212,8 +238,12 @@ public sealed class ScriptedAlarmEngine : IDisposable
try
{
var result = op(state.Condition);
_alarms[alarmId] = state with { Condition = result.State };
// Persist BEFORE updating in-memory so a store failure leaves both
// in-memory and persisted at the prior state rather than diverging.
// If SaveAsync throws, the in-memory _alarms entry stays unchanged and
// the exception propagates to the caller. (Core.ScriptedAlarms-007)
await _store.SaveAsync(result.State, ct).ConfigureAwait(false);
_alarms[alarmId] = state with { Condition = result.State };
if (result.Emission != EmissionKind.None) EmitEvent(state, result.State, result.Emission);
}
finally { _evalGate.Release(); }
@@ -240,6 +270,12 @@ public sealed class ScriptedAlarmEngine : IDisposable
await _evalGate.WaitAsync(ct).ConfigureAwait(false);
try
{
// Re-check after acquiring the gate: Dispose() may have completed while
// this call was waiting on _evalGate. Writing to a disposed store or
// mutating _alarms after disposal is unsafe.
// (Core.ScriptedAlarms-005)
if (_disposed) return;
foreach (var id in alarmIds)
{
if (!_alarms.TryGetValue(id, out var state)) continue;
@@ -247,8 +283,10 @@ public sealed class ScriptedAlarmEngine : IDisposable
state, state.Condition, _clock(), ct).ConfigureAwait(false);
if (!ReferenceEquals(newState, state.Condition))
{
_alarms[id] = state with { Condition = newState };
// Persist before updating in-memory so a store failure leaves
// both sides at the prior state. (Core.ScriptedAlarms-007)
await _store.SaveAsync(newState, ct).ConfigureAwait(false);
_alarms[id] = state with { Condition = newState };
}
}
}
@@ -369,6 +407,13 @@ public sealed class ScriptedAlarmEngine : IDisposable
_ = ShelvingCheckAsync(ids, CancellationToken.None);
}
/// <summary>
/// Test hook — triggers a shelving check synchronously without waiting for
/// the 5-second timer. Allows tests that inject a controllable clock to advance
/// time and immediately drive timed-shelve expiry. (Core.ScriptedAlarms-012)
/// </summary>
internal void RunShelvingCheckForTest() => RunShelvingCheck();
private async Task ShelvingCheckAsync(IReadOnlyList<string> alarmIds, CancellationToken ct)
{
try
@@ -376,6 +421,13 @@ public sealed class ScriptedAlarmEngine : IDisposable
await _evalGate.WaitAsync(ct).ConfigureAwait(false);
try
{
// Re-check after acquiring the gate: Timer.Dispose() does not wait for
// running callbacks, so a shelving-check callback that passed the _disposed
// check in RunShelvingCheck can arrive here after Dispose() has returned.
// Mutating _alarms or saving to a disposed store here is unsafe.
// (Core.ScriptedAlarms-005)
if (_disposed) return;
var now = _clock();
foreach (var id in alarmIds)
{
@@ -383,8 +435,10 @@ public sealed class ScriptedAlarmEngine : IDisposable
var result = Part9StateMachine.ApplyShelvingCheck(state.Condition, now);
if (!ReferenceEquals(result.State, state.Condition))
{
_alarms[id] = state with { Condition = result.State };
// Persist before updating in-memory so a store failure leaves
// both sides at the prior state. (Core.ScriptedAlarms-007)
await _store.SaveAsync(result.State, ct).ConfigureAwait(false);
_alarms[id] = state with { Condition = result.State };
if (result.Emission != EmissionKind.None)
EmitEvent(state, result.State, result.Emission);
}
@@ -419,7 +473,11 @@ public sealed class ScriptedAlarmEngine : IDisposable
_disposed = true;
_shelvingTimer?.Dispose();
UnsubscribeFromUpstream();
_alarms.Clear();
// Do NOT clear _alarms here: Timer.Dispose() does not wait for in-flight callbacks,
// so a ShelvingCheckAsync or ReevaluateAsync can still be running inside _evalGate.
// Those paths now re-check _disposed after acquiring the gate and bail out safely.
// Clearing _alarms outside the gate would race concurrent reads and is unnecessary
// (the whole object is being discarded). (Core.ScriptedAlarms-005)
_alarmsReferencing.Clear();
}
@@ -21,11 +21,15 @@ namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// token.
/// </para>
/// <para>
/// Identifier matching is by spelling: the extractor looks for
/// <c>ctx.GetTag(...)</c> / <c>ctx.SetVirtualTag(...)</c> literally. A deliberately
/// misspelled method call (<c>ctx.GetTagz</c>) is not picked up but will also fail
/// to compile against <see cref="ScriptContext"/>, so there's no way to smuggle a
/// dependency past the extractor while still having a working script.
/// Matching is by spelling: the extractor looks for member-access invocations
/// whose receiver identifier is literally <c>ctx</c> and whose method name is
/// <c>GetTag</c> or <c>SetVirtualTag</c>. A deliberately misspelled method call
/// (<c>ctx.GetTagz</c>) is not picked up but will also fail to compile against
/// <see cref="ScriptContext"/>, so there is no way to smuggle a dependency past the
/// extractor while still having a working script. Calls with the same method name on
/// a different receiver (<c>other.GetTag("X")</c>) are explicitly ignored so that
/// scripts defining local helper types with matching names do not produce spurious
/// dependencies. (Core.Scripting-004.)
/// </para>
/// </remarks>
public static class DependencyExtractor
@@ -67,10 +71,15 @@ public static class DependencyExtractor
public override void VisitInvocationExpression(InvocationExpressionSyntax node)
{
// Only interested in member-access form: ctx.GetTag(...) / ctx.SetVirtualTag(...).
// Anything else (free functions, chained calls, static calls) is ignored — but
// still visit children in case a ctx.GetTag call is nested inside.
if (node.Expression is MemberAccessExpressionSyntax member)
// Only interested in ctx.GetTag(...) / ctx.SetVirtualTag(...) — member-access
// form where the receiver is the identifier "ctx" (the ScriptGlobals<T>.ctx
// field). Calls with the same method name on a different receiver (e.g.
// someHelper.GetTag("X")) are ignored — not picking them up avoids spurious
// dependencies when scripts define local types with matching method names.
// (Core.Scripting-004.)
if (node.Expression is MemberAccessExpressionSyntax member
&& member.Expression is IdentifierNameSyntax receiver
&& receiver.Identifier.ValueText == "ctx")
{
var methodName = member.Name.Identifier.ValueText;
if (methodName is nameof(ScriptContext.GetTag) or nameof(ScriptContext.SetVirtualTag))
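Review note: the receiver check above can be exercised on its own with a syntax-only parse. The sketch below is an assumption-laden illustration, not the extractor itself: the script text and tag paths are made up, and it matches on the literal method names instead of `nameof(ScriptContext.GetTag)` so that it compiles without the project's types.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Hypothetical script text; tag paths are invented for illustration.
const string script = """
    var speed = ctx.GetTag("Line1/Speed");
    var ignored = other.GetTag("Line1/Ignored");
    ctx.SetVirtualTag("Line1/Alarm", true);
    """;

var options = new CSharpParseOptions(kind: SourceCodeKind.Script);
var root = CSharpSyntaxTree.ParseText(script, options).GetRoot();

// Keep only member-access invocations whose receiver is the identifier "ctx"
// and whose method name is GetTag or SetVirtualTag.
var matched = root.DescendantNodes()
    .OfType<InvocationExpressionSyntax>()
    .Where(inv => inv.Expression is MemberAccessExpressionSyntax member
                  && member.Expression is IdentifierNameSyntax receiver
                  && receiver.Identifier.ValueText == "ctx"
                  && (member.Name.Identifier.ValueText is "GetTag" or "SetVirtualTag"));

foreach (var call in matched)
    Console.WriteLine(call.ToString());
```

Run against the sample script, this should list the two `ctx` calls and skip `other.GetTag(...)`, which is the behaviour Core.Scripting-004 describes.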
@@ -18,12 +18,12 @@ namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <remarks>
/// <para>
/// Deny-list is the authoritative Phase 7 plan decision #6 set:
/// <c>System.IO</c>, <c>System.Net</c>, <c>System.Diagnostics.Process</c>,
/// <c>System.IO</c>, <c>System.Net</c>, <c>System.Diagnostics</c>,
/// <c>System.Reflection</c>, <c>System.Threading.Thread</c>,
/// <c>System.Runtime.InteropServices</c>. <c>System.Environment</c> (for process
/// env-var read) is explicitly left allowed — it's read-only process state, doesn't
/// persist outside, and the test file pins this compromise so tightening later is
/// a deliberate plan decision.
/// <c>System.Threading.Tasks</c> (scripts are synchronous predicates — no
/// legitimate need to start background tasks; a <c>Task.Run</c> fan-out outlives
/// the evaluation timeout entirely), <c>System.Runtime.InteropServices</c>,
/// <c>Microsoft.Win32</c>. (Core.Scripting-003.)
/// </para>
/// <para>
/// The deny-list is a prefix match. <c>System.Net</c> catches <c>System.Net.Http</c>,
@@ -32,6 +32,21 @@ namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// operator audience authors it through a helper the plan team adds as part of
/// the <see cref="ScriptContext"/> surface, not by unlocking the namespace.
/// </para>
/// <para>
/// A namespace-prefix deny-list is necessary but not sufficient: dangerous types
/// such as <c>System.Environment</c>, <c>System.AppDomain</c>, <c>System.GC</c>,
/// and <c>System.Activator</c> live <em>directly</em> in the <c>System</c>
/// namespace inside <c>System.Private.CoreLib</c> — the same allow-listed assembly
/// that supplies primitives (<c>int</c>, <c>string</c>, <c>Math</c>). They cannot
/// be blocked by namespace because <c>System</c> itself must stay allowed. They
/// are therefore denied <em>type-granularly</em> via
/// <see cref="ForbiddenFullTypeNames"/>. <c>Environment.Exit</c> /
/// <c>Environment.FailFast</c> kill the in-process OPC UA server outright;
/// <c>Activator.CreateInstance</c> is a reflection-equivalent escape; <c>GC</c>
/// and <c>AppDomain</c> expose process-wide control. Legitimate <c>System</c>
/// types (<c>Math</c>, <c>String</c>, <c>Convert</c>, <c>DateTime</c>, …) are not
/// on the list and stay usable. (Core.Scripting-001.)
/// </para>
/// </remarks>
public static class ForbiddenTypeAnalyzer
{
@@ -46,11 +61,58 @@ public static class ForbiddenTypeAnalyzer
[
"System.IO",
"System.Net",
"System.Diagnostics", // catches Process, ProcessStartInfo, EventLog, Trace/Debug file sinks
"System.Diagnostics", // catches Process, ProcessStartInfo, EventLog, Trace/Debug file sinks
"System.Reflection",
"System.Threading.Thread", // raw Thread — Tasks stay allowed (different namespace)
// System.Threading.Thread is NOT in this list: Thread's containing namespace is
// "System.Threading" (not "System.Threading.Thread"), so a prefix check on
// "System.Threading.Thread" never matches. Thread is denied type-granularly via
// ForbiddenFullTypeNames instead so the check actually fires.
"System.Threading.Tasks", // Task.Run / Parallel — scripts are synchronous predicates
// and have no legitimate need to start background work;
// a Task fan-out outlives the evaluation timeout entirely
// (Core.Scripting-003).
"System.Runtime.InteropServices",
"Microsoft.Win32", // registry
"Microsoft.Win32", // registry
];
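Review note: the reason a `"System.Threading.Thread"` prefix entry never fires can be shown with the same comparison CheckSymbol uses. A minimal sketch, assuming a hypothetical `IsForbiddenNamespace` helper and using runtime `Type.Namespace` strings in place of Roslyn symbols:

```csharp
using System;

// Same shape as the comparison in CheckSymbol: exact match, or prefix followed by a dot.
static bool IsForbiddenNamespace(string ns, string prefix) =>
    ns == prefix || ns.StartsWith(prefix + ".", StringComparison.Ordinal);

// Thread's namespace is "System.Threading", so a "System.Threading.Thread" prefix cannot match.
Console.WriteLine(IsForbiddenNamespace(
    typeof(System.Threading.Thread).Namespace!, "System.Threading.Thread"));   // False

// Task's namespace is exactly "System.Threading.Tasks".
Console.WriteLine(IsForbiddenNamespace(
    typeof(System.Threading.Tasks.Task).Namespace!, "System.Threading.Tasks")); // True

// "System.Net.Http" starts with "System.Net." and is caught by the System.Net entry.
Console.WriteLine(IsForbiddenNamespace(
    typeof(System.Net.Http.HttpClient).Namespace!, "System.Net"));             // True

// The trailing dot stops a sibling namespace from matching by accident
// ("System.NetFoo" is a made-up name).
Console.WriteLine(IsForbiddenNamespace("System.NetFoo", "System.Net"));        // False
```

The first result is why Thread has to be denied by full type name rather than by namespace prefix.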
/// <summary>
/// Fully-qualified type names scripts are NOT allowed to reference, regardless of
/// namespace. These types live directly in the allow-listed <c>System</c>
/// namespace (in <c>System.Private.CoreLib</c>), so a namespace-prefix rule cannot
/// reach them without also blocking primitives. Matched by exact fully-qualified
/// name against the resolved <em>type</em> symbol — every member of the type
/// (including read-only ones) is therefore rejected.
/// </summary>
/// <remarks>
/// <list type="bullet">
/// <item><c>System.Environment</c> — <c>Exit</c> / <c>FailFast</c> terminate
/// the host process; the whole type is denied (the read members have no
/// legitimate SCADA-predicate use either).</item>
/// <item><c>System.AppDomain</c> — process-wide assembly-load /
/// unhandled-exception control.</item>
/// <item><c>System.GC</c> — <c>Collect</c> / <c>AddMemoryPressure</c> perturb
/// the process memory subsystem.</item>
/// <item><c>System.Activator</c> — <c>CreateInstance</c> is a
/// reflection-equivalent escape that constructs a forbidden type by name
/// without ever naming it syntactically.</item>
/// <item><c>System.Threading.Thread</c> — raw thread creation bypasses the
/// per-evaluation timeout; denied type-granularly because its containing
/// namespace is <c>System.Threading</c> (shared with allowed types like
/// <c>CancellationToken</c>), so a namespace-prefix rule cannot reach it
/// without blocking unrelated types. (Core.Scripting-010.)</item>
/// </list>
/// </remarks>
public static readonly IReadOnlyList<string> ForbiddenFullTypeNames =
[
"System.Environment",
"System.AppDomain",
"System.GC",
"System.Activator",
// System.Threading.Thread lives in the System.Threading namespace (shared with
// CancellationToken, SemaphoreSlim, etc.), so a namespace-prefix deny-list cannot
// target it without blocking those legitimate types. Denied type-granularly here.
// (Core.Scripting-010.)
"System.Threading.Thread",
];
/// <summary>
@@ -58,6 +120,33 @@ public static class ForbiddenTypeAnalyzer
/// Returns empty list when the script is clean; non-empty list means the script
/// must be rejected at publish with the rejections surfaced to the operator.
/// </summary>
/// <remarks>
/// <para>
/// The walker has two passes per node. Pass (1) is the member / call surface:
/// <c>ObjectCreationExpressionSyntax</c>, <c>InvocationExpressionSyntax</c> with
/// a member-access target, <c>MemberAccessExpressionSyntax</c>, and bare
/// <c>IdentifierNameSyntax</c> are resolved via
/// <see cref="SemanticModel"/>.<c>GetSymbolInfo</c>. This catches static calls
/// (<c>System.IO.File.ReadAllText</c>) and constructors, and is deliberately
/// narrow: resolving <c>GetSymbolInfo</c> on <em>every</em> node would flag
/// harmless inherited members (e.g. <c>typeof(int).Name</c> resolves
/// <c>Name</c> to <c>System.Reflection.MemberInfo</c>, the base type that
/// declares it, even though the receiver type <c>System.Type</c> is allowed).
/// </para>
/// <para>
/// Pass (2) — the Core.Scripting-002 fix — resolves the <em>type</em> of every
/// <c>TypeSyntax</c> node via <c>GetTypeInfo</c>. The old walker only inspected
/// the four node kinds above, so a forbidden type named through
/// <c>typeof(System.IO.File)</c>, a generic argument
/// (<c>List&lt;System.IO.FileInfo&gt;</c>), a cast
/// (<c>(System.IO.Stream)null</c>), an <c>is</c> / <c>as</c> type pattern,
/// <c>default(System.Reflection.Assembly)</c>, an array-creation element type,
/// or an explicitly-typed local declaration produced no examined node and so
/// slipped through. Every <c>TypeSyntax</c> resolves to a concrete
/// <see cref="ITypeSymbol"/>; generic type arguments and array element types
/// are unwrapped recursively so a forbidden type nested at any depth is caught.
/// </para>
/// </remarks>
public static IReadOnlyList<ForbiddenTypeRejection> Analyze(Compilation compilation)
{
if (compilation is null) throw new ArgumentNullException(nameof(compilation));
@@ -69,6 +158,9 @@ public static class ForbiddenTypeAnalyzer
var root = tree.GetRoot();
foreach (var node in root.DescendantNodes())
{
// Pass (1) — member / call surface. Narrowly targeted at the node kinds
// that name a callable member or constructor, so inherited-member
// resolution does not produce false positives.
switch (node)
{
case ObjectCreationExpressionSyntax obj:
@@ -88,11 +180,43 @@ public static class ForbiddenTypeAnalyzer
CheckSymbol(semantic.GetSymbolInfo(id).Symbol, id.Span, rejections);
break;
}
// Pass (2) — type-reference surface (Core.Scripting-002). Every TypeSyntax
// resolves to the type it names, regardless of the syntactic form that
// introduced it (typeof operand, cast type, generic argument, default(T)
// operand, array element type, is/as pattern type, declared local type).
// Type arguments and array element types are walked recursively.
if (node is TypeSyntax)
CheckTypeSymbol(semantic.GetTypeInfo(node).Type, node.Span, rejections);
}
}
return rejections;
}
/// <summary>
/// Reject <paramref name="type"/> if it (or, recursively, any of its generic type
/// arguments / array element types) is forbidden. Walks the full type tree so a
/// forbidden type nested inside an allowed generic — e.g.
/// <c>List&lt;System.IO.FileInfo&gt;</c> — is still caught.
/// </summary>
private static void CheckTypeSymbol(ITypeSymbol? type, TextSpan span, List<ForbiddenTypeRejection> rejections)
{
if (type is null) return;
CheckSymbol(type, span, rejections);
switch (type)
{
case IArrayTypeSymbol array:
CheckTypeSymbol(array.ElementType, span, rejections);
break;
case INamedTypeSymbol named:
foreach (var arg in named.TypeArguments)
CheckTypeSymbol(arg, span, rejections);
break;
}
}
private static void CheckSymbol(ISymbol? symbol, TextSpan span, List<ForbiddenTypeRejection> rejections)
{
if (symbol is null) return;
@@ -107,17 +231,49 @@ public static class ForbiddenTypeAnalyzer
};
if (typeSymbol is null) return;
var typeName = typeSymbol.ToDisplayString();
// The broadened walk (Core.Scripting-002) resolves both GetSymbolInfo and
// GetTypeInfo on every node, so the same forbidden reference can be hit several
// times. Dedupe on span + type so the operator sees one rejection per offending
// reference, not a noisy pile of identical messages.
if (rejections.Any(r => r.Span == span && r.TypeName == typeName))
return;
var ns = typeSymbol.ContainingNamespace?.ToDisplayString() ?? string.Empty;
foreach (var forbidden in ForbiddenNamespacePrefixes)
{
if (ns == forbidden || ns.StartsWith(forbidden + ".", StringComparison.Ordinal))
{
rejections.Add(new ForbiddenTypeRejection(
Span: span,
TypeName: typeName,
Namespace: ns,
Message: $"Type '{typeName}' is in the forbidden namespace '{ns}'. " +
$"Scripts cannot reach {forbidden}* per Phase 7 sandbox rules."));
return;
}
}
// Type-granular deny-list — dangerous types that live in the allow-listed
// System namespace and so cannot be caught by ForbiddenNamespacePrefixes
// (Core.Scripting-001). Matched on the full type name; OriginalDefinition
// unwraps any generic construction before naming.
var fullTypeName = typeSymbol.OriginalDefinition.ToDisplayString(
SymbolDisplayFormat.FullyQualifiedFormat.WithGlobalNamespaceStyle(
SymbolDisplayGlobalNamespaceStyle.Omitted));
foreach (var forbiddenType in ForbiddenFullTypeNames)
{
if (fullTypeName == forbiddenType)
{
rejections.Add(new ForbiddenTypeRejection(
Span: span,
TypeName: typeSymbol.ToDisplayString(),
Namespace: ns,
Message: $"Type '{typeSymbol.ToDisplayString()}' is in the forbidden namespace '{ns}'. " +
$"Scripts cannot reach {forbidden}* per Phase 7 sandbox rules."));
Message: $"Type '{forbiddenType}' is on the Phase 7 sandbox forbidden-type " +
$"deny-list. Scripts cannot reach process-control types " +
$"(Environment / AppDomain / GC / Activator) even though they " +
$"live in the allowed 'System' namespace."));
return;
}
}
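The namespace check above compares on a segment boundary rather than on a raw string prefix. A minimal sketch of why the appended "." matters, assuming a hypothetical two-entry prefix list (the real `ForbiddenNamespacePrefixes` contents are not shown in this hunk):

```csharp
using System;

public static class NamespaceDenyList
{
    // Hypothetical prefixes for illustration only; not the project's actual list.
    private static readonly string[] Forbidden = { "System.IO", "System.Net" };

    public static bool IsForbidden(string ns)
    {
        foreach (var prefix in Forbidden)
        {
            // Exact match, or a child namespace. Appending "." stops "System.IOExtras"
            // from matching the "System.IO" prefix by accident.
            if (ns == prefix || ns.StartsWith(prefix + ".", StringComparison.Ordinal))
                return true;
        }
        return false;
    }
}
```

A plain `StartsWith(prefix)` would also reject unrelated namespaces that merely share leading characters, so the boundary check keeps the deny-list from over-matching.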
@@ -76,6 +76,14 @@ public sealed class TimedScriptEvaluator<TContext, TResult>
// WaitAsync's synthesized timeout — the inner task may still be running
// on its thread-pool thread (known leak documented in the class summary).
// Wrap so callers can distinguish from user-written timeout logic.
//
// The class docs guarantee "caller-supplied cancel wins over timeout".
// When both fire at nearly the same time, WaitAsync observes them in
// non-deterministic order, so a cancel that arrives a few µs after the
// timeout still reaches here as TimeoutException. Re-check the token so
// the guarantee holds regardless of race ordering. (Core.Scripting-007.)
if (ct.IsCancellationRequested)
throw new OperationCanceledException(ct);
throw new ScriptTimeoutException(Timeout);
}
}
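The re-check described in the comment above can be sketched in isolation. This is an illustrative helper, not the `TimedScriptEvaluator` code itself: the `TimedRunner` name and the rethrown `TimeoutException` (standing in for `ScriptTimeoutException`) are assumptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimedRunner
{
    public static async Task<T> RunAsync<T>(Func<Task<T>> work, TimeSpan timeout, CancellationToken ct)
    {
        try
        {
            // Task<T>.WaitAsync (available since .NET 6) throws TimeoutException when the
            // timeout elapses and OperationCanceledException when ct is cancelled.
            return await work().WaitAsync(timeout, ct).ConfigureAwait(false);
        }
        catch (TimeoutException)
        {
            // If the caller's token has also been cancelled by now, report cancellation
            // so "caller cancel wins" holds regardless of which signal was observed first.
            if (ct.IsCancellationRequested)
                throw new OperationCanceledException(ct);
            throw;
        }
    }
}
```

The inner task is not stopped by either path; as the original comment notes, it may keep running on its thread-pool thread.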
@@ -31,6 +31,11 @@ public sealed class DependencyGraph
private readonly Dictionary<string, HashSet<string>> _dependsOn = new(StringComparer.Ordinal);
private readonly Dictionary<string, HashSet<string>> _dependents = new(StringComparer.Ordinal);
// Cached topological rank — built lazily by TransitiveDependentsInOrder and
// invalidated whenever the graph is mutated (Add / Clear). Avoids re-running
// a full O(V+E) Kahn pass on every change-cascade event.
private Dictionary<string, int>? _cachedRank;
/// <summary>
/// Register a node and the set of tags it depends on. Idempotent — re-adding
/// the same node overwrites the prior dependency set, so re-publishing an edited
@@ -58,6 +63,7 @@ public sealed class DependencyGraph
_dependents[dep] = set = new HashSet<string>(StringComparer.Ordinal);
set.Add(nodeId);
}
_cachedRank = null; // graph mutated — invalidate cached rank
}
/// <summary>Tag paths <paramref name="nodeId"/> directly reads.</summary>
@@ -84,9 +90,11 @@ public sealed class DependencyGraph
var result = new List<string>();
var visited = new HashSet<string>(StringComparer.Ordinal);
var order = TopologicalSort();
var rank = new Dictionary<string, int>(StringComparer.Ordinal);
for (var i = 0; i < order.Count; i++) rank[order[i]] = i;
// Reuse the cached rank to avoid an O(V+E) Kahn pass on every change event.
// The cache is invalidated whenever the graph is mutated (Add / Clear), so it
// is always consistent with the current graph structure.
var rank = GetOrBuildRank();
// DFS from the changed node collecting every reachable dependent.
var stack = new Stack<string>();
@@ -115,6 +123,16 @@ public sealed class DependencyGraph
return result;
}
private Dictionary<string, int> GetOrBuildRank()
{
if (_cachedRank is not null) return _cachedRank;
var order = TopologicalSort();
var rank = new Dictionary<string, int>(order.Count, StringComparer.Ordinal);
for (var i = 0; i < order.Count; i++) rank[order[i]] = i;
_cachedRank = rank;
return rank;
}
/// <summary>Iterable of every registered node id (inputs-only tags excluded).</summary>
public IReadOnlyCollection<string> RegisteredNodes => _dependsOn.Keys;
@@ -249,6 +267,7 @@ public sealed class DependencyGraph
{
_dependsOn.Clear();
_dependents.Clear();
_cachedRank = null; // graph cleared — invalidate cached rank
}
}
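The cache-and-invalidate pattern used for `_cachedRank` can be shown on a reduced graph. The sketch below is an assumption-laden stand-in, not `DependencyGraph`: the `RankedGraph` type, its `RankBuilds` counter, and the simple (unoptimized) Kahn pass exist only to make the caching behaviour observable.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class RankedGraph
{
    // node -> set of nodes it depends on
    private readonly Dictionary<string, HashSet<string>> _dependsOn = new(StringComparer.Ordinal);
    private Dictionary<string, int>? _cachedRank;

    // Illustration only: counts how many times the rank was actually rebuilt.
    public int RankBuilds { get; private set; }

    public void Add(string node, params string[] deps)
    {
        _dependsOn[node] = new HashSet<string>(deps, StringComparer.Ordinal);
        foreach (var d in deps)
            _dependsOn.TryAdd(d, new HashSet<string>(StringComparer.Ordinal));
        _cachedRank = null; // graph mutated, so the cached rank is no longer valid
    }

    public IReadOnlyDictionary<string, int> GetRank()
    {
        if (_cachedRank is not null) return _cachedRank;
        RankBuilds++;

        // Kahn's algorithm, written for clarity rather than speed.
        var remaining = _dependsOn.ToDictionary(kv => kv.Key, kv => kv.Value.Count, StringComparer.Ordinal);
        var ready = new Queue<string>(remaining.Where(kv => kv.Value == 0).Select(kv => kv.Key));
        var rank = new Dictionary<string, int>(StringComparer.Ordinal);
        while (ready.Count > 0)
        {
            var n = ready.Dequeue();
            rank[n] = rank.Count;
            foreach (var (other, deps) in _dependsOn)
                if (deps.Contains(n) && --remaining[other] == 0) ready.Enqueue(other);
        }
        if (rank.Count != _dependsOn.Count) throw new InvalidOperationException("Dependency cycle detected.");

        _cachedRank = rank;
        return rank;
    }
}
```

Repeated `GetRank()` calls between mutations reuse the same dictionary; only an `Add` forces the next call to rebuild.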
@@ -76,8 +76,15 @@ public sealed class VirtualTagEngine : IDisposable
_graph.Clear();
var compileFailures = new List<string>();
var seenPaths = new HashSet<string>(StringComparer.Ordinal);
foreach (var def in definitions)
{
if (!seenPaths.Add(def.Path))
{
compileFailures.Add($"{def.Path}: duplicate path — only one definition per path is allowed");
continue;
}
try
{
var extraction = DependencyExtractor.Extract(def.ScriptSource);
@@ -113,9 +120,10 @@ public sealed class VirtualTagEngine : IDisposable
// Subscribe to every referenced upstream path (driver tags only — virtual tags
// cascade internally). Seed the cache with current upstream values so first
// evaluations see something real.
var upstreamPaths = definitions
.SelectMany(d => _tags[d.Path].Reads)
// evaluations see something real. Iterate _tags.Values (the registered set) rather
// than definitions to avoid indexing by a raw input list that may contain duplicates.
var upstreamPaths = _tags.Values
.SelectMany(s => s.Reads)
.Where(p => !_tags.ContainsKey(p))
.Distinct(StringComparer.Ordinal);
foreach (var path in upstreamPaths)
@@ -229,12 +237,18 @@ public sealed class VirtualTagEngine : IDisposable
{
var ctxCache = BuildReadCache(state.Reads);
// Cold-start guard — hold the prior value when any upstream input is still
// unset or Bad-quality. Evaluating with nulls would throw inside the script
// (scripts cast ctx.GetTag(path).Value directly) and produce a persistent
// BadInternalError result until the upstream cache fills. Keeping the prior
// snapshot is more honest: the virtual tag simply hasn't been computed yet.
if (!AreInputsReady(ctxCache)) return;
// Cold-start guard — when any upstream input is still unset or Bad-quality,
// publish a BadWaitingForInitialData snapshot so OPC UA clients see a defined
// quality rather than observing "not yet computed" as a stale Good value.
// Evaluating with nulls would throw inside the script (scripts cast
// ctx.GetTag(path).Value directly) and produce a persistent BadInternalError.
if (!AreInputsReady(ctxCache))
{
var notReady = new DataValueSnapshot(null, 0x80320000u /* BadWaitingForInitialData */, null, _clock());
_valueCache[path] = notReady;
NotifyObservers(path, notReady);
return;
}
var context = new VirtualTagContext(
ctxCache,
@@ -247,7 +261,12 @@ public sealed class VirtualTagEngine : IDisposable
{
var raw = await state.Evaluator.RunAsync(context, ct).ConfigureAwait(false);
var coerced = CoerceResult(raw, state.Definition.DataType);
result = new DataValueSnapshot(coerced, 0u, _clock(), _clock());
// null from CoerceResult means the conversion threw (raw was non-null but
// not convertible to the declared type). Surface as BadInternalError so
// the OPC UA client sees a defined Bad quality rather than a Good null.
result = (raw is not null && coerced is null)
? new DataValueSnapshot(null, 0x80020000u /* BadInternalError */, null, _clock())
: new DataValueSnapshot(coerced, 0u, _clock(), _clock());
}
catch (ScriptTimeoutException tex)
{
@@ -315,6 +334,14 @@ public sealed class VirtualTagEngine : IDisposable
_valueCache[path] = snap;
NotifyObservers(path, snap);
if (_tags[path].Definition.Historize) _history.Record(path, snap);
// A cross-tag write must participate in the change-trigger cascade, exactly
// like an upstream delta — any change-triggered tag that reads this path
// would otherwise go stale until an unrelated trigger fires (see
// docs/VirtualTags.md, VirtualTagContext section). Fire-and-forget: this
// callback runs inside EvaluateInternalAsync with the non-reentrant
// _evalGate held, so the cascade must be scheduled, not invoked inline.
_ = CascadeAsync(path, CancellationToken.None);
}
private void NotifyObservers(string path, DataValueSnapshot value)
@@ -49,19 +49,20 @@ public sealed class VirtualTagSource : IReadable, ISubscribable
var handle = new SubscriptionHandle(Guid.NewGuid().ToString("N"));
var observers = new List<IDisposable>(fullReferences.Count);
foreach (var path in fullReferences)
{
observers.Add(_engine.Subscribe(path, (p, snap) =>
OnDataChange?.Invoke(this, new DataChangeEventArgs(handle, p, snap))));
}
_subs[handle.DiagnosticId] = new Subscription(handle, observers);
// OPC UA convention: emit initial-data callback for each path with the current value.
// OPC UA convention: for each path, emit the initial-data callback BEFORE
// registering the change observer. This prevents a race where an upstream change
// fires the observer between the Subscribe call and the Read call, which would
// deliver a newer change event before the initial-data event, leaving the client
// with a stale last-known value.
foreach (var path in fullReferences)
{
var snap = _engine.Read(path);
OnDataChange?.Invoke(this, new DataChangeEventArgs(handle, path, snap));
observers.Add(_engine.Subscribe(path, (p, s) =>
OnDataChange?.Invoke(this, new DataChangeEventArgs(handle, p, s))));
}
_subs[handle.DiagnosticId] = new Subscription(handle, observers);
return Task.FromResult<ISubscriptionHandle>(handle);
}
@@ -79,16 +79,15 @@ public sealed class PermissionTrie
private static void WalkSystemPlatform(PermissionTrieNode ns, NodeScope scope, HashSet<string> groups, List<MatchedGrant> matches)
{
// FolderSegments are nested under the namespace; each is its own trie level. Reuse the
// UnsArea scope kind for the flags — NodeAcl rows for Galaxy tags carry ScopeKind.Tag
// for leaf grants and ScopeKind.Namespace for folder-root grants; deeper folder grants
// are modeled as Equipment-level rows today since NodeAclScopeKind doesn't enumerate
// a dedicated FolderSegment kind. Future-proof TODO tracked in Stream B follow-up.
// FolderSegments are nested under the namespace; each is its own trie level. Use the
// dedicated FolderSegment scope kind so Galaxy folder grants report their true scope in
// AuthorizationDecision.Provenance — distinguishing them from UNS Equipment grants in
// the audit trail and Admin UI "Probe this permission" diagnostic.
var current = ns;
foreach (var segment in scope.FolderSegments)
{
if (!current.Children.TryGetValue(segment, out var child)) return;
CollectAtLevel(child, NodeAclScopeKind.Equipment, groups, matches);
CollectAtLevel(child, NodeAclScopeKind.FolderSegment, groups, matches);
current = child;
}
@@ -54,26 +54,51 @@ public sealed class PermissionTrieCache
/// <summary>
/// Retain only the most-recent <paramref name="keepLatest"/> generations for a cluster.
/// No-op when there's nothing to drop.
/// No-op when there's nothing to drop. Thread-safe: uses a CAS loop with
/// <see cref="ConcurrentDictionary{TKey,TValue}.TryUpdate"/> (reference equality on the
/// class-typed entry) so a concurrent <see cref="Install"/> on the same cluster is never
/// silently overwritten.
/// </summary>
public void Prune(string clusterId, int keepLatest = 3)
{
if (keepLatest < 1) throw new ArgumentOutOfRangeException(nameof(keepLatest), keepLatest, "keepLatest must be >= 1");
if (!_byCluster.TryGetValue(clusterId, out var entry)) return;
if (entry.Tries.Count <= keepLatest) return;
var keep = entry.Tries
.OrderByDescending(kvp => kvp.Key)
.Take(keepLatest)
.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
_byCluster[clusterId] = new ClusterEntry(entry.Current, keep);
// CAS retry loop: read a snapshot, compute the pruned entry, atomically swap.
// Retry if another writer (Install or a concurrent Prune) updated the entry first.
while (true)
{
if (!_byCluster.TryGetValue(clusterId, out var observed)) return;
if (observed.Tries.Count <= keepLatest) return;
var keep = observed.Tries
.OrderByDescending(kvp => kvp.Key)
.Take(keepLatest)
.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
// Preserve the current pointer; if it was pruned (shouldn't happen since Current
// is always the newest generation), fall back to the newest retained entry.
var current = keep.TryGetValue(observed.Current.GenerationId, out var kept)
? kept
: keep.OrderByDescending(kvp => kvp.Key).First().Value;
var pruned = new ClusterEntry(current, keep);
// TryUpdate uses reference equality for ClusterEntry (class, not record) so it
// succeeds only when the stored reference is still the one we observed.
if (_byCluster.TryUpdate(clusterId, pruned, observed))
return;
// Another thread updated the entry between our read and our write — re-read and retry.
}
}
/// <summary>Diagnostics counter: number of cached (cluster, generation) tries.</summary>
public int CachedTrieCount => _byCluster.Values.Sum(e => e.Tries.Count);
private sealed record ClusterEntry(PermissionTrie Current, IReadOnlyDictionary<long, PermissionTrie> Tries)
// Class (not record) so TryUpdate in Prune uses reference equality for the CAS comparison.
private sealed class ClusterEntry(PermissionTrie current, IReadOnlyDictionary<long, PermissionTrie> tries)
{
public PermissionTrie Current { get; } = current;
public IReadOnlyDictionary<long, PermissionTrie> Tries { get; } = tries;
public static ClusterEntry FromSingle(PermissionTrie trie) =>
new(trie, new Dictionary<long, PermissionTrie> { [trie.GenerationId] = trie });
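The compare-and-swap loop in `Prune` relies on `ConcurrentDictionary.TryUpdate` comparing the stored value with the observed one. A minimal sketch of the same pattern, assuming a simplified `GenerationCache` with string payloads in place of `PermissionTrie` (all names here are hypothetical):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

public sealed class GenerationCache
{
    // A class, not a record: default equality is reference equality, so TryUpdate
    // succeeds only if the stored entry is still the exact instance we read.
    private sealed class Entry
    {
        public Entry(IReadOnlyDictionary<long, string> items) => Items = items;
        public IReadOnlyDictionary<long, string> Items { get; }
    }

    private readonly ConcurrentDictionary<string, Entry> _byKey = new();

    public void Install(string key, long generation, string payload) =>
        _byKey.AddOrUpdate(key,
            _ => new Entry(new Dictionary<long, string> { [generation] = payload }),
            (_, old) => new Entry(new Dictionary<long, string>(old.Items) { [generation] = payload }));

    public void Prune(string key, int keepLatest)
    {
        if (keepLatest < 1) throw new ArgumentOutOfRangeException(nameof(keepLatest));
        while (true)
        {
            if (!_byKey.TryGetValue(key, out var observed)) return;
            if (observed.Items.Count <= keepLatest) return;

            var keep = observed.Items
                .OrderByDescending(kv => kv.Key)
                .Take(keepLatest)
                .ToDictionary(kv => kv.Key, kv => kv.Value);

            // Swap only if nobody replaced the entry since we read it; otherwise retry.
            if (_byKey.TryUpdate(key, new Entry(keep), observed)) return;
        }
    }

    public long[] Generations(string key) =>
        _byKey.TryGetValue(key, out var e) ? e.Items.Keys.OrderBy(g => g).ToArray() : Array.Empty<long>();
}
```

With a record, value-based equality could make `TryUpdate` succeed against a different but equal-looking entry, which is the situation the switch to a class avoids.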
@@ -37,6 +37,21 @@ public sealed class TriePermissionEvaluator : IPermissionEvaluator
var trie = _cache.GetTrie(scope.ClusterId);
if (trie is null) return AuthorizationDecision.NotGranted();
// Decision #153 / Phase 6.2 adversarial-review item #3 (redundancy-safe invalidation):
// the GetTrie shortcut returns whatever generation the cache currently holds, which may
// have advanced past the generation this session was bound to (another node published).
// Evaluate against the session's *bound* generation so a grant added or removed in a
// newer generation cannot silently take effect mid-session, and so the provenance in the
// AuthorizationDecision reports the generation that actually produced the verdict.
if (trie.GenerationId != session.AuthGenerationId)
{
trie = _cache.GetTrie(scope.ClusterId, session.AuthGenerationId);
// The session's bound generation has been pruned out of the cache — fail closed and
// force the caller to re-resolve the session's auth state before retrying.
if (trie is null) return AuthorizationDecision.NotGranted();
}
var matches = trie.CollectMatches(scope, session.LdapGroups);
if (matches.Count == 0) return AuthorizationDecision.NotGranted();
@@ -7,13 +7,19 @@ namespace ZB.MOM.WW.OtOpcUa.Core.Authorization;
/// </summary>
/// <remarks>
/// Per decision #151 the membership is bounded by <see cref="MembershipFreshnessInterval"/>
/// (default 15 min). After that, the next hot-path authz call re-resolves LDAP group
/// (default 5 min). After that, the next hot-path authz call re-resolves LDAP group
/// memberships; failure to re-resolve (LDAP unreachable) flips the session to fail-closed
/// until a refresh succeeds.
///
/// Per decision #152 <see cref="AuthCacheMaxStaleness"/> (default 5 min) is separate from
/// Per decision #152 <see cref="AuthCacheMaxStaleness"/> (default 15 min) is separate from
/// Phase 6.1's availability-oriented 24h cache — beyond this window the evaluator returns
/// <see cref="AuthorizationVerdict.NotGranted"/> regardless of config-cache warmth.
///
/// The freshness window is the inner trigger and the staleness ceiling the outer hard
/// limit: <see cref="MembershipFreshnessInterval"/> MUST be strictly less than
/// <see cref="AuthCacheMaxStaleness"/> so that <see cref="NeedsRefresh"/> ("re-resolve
/// while still serving cached memberships") has a non-empty window before
/// <see cref="IsStale"/> fails the session closed.
/// </remarks>
public sealed record UserAuthorizationState
{
@@ -47,10 +53,10 @@ public sealed record UserAuthorizationState
public required long MembershipVersion { get; init; }
/// <summary>Bounded membership freshness window; past this the next authz call refreshes.</summary>
public TimeSpan MembershipFreshnessInterval { get; init; } = TimeSpan.FromMinutes(15);
public TimeSpan MembershipFreshnessInterval { get; init; } = TimeSpan.FromMinutes(5);
/// <summary>Hard staleness ceiling — beyond this, the evaluator fails closed.</summary>
public TimeSpan AuthCacheMaxStaleness { get; init; } = TimeSpan.FromMinutes(5);
public TimeSpan AuthCacheMaxStaleness { get; init; } = TimeSpan.FromMinutes(15);
/// <summary>
/// True when <paramref name="utcNow"/> - <see cref="MembershipResolvedUtc"/> exceeds
@@ -36,12 +36,25 @@ public class GenericDriverNodeManager(IDriver driver) : IDisposable
/// Populates the address space by streaming nodes from the driver into the supplied builder,
/// wraps the builder so alarm-condition sinks are captured, subscribes to the driver's
/// alarm event stream, and routes each transition to the matching sink by <c>SourceNodeId</c>.
/// Driver exceptions are isolated per decision #12 — the driver's subtree is marked Faulted,
/// but other drivers remain available.
/// If called a second time (e.g. Galaxy redeploy via <c>IRediscoverable.OnRediscoveryNeeded</c>)
/// the previous alarm subscription is torn down and the sink registry is cleared before
/// re-walking, preventing double delivery of alarm transitions.
/// Exception isolation (marking the driver's subtree Faulted) is the caller's responsibility —
/// exceptions from <see cref="ITagDiscovery.DiscoverAsync"/> propagate to the caller.
/// </summary>
public async Task BuildAddressSpaceAsync(IAddressSpaceBuilder builder, CancellationToken ct)
{
ArgumentNullException.ThrowIfNull(builder);
ObjectDisposedException.ThrowIf(_disposed, this);
// Tear down any previous alarm subscription before re-walking so a second call (e.g. on
// Galaxy redeploy) does not leave the old forwarder subscribed and double-fire events.
if (_alarmForwarder is not null && Driver is IAlarmSource existingSource)
{
existingSource.OnAlarmEvent -= _alarmForwarder;
_alarmForwarder = null;
}
_alarmSinks.Clear();
if (Driver is not ITagDiscovery discovery)
throw new NotSupportedException($"Driver '{Driver.DriverInstanceId}' does not implement ITagDiscovery.");
@@ -48,7 +48,9 @@ public sealed class AlarmSurfaceInvoker
/// <summary>
/// Subscribe to alarm events for a set of source node ids, fanning out by resolved host
/// so per-host breakers / bulkheads apply. Returns one handle per host — callers that
/// don't care about per-host separation may concatenate them.
/// don't care about per-host separation may concatenate them. Each returned handle wraps
/// the driver's opaque handle together with its resolved host so <see cref="UnsubscribeAsync"/>
/// routes through the same host's pipeline that the subscription was created on.
/// </summary>
public async Task<IReadOnlyList<IAlarmSubscriptionHandle>> SubscribeAsync(
IReadOnlyList<string> sourceNodeIds,
@@ -61,24 +63,34 @@ public sealed class AlarmSurfaceInvoker
var handles = new List<IAlarmSubscriptionHandle>(byHost.Count);
foreach (var (host, ids) in byHost)
{
var handle = await _invoker.ExecuteAsync(
var inner = await _invoker.ExecuteAsync(
DriverCapability.AlarmSubscribe,
host,
async ct => await _alarmSource.SubscribeAlarmsAsync(ids, ct).ConfigureAwait(false),
cancellationToken).ConfigureAwait(false);
handles.Add(handle);
handles.Add(new HostBoundHandle(inner, host));
}
return handles;
}
/// <summary>Cancel an alarm subscription. Routes through the AlarmSubscribe pipeline for parity.</summary>
/// <summary>
/// Cancel an alarm subscription. Routes through the same host's resilience pipeline
/// that the subscription was created on (carried in the <see cref="HostBoundHandle"/>
/// wrapper returned by <see cref="SubscribeAsync"/>). Falls back to the default host for
/// handles not created by this invoker so the method remains safe to call on any
/// <see cref="IAlarmSubscriptionHandle"/> implementation.
/// </summary>
public ValueTask UnsubscribeAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(handle);
var (innerHandle, host) = handle is HostBoundHandle bound
? (bound.Inner, bound.Host)
: (handle, _defaultHost);
return _invoker.ExecuteAsync(
DriverCapability.AlarmSubscribe,
_defaultHost,
async ct => await _alarmSource.UnsubscribeAlarmsAsync(handle, ct).ConfigureAwait(false),
host,
async ct => await _alarmSource.UnsubscribeAlarmsAsync(innerHandle, ct).ConfigureAwait(false),
cancellationToken);
}
@@ -126,4 +138,16 @@ public sealed class AlarmSurfaceInvoker
}
return result;
}
/// <summary>
/// Wraps an <see cref="IAlarmSubscriptionHandle"/> returned by the driver with the
/// resolved host name used when the subscription was created. <see cref="UnsubscribeAsync"/>
/// unwraps this to route the unsubscribe through the same host's resilience pipeline.
/// </summary>
private sealed class HostBoundHandle(IAlarmSubscriptionHandle inner, string host) : IAlarmSubscriptionHandle
{
public IAlarmSubscriptionHandle Inner { get; } = inner;
public string Host { get; } = host;
public string DiagnosticId => Inner.DiagnosticId;
}
}
@@ -56,4 +56,19 @@ public abstract class AbCipCommandBase : DriverCommandBase
/// multiple gateways in parallel can distinguish the logs.
/// </summary>
protected string DriverInstanceId => $"abcip-cli-{Gateway}";
/// <summary>
/// Guards against <see cref="AbCipDataType.Structure"/> being passed to a command
/// that does not support UDT layouts. Call at the top of <c>ExecuteAsync</c> for any
/// command that accepts <c>--type</c> but cannot handle memberless Structure tags.
/// Throws a <see cref="CliFx.Exceptions.CommandException"/> if <paramref name="type"/>
/// is <see cref="AbCipDataType.Structure"/>.
/// </summary>
protected static void RejectStructure(AbCipDataType type)
{
if (type == AbCipDataType.Structure)
throw new CliFx.Exceptions.CommandException(
"Structure (UDT) reads are out of scope for this command — those need an explicit " +
"member layout, which belongs in a real driver config.");
}
}
@@ -25,6 +25,7 @@ public sealed class ProbeCommand : AbCipCommandBase
public override async ValueTask ExecuteAsync(IConsole console)
{
ConfigureLogging();
RejectStructure(DataType);
var ct = console.RegisterCancellationHandler();
var probeTag = new AbCipTagDefinition(
@@ -27,6 +27,7 @@ public sealed class ReadCommand : AbCipCommandBase
public override async ValueTask ExecuteAsync(IConsole console)
{
ConfigureLogging();
RejectStructure(DataType);
var ct = console.RegisterCancellationHandler();
var tagName = SynthesiseTagName(TagPath, DataType);
@@ -30,6 +30,7 @@ public sealed class SubscribeCommand : AbCipCommandBase
public override async ValueTask ExecuteAsync(IConsole console)
{
ConfigureLogging();
RejectStructure(DataType);
var ct = console.RegisterCancellationHandler();
var tagName = ReadCommand.SynthesiseTagName(TagPath, DataType);
@@ -66,23 +66,40 @@ public sealed class WriteCommand : AbCipCommandBase
/// <summary>
/// Parse the operator's <c>--value</c> string into the CLR type the driver expects
/// for the declared <see cref="AbCipDataType"/>. Invariant culture everywhere.
/// Bad input (non-numeric text, out-of-range value) is caught and rethrown as a
/// <see cref="CliFx.Exceptions.CommandException"/> so CliFx renders a clean one-line
/// error rather than a full .NET stack trace.
/// </summary>
internal static object ParseValue(string raw, AbCipDataType type) => type switch
internal static object ParseValue(string raw, AbCipDataType type)
{
AbCipDataType.Bool => ParseBool(raw),
AbCipDataType.SInt => sbyte.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.Int => short.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.DInt or AbCipDataType.Dt => int.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.LInt => long.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.USInt => byte.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.UInt => ushort.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.UDInt => uint.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.ULInt => ulong.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.Real => float.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.LReal => double.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.String => raw,
_ => throw new CliFx.Exceptions.CommandException($"Unsupported DataType '{type}' for write."),
};
try
{
return type switch
{
AbCipDataType.Bool => ParseBool(raw),
AbCipDataType.SInt => sbyte.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.Int => short.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.DInt or AbCipDataType.Dt => int.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.LInt => long.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.USInt => byte.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.UInt => ushort.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.UDInt => uint.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.ULInt => ulong.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.Real => float.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.LReal => double.Parse(raw, CultureInfo.InvariantCulture),
AbCipDataType.String => raw,
_ => throw new CliFx.Exceptions.CommandException($"Unsupported DataType '{type}' for write."),
};
}
catch (Exception ex) when (ex is FormatException or OverflowException)
{
throw new CliFx.Exceptions.CommandException(
$"Cannot parse '{raw}' as {type}. " +
$"Check the value is within the valid range for {type} and uses invariant-culture " +
$"decimal notation (e.g. '3.14', not '3,14').",
innerException: ex);
}
}
private static bool ParseBool(string raw) => raw.Trim().ToLowerInvariant() switch
{
@@ -59,17 +59,38 @@ public sealed class WriteCommand : AbLegacyCommandBase
}
/// <summary>Parse <c>--value</c> per <see cref="AbLegacyDataType"/>, invariant culture.</summary>
internal static object ParseValue(string raw, AbLegacyDataType type) => type switch
/// <exception cref="CliFx.Exceptions.CommandException">
/// Thrown when <paramref name="raw"/> cannot be parsed as the requested type (malformed
/// input or out-of-range value) so CliFx renders a clean one-line error instead of a raw
/// stack trace.
/// </exception>
internal static object ParseValue(string raw, AbLegacyDataType type)
{
AbLegacyDataType.Bit => ParseBool(raw),
AbLegacyDataType.Int or AbLegacyDataType.AnalogInt => short.Parse(raw, CultureInfo.InvariantCulture),
AbLegacyDataType.Long => int.Parse(raw, CultureInfo.InvariantCulture),
AbLegacyDataType.Float => float.Parse(raw, CultureInfo.InvariantCulture),
AbLegacyDataType.String => raw,
AbLegacyDataType.TimerElement or AbLegacyDataType.CounterElement
or AbLegacyDataType.ControlElement => int.Parse(raw, CultureInfo.InvariantCulture),
_ => throw new CliFx.Exceptions.CommandException($"Unsupported DataType '{type}' for write."),
};
try
{
return type switch
{
AbLegacyDataType.Bit => ParseBool(raw),
AbLegacyDataType.Int or AbLegacyDataType.AnalogInt => short.Parse(raw, CultureInfo.InvariantCulture),
AbLegacyDataType.Long => int.Parse(raw, CultureInfo.InvariantCulture),
AbLegacyDataType.Float => float.Parse(raw, CultureInfo.InvariantCulture),
AbLegacyDataType.String => raw,
AbLegacyDataType.TimerElement or AbLegacyDataType.CounterElement
or AbLegacyDataType.ControlElement => int.Parse(raw, CultureInfo.InvariantCulture),
_ => throw new CliFx.Exceptions.CommandException($"Unsupported DataType '{type}' for write."),
};
}
catch (FormatException ex)
{
throw new CliFx.Exceptions.CommandException(
$"Value '{raw}' is not a valid {type}: {ex.Message}", innerException: ex);
}
catch (OverflowException ex)
{
throw new CliFx.Exceptions.CommandException(
$"Value '{raw}' is out of range for {type}: {ex.Message}", innerException: ex);
}
}
private static bool ParseBool(string raw) => raw.Trim().ToLowerInvariant() switch
{
@@ -7,8 +7,8 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Cli.Common;
/// <summary>
/// Shared base for every driver test-client command (Modbus / AB CIP / AB Legacy / S7 /
/// TwinCAT). Carries the options that are meaningful regardless of protocol — verbose
/// logging + the standard timeout — plus helpers every command implementation wants:
/// TwinCAT / FOCAS). Carries the options that are meaningful regardless of protocol —
/// verbose logging + the standard timeout — plus helpers every command implementation wants:
/// Serilog configuration + cancellation-token capture.
/// </summary>
/// <remarks>
@@ -44,17 +44,37 @@ public abstract class DriverCommandBase : ICommand
public abstract ValueTask ExecuteAsync(IConsole console);
/// <summary>
/// Configures the process-global Serilog logger. Commands call this at the top of
/// <see cref="ExecuteAsync"/> so driver-internal <c>Log.Logger</c> writes land on the
/// same sink as the CLI's operator-facing output.
/// Configures the process-global Serilog logger. Intended to be called exactly once,
/// at the top of <see cref="ExecuteAsync"/>, so driver-internal <c>Log.Logger</c>
/// writes land on the same sink as the CLI's operator-facing output.
/// If this command instance has already configured the logger, the call is a no-op (idempotent).
/// Call <see cref="FlushLogging"/> in a <c>finally</c> block to ensure buffered output
/// is flushed before the process exits.
/// </summary>
protected void ConfigureLogging()
{
if (_loggingConfigured) return;
_loggingConfigured = true;
// Capture the previous global logger (e.g. Serilog's silent bootstrap logger) so it
// can be disposed once Log.Logger has been replaced, releasing its resources cleanly.
var previous = Log.Logger;
var config = new LoggerConfiguration();
if (Verbose)
config.MinimumLevel.Debug().WriteTo.Console();
else
config.MinimumLevel.Warning().WriteTo.Console();
Log.Logger = config.CreateLogger();
(previous as IDisposable)?.Dispose();
}
/// <summary>
/// Flushes and closes the Serilog logger configured by <see cref="ConfigureLogging"/>.
/// Call this in a <c>finally</c> block inside <see cref="ExecuteAsync"/> to prevent
/// buffered log output from being lost on process exit, particularly for long-running
/// commands such as <c>subscribe</c>.
/// </summary>
protected static void FlushLogging() => Log.CloseAndFlush();
private bool _loggingConfigured;
}
@@ -65,9 +65,9 @@ public static class SnapshotFormatter
Time = FormatTimestamp(snapshots[i].SourceTimestampUtc),
}).ToArray();
int tagW = Math.Max("TAG".Length, rows.Max(r => r.Tag.Length));
int valW = Math.Max("VALUE".Length, rows.Max(r => r.Value.Length));
int statW = Math.Max("STATUS".Length, rows.Max(r => r.Status.Length));
int tagW = rows.Length == 0 ? "TAG".Length : Math.Max("TAG".Length, rows.Max(r => r.Tag.Length));
int valW = rows.Length == 0 ? "VALUE".Length : Math.Max("VALUE".Length, rows.Max(r => r.Value.Length));
int statW = rows.Length == 0 ? "STATUS".Length : Math.Max("STATUS".Length, rows.Max(r => r.Status.Length));
// source-time column is fixed-width (ISO-8601 to ms) so no max-measurement needed.
var sb = new System.Text.StringBuilder();
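The `rows.Length == 0` guards exist because `Enumerable.Max` over an empty sequence of non-nullable `int` throws `InvalidOperationException`. A small sketch of the guarded width calculation, using a hypothetical `TableWidths.Width` helper rather than the formatter's own code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TableWidths
{
    // Column width = the longer of the header and the widest cell.
    // With no cells, fall back to the header width instead of calling Max on nothing.
    public static int Width(string header, IReadOnlyList<string> cells) =>
        cells.Count == 0
            ? header.Length
            : Math.Max(header.Length, cells.Max(c => c.Length));
}
```

For an empty snapshot list this yields the header width, so the table can still render its header row.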
@@ -100,23 +100,42 @@ public static class SnapshotFormatter
public static string FormatStatus(uint statusCode)
{
// Match the OPC UA shorthand for the statuses most-likely to land in a CLI run.
// Anything outside this short-list surfaces as hex — operators can cross-reference
// against OPC UA Part 6 § 7.34 (StatusCode tables) or Core.Abstractions status mappers.
var name = statusCode switch
// OPC UA status codes carry sub-code and flag bits in the low 16 bits (info type,
// structure-changed, semantics-changed, limit bits, overflow, etc.). To ensure
// that e.g. 0x80050001 still reads as "BadCommunicationError" rather than bare hex,
// named codes are matched against the high-word mask (code & 0xFFFF0000). When no
// named match is found the severity class (top 2 bits) provides a meaningful fallback
// so operators always see at least Good / Uncertain / Bad rather than raw hex.
// Numeric codes are the canonical values from the OPC Foundation Opc.Ua.StatusCodes
// table; keep them in sync with that table if this list is extended.
var masked = statusCode & 0xFFFF0000u;
var name = masked switch
{
0x00000000u => "Good",
0x80000000u => "Bad",
0x80050000u => "BadCommunicationError",
0x80060000u => "BadTimeout",
0x80070000u => "BadNoCommunication",
0x80080000u => "BadWaitingForInitialData",
0x800A0000u => "BadTimeout",
0x80310000u => "BadNoCommunication",
0x80320000u => "BadWaitingForInitialData",
0x80340000u => "BadNodeIdUnknown",
0x80350000u => "BadNodeIdInvalid",
0x80330000u => "BadNodeIdInvalid",
0x80740000u => "BadTypeMismatch",
0x40000000u => "Uncertain",
_ => null,
};
if (name is null)
{
// Severity fallback: top 2 bits identify the quality class even for unknown
// sub-codes. 0x80000000 and 0xC0000000 (reserved quality) both map to "Bad".
name = (statusCode & 0xC0000000u) switch
{
0x00000000u => "Good",
0x40000000u => "Uncertain",
_ => "Bad",
};
}
return name is null
? $"0x{statusCode:X8}"
: $"0x{statusCode:X8} ({name})";
@@ -35,6 +35,21 @@ public sealed class SubscribeCommand : ModbusCommandBase
"BigEndian (default) or WordSwap.")]
public ModbusByteOrder ByteOrder { get; init; } = ModbusByteOrder.BigEndian;
// Driver.Modbus.Cli-001: subscribe previously lacked these three options that read and
// write both expose. Without them, BitInRegister always watches bit 0 and String runs with
// StringLength=0, silently producing wrong results for any subscriber using those types.
[CommandOption("bit-index", Description =
"For type=BitInRegister: which bit of the holding register (0-15, LSB-first).")]
public byte BitIndex { get; init; }
[CommandOption("string-length", Description =
"For type=String: character count (2 per register, rounded up).")]
public ushort StringLength { get; init; }
[CommandOption("string-byte-order", Description =
"For type=String: HighByteFirst (standard) or LowByteFirst (DirectLOGIC).")]
public ModbusStringByteOrder StringByteOrder { get; init; } = ModbusStringByteOrder.HighByteFirst;
public override async ValueTask ExecuteAsync(IConsole console)
{
ConfigureLogging();
@@ -47,7 +62,10 @@ public sealed class SubscribeCommand : ModbusCommandBase
Address: Address,
DataType: DataType,
Writable: false,
ByteOrder: ByteOrder);
ByteOrder: ByteOrder,
BitIndex: BitIndex,
StringLength: StringLength,
StringByteOrder: StringByteOrder);
var options = BuildOptions([tag]);
await using var driver = new ModbusDriver(options, DriverInstanceId);
@@ -60,6 +60,16 @@ public sealed class WriteCommand : ModbusCommandBase
throw new CliFx.Exceptions.CommandException(
$"Region '{Region}' is read-only in the Modbus spec; writes require Coils or HoldingRegisters.");
// Driver.Modbus.Cli-002: coils are single-bit outputs — only Bool makes sense. A
// non-boolean type (e.g. --region Coils --type UInt16) would silently coerce the value
// to a boolean via Convert.ToBoolean, landing as ON for any non-zero value, with no
// diagnostic. Reject it early so the operator sees a clear error rather than a silent
// type-mismatch coercion.
if (Region == ModbusRegion.Coils && DataType != ModbusDataType.Bool)
throw new CliFx.Exceptions.CommandException(
$"Region 'Coils' only supports boolean values (--type Bool). " +
$"Type '{DataType}' cannot represent a single-bit coil write.");
var tagName = ReadCommand.SynthesiseTagName(Region, Address, DataType);
var tag = new ModbusTagDefinition(
Name: tagName,
@@ -34,6 +34,10 @@ public sealed class ProbeCommand : S7CommandBase
var options = BuildOptions([probeTag]);
await using var driver = new S7Driver(options, DriverInstanceId);
// Driver.S7.Cli-003: wrap the entire probe sequence so that a refused/unreachable TCP
// connect still prints the structured Host/CPU/Health lines instead of crashing with a
// full .NET stack trace. InitializeAsync sets health to Faulted with the exception
// message before re-throwing, so GetHealth() always has something to report.
try
{
await driver.InitializeAsync("{}", ct);
@@ -48,6 +52,20 @@ public sealed class ProbeCommand : S7CommandBase
await console.Output.WriteLineAsync();
await console.Output.WriteLineAsync(SnapshotFormatter.Format(Address, snapshot[0]));
}
catch (OperationCanceledException)
{
throw; // Ctrl+C — let CliFx handle it normally.
}
catch
{
// Connect / read failure — print what the driver knows so far.
var health = driver.GetHealth();
await console.Output.WriteLineAsync($"Host: {Host}:{Port}");
await console.Output.WriteLineAsync($"CPU: {CpuType} rack={Rack} slot={Slot}");
await console.Output.WriteLineAsync($"Health: {health.State}");
if (health.LastError is { } err)
await console.Output.WriteLineAsync($"Last error: {err}");
}
finally
{
await driver.ShutdownAsync(CancellationToken.None);
@@ -19,9 +19,12 @@ public sealed class ReadCommand : S7CommandBase
IsRequired = true)]
public string Address { get; init; } = default!;
// Driver.S7.Cli-002: help text trimmed to the types the driver actually implements.
// Int64 / UInt64 / Float64 / String / DateTime are defined in S7DataType but the driver
// raises NotSupportedException (→ BadNotSupported) on reads of those types.
[CommandOption("type", 't', Description =
"Bool / Byte / Int16 / UInt16 / Int32 / UInt32 / Int64 / UInt64 / Float32 / Float64 / " +
"String / DateTime (default Int16).")]
"Bool / Byte / Int16 / UInt16 / Int32 / UInt32 / Float32 (default Int16). " +
"Int64, UInt64, Float64, String, and DateTime are not yet implemented and will return BadNotSupported.")]
public S7DataType DataType { get; init; } = S7DataType.Int16;
[CommandOption("string-length", Description =
@@ -15,9 +15,10 @@ public sealed class SubscribeCommand : S7CommandBase
[CommandOption("address", 'a', Description = "S7 address — same format as `read`.", IsRequired = true)]
public string Address { get; init; } = default!;
// Driver.S7.Cli-002: help text trimmed to the types the driver actually implements.
[CommandOption("type", 't', Description =
"Bool / Byte / Int16 / UInt16 / Int32 / UInt32 / Int64 / UInt64 / Float32 / Float64 / " +
"String / DateTime (default Int16).")]
"Bool / Byte / Int16 / UInt16 / Int32 / UInt32 / Float32 (default Int16). " +
"Int64, UInt64, Float64, String, and DateTime are not yet implemented and will return BadNotSupported.")]
public S7DataType DataType { get; init; } = S7DataType.Int16;
[CommandOption("interval-ms", 'i', Description = "Publishing interval ms (default 1000).")]
@@ -18,9 +18,13 @@ public sealed class WriteCommand : S7CommandBase
"S7 address — same format as `read`.", IsRequired = true)]
public string Address { get; init; } = default!;
// Driver.S7.Cli-002: help text trimmed to the types the driver actually implements.
// Int64 / UInt64 / Float64 / String / DateTime are defined in S7DataType but the driver
// raises NotSupportedException (→ BadNotSupported) on any read/write of those types;
// advertising them misleads operators who then see BadNotSupported with no explanation.
[CommandOption("type", 't', Description =
"Bool / Byte / Int16 / UInt16 / Int32 / UInt32 / Int64 / UInt64 / Float32 / Float64 / " +
"String / DateTime (default Int16).")]
"Bool / Byte / Int16 / UInt16 / Int32 / UInt32 / Float32 (default Int16). " +
"Int64, UInt64, Float64, String, and DateTime are not yet implemented and will return BadNotSupported.")]
public S7DataType DataType { get; init; } = S7DataType.Int16;
[CommandOption("value", 'v', Description =
@@ -62,22 +66,44 @@ public sealed class WriteCommand : S7CommandBase
}
/// <summary>Parse <c>--value</c> per <see cref="S7DataType"/>, invariant culture throughout.</summary>
internal static object ParseValue(string raw, S7DataType type) => type switch
/// <remarks>
/// Driver.S7.Cli-001: numeric and <see cref="DateTime"/> parses are wrapped so that
/// malformed input (<see cref="FormatException"/> / <see cref="OverflowException"/>)
/// surfaces as a clean <see cref="CliFx.Exceptions.CommandException"/> rather than a
/// raw .NET stack trace — matching the friendly message the Bool path already produces.
/// </remarks>
internal static object ParseValue(string raw, S7DataType type)
{
S7DataType.Bool => ParseBool(raw),
S7DataType.Byte => byte.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.Int16 => short.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.UInt16 => ushort.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.Int32 => int.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.UInt32 => uint.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.Int64 => long.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.UInt64 => ulong.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.Float32 => float.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.Float64 => double.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.String => raw,
S7DataType.DateTime => DateTime.Parse(raw, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind),
_ => throw new CliFx.Exceptions.CommandException($"Unsupported DataType '{type}' for write."),
};
if (type == S7DataType.Bool) return ParseBool(raw);
if (type == S7DataType.String) return raw;
try
{
return type switch
{
S7DataType.Byte => (object)byte.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.Int16 => (object)short.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.UInt16 => (object)ushort.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.Int32 => (object)int.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.UInt32 => (object)uint.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.Int64 => (object)long.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.UInt64 => (object)ulong.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.Float32 => (object)float.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.Float64 => (object)double.Parse(raw, CultureInfo.InvariantCulture),
S7DataType.DateTime => (object)DateTime.Parse(raw, CultureInfo.InvariantCulture, DateTimeStyles.RoundtripKind),
_ => throw new CliFx.Exceptions.CommandException($"Unsupported DataType '{type}' for write."),
};
}
catch (FormatException ex)
{
throw new CliFx.Exceptions.CommandException(
$"Value '{raw}' is not a valid {type}: {ex.Message}");
}
catch (OverflowException ex)
{
throw new CliFx.Exceptions.CommandException(
$"Value '{raw}' is out of range for {type}: {ex.Message}");
}
}
private static bool ParseBool(string raw) => raw.Trim().ToLowerInvariant() switch
{
@@ -40,22 +40,29 @@ public enum AbCipDataType
public static class AbCipDataTypeExtensions
{
/// <summary>
/// Map to the driver-agnostic type the server's address-space builder consumes. Unsigned
/// Logix types widen into signed equivalents until <c>DriverDataType</c> picks up unsigned
/// + 64-bit variants (Modbus has the same gap — see <c>ModbusDriver.MapDataType</c>
/// comment re: PR 25).
/// Map to the driver-agnostic type the server's address-space builder consumes.
/// <c>DriverDataType</c> carries Int64, UInt32, and UInt64 so each Logix type maps
/// to the widest correct signed/unsigned equivalent without silent truncation:
/// <list type="bullet">
/// <item>LInt (signed 64-bit) → Int64; ULInt (unsigned 64-bit) → UInt64.</item>
/// <item>UDInt (unsigned 32-bit) → UInt32 so values above Int32.MaxValue are not
/// wrapped to negative (Driver.AbCip-004).</item>
/// <item>USInt / UInt widen into Int32; they can never overflow it.</item>
/// </list>
/// </summary>
public static DriverDataType ToDriverDataType(this AbCipDataType t) => t switch
{
AbCipDataType.Bool => DriverDataType.Boolean,
AbCipDataType.SInt or AbCipDataType.Int or AbCipDataType.DInt => DriverDataType.Int32,
AbCipDataType.USInt or AbCipDataType.UInt or AbCipDataType.UDInt => DriverDataType.Int32,
AbCipDataType.LInt or AbCipDataType.ULInt => DriverDataType.Int32, // TODO: Int64 — matches Modbus gap
AbCipDataType.USInt or AbCipDataType.UInt => DriverDataType.Int32,
AbCipDataType.UDInt => DriverDataType.UInt32,
AbCipDataType.LInt => DriverDataType.Int64,
AbCipDataType.ULInt => DriverDataType.UInt64,
AbCipDataType.Real => DriverDataType.Float32,
AbCipDataType.LReal => DriverDataType.Float64,
AbCipDataType.String => DriverDataType.String,
AbCipDataType.Dt => DriverDataType.Int32, // epoch-seconds DINT
AbCipDataType.Structure => DriverDataType.String, // placeholder until UDT PR 6 introduces a structured kind
AbCipDataType.Structure => DriverDataType.String, // placeholder until UDT introduces a structured kind
_ => DriverDataType.Int32,
};
}
@@ -5,9 +5,8 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip;
/// <summary>
/// Allen-Bradley CIP / EtherNet-IP driver for ControlLogix / CompactLogix / Micro800 /
/// GuardLogix families. Implements <see cref="IDriver"/> only for now — read/write/
/// subscribe/discover capabilities ship in subsequent PRs (3-8) and family-specific quirk
/// profiles ship in PRs 9-12.
/// GuardLogix families. Implements all read/write/subscribe/discover/probe/alarm
/// capabilities via the libplctag.NET wrapper.
/// </summary>
/// <remarks>
/// <para>Wire layer is libplctag 1.6.x (plan decision #11). Per-device host addresses use
@@ -17,13 +16,16 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip;
///
/// <para>Tier A per plan decisions #143-145 — in-process, shares server lifetime, no
/// sidecar. <see cref="ReinitializeAsync"/> is the Tier-B escape hatch for recovering
/// from native-heap growth that the CLR allocator can't see; it tears down every
/// <see cref="PlcTagHandle"/> and reconnects each device.</para>
/// from native-heap growth that the CLR allocator can't see; it tears down the
/// libplctag.NET <c>Tag</c> instances held in <c>DeviceState.Runtimes</c> and reconnects
/// each device. Native tag lifetime is owned by the libplctag.NET <c>Tag.Dispose()</c>
/// (called in <see cref="DeviceState.DisposeHandles"/>); the library's own finalizer
/// handles GC-collected tags.</para>
/// </remarks>
public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery, ISubscribable,
IHostConnectivityProbe, IPerCallHostResolver, IAlarmSource, IDisposable, IAsyncDisposable
{
private readonly AbCipDriverOptions _options;
private AbCipDriverOptions _options;
private readonly string _driverInstanceId;
private readonly IAbCipTagFactory _tagFactory;
private readonly IAbCipTagEnumeratorFactory _enumeratorFactory;
@@ -32,7 +34,7 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
private readonly PollGroupEngine _poll;
private readonly Dictionary<string, DeviceState> _devices = new(StringComparer.OrdinalIgnoreCase);
private readonly Dictionary<string, AbCipTagDefinition> _tagsByName = new(StringComparer.OrdinalIgnoreCase);
private readonly AbCipAlarmProjection _alarmProjection;
private AbCipAlarmProjection _alarmProjection;
private DriverHealth _health = new(DriverState.Unknown, null, null);
public event EventHandler<DataChangeEventArgs>? OnDataChange;
@@ -108,11 +110,32 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
public string DriverInstanceId => _driverInstanceId;
public string DriverType => "AbCip";
/// <summary>
/// Initialize the driver from its <c>DriverConfig</c> JSON. When
/// <paramref name="driverConfigJson"/> carries a real configuration (any device or tag),
/// it is parsed via <see cref="AbCipDriverFactoryExtensions.ParseOptions"/> and the
/// parsed options REPLACE the construction-time options — this is what makes
/// <see cref="ReinitializeAsync"/> pick up a changed config (new device, new tag,
/// changed timeout). A blank or empty-object JSON (<c>"{}"</c>) is treated as "no
/// override" so callers that constructed the driver with explicit options — chiefly
/// unit tests — keep those options. The driver's address-space + runtime state is then
/// built from the effective <see cref="_options"/>.
/// </summary>
public Task InitializeAsync(string driverConfigJson, CancellationToken cancellationToken)
{
_health = new DriverHealth(DriverState.Initializing, null, null);
try
{
if (!string.IsNullOrWhiteSpace(driverConfigJson))
{
var parsed = AbCipDriverFactoryExtensions.ParseOptions(_driverInstanceId, driverConfigJson);
if (parsed.Devices.Count > 0 || parsed.Tags.Count > 0)
{
_options = parsed;
_alarmProjection = new AbCipAlarmProjection(this, _options.AlarmPollInterval);
}
}
foreach (var device in _options.Devices)
{
var addr = AbCipHostAddress.TryParse(device.HostAddress)
@@ -123,7 +146,16 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
}
foreach (var tag in _options.Tags)
{
// Duplicate-key check: a collision means two configured tags have the same name.
// Fail fast at init time with a diagnostic rather than silently clobbering.
// (Driver.AbCip-005)
if (_tagsByName.TryGetValue(tag.Name, out var existingTag))
throw new InvalidOperationException(
$"AbCip tag name collision: '{tag.Name}' is declared more than once. " +
$"Existing entry DeviceHostAddress='{existingTag.DeviceHostAddress}', " +
$"TagPath='{existingTag.TagPath}'. Rename or remove the duplicate.");
_tagsByName[tag.Name] = tag;
if (tag.DataType == AbCipDataType.Structure && tag.Members is { Count: > 0 })
{
foreach (var member in tag.Members)
@@ -135,6 +167,14 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
DataType: member.DataType,
Writable: member.Writable,
WriteIdempotent: member.WriteIdempotent);
// Member fan-out duplicate check: a member-path collision means two
// configured structure tags produce the same member path, or a member
// name collides with an independently-declared tag.
if (_tagsByName.TryGetValue(memberTag.Name, out var existingMember))
throw new InvalidOperationException(
$"AbCip tag name collision: '{memberTag.Name}' is produced by both " +
$"'{tag.Name}.{member.Name}' (member fan-out) and an existing tag " +
$"'{existingMember.Name}'. Rename one of the configured tags to resolve.");
_tagsByName[memberTag.Name] = memberTag;
}
}
@@ -147,7 +187,9 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
{
state.ProbeCts = new CancellationTokenSource();
var ct = state.ProbeCts.Token;
_ = Task.Run(() => ProbeLoopAsync(state, ct), ct);
// Keep the loop Task so ShutdownAsync can await its clean exit before
// disposing the CTS / handles the loop is still using (Driver.AbCip-008).
state.ProbeTask = Task.Run(() => ProbeLoopAsync(state, ct), ct);
}
}
_health = new DriverHealth(DriverState.Healthy, DateTime.UtcNow, null);
@@ -166,15 +208,46 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
await InitializeAsync(driverConfigJson, cancellationToken).ConfigureAwait(false);
}
/// <summary>
/// Tear the driver down: stop the alarm projection + poll engine, then for each device
/// cancel its probe loop, <em>await the loop's clean exit</em>, and only then dispose
/// the probe CTS + runtime handles. Awaiting the probe Task before disposing closes the
/// race where a still-running loop touches a disposed CTS or a cleared runtime
/// dictionary (Driver.AbCip-008). Idempotent — safe to call twice (e.g. ShutdownAsync
/// from ReinitializeAsync followed by DisposeAsync).
/// </summary>
public async Task ShutdownAsync(CancellationToken cancellationToken)
{
await _alarmProjection.DisposeAsync().ConfigureAwait(false);
await _poll.DisposeAsync().ConfigureAwait(false);
// Phase 1: signal every probe loop to stop.
foreach (var state in _devices.Values)
{
try { state.ProbeCts?.Cancel(); } catch (ObjectDisposedException) { }
}
// Phase 2: wait for each probe loop to observe cancellation and exit. The loop never
// throws on cancellation (it catches OperationCanceledException internally), but guard
// anyway so one slow device can't wedge the whole shutdown.
foreach (var state in _devices.Values)
{
var probeTask = state.ProbeTask;
if (probeTask is null) continue;
try
{
await probeTask.WaitAsync(TimeSpan.FromSeconds(10), cancellationToken).ConfigureAwait(false);
}
catch (TimeoutException) { }
catch (OperationCanceledException) { }
}
// Phase 3: now the loops are gone, dispose the CTS + native handles with no live reader.
foreach (var state in _devices.Values)
{
try { state.ProbeCts?.Cancel(); } catch { }
state.ProbeCts?.Dispose();
state.ProbeCts = null;
state.ProbeTask = null;
state.DisposeHandles();
}
_devices.Clear();
@@ -316,7 +389,7 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
/// <summary>
/// Read each <c>fullReference</c> in order. Unknown tags surface as
/// <c>BadNodeIdUnknown</c>; libplctag-layer failures map through
/// <see cref="AbCipStatusMapper.MapLibplctagStatus"/>; any other exception becomes
/// <see cref="AbCipStatusMapper.MapLibplctagStatus(int)"/>; any other exception becomes
/// <c>BadCommunicationError</c>. The driver health surface is updated per-call so the
/// Admin UI sees a tight feedback loop between read failures + the driver's state.
/// </summary>
@@ -331,8 +404,12 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
// whole-UDT read + in-memory member decode; every other reference falls back to the
// per-tag path that's been here since PR 3. Planner is a pure function over the
// current tag map; BOOL/String/Structure members stay on the fallback path because
// declaration-only offsets can't place them under Logix alignment rules.
var plan = AbCipUdtReadPlanner.Build(fullReferences, _tagsByName);
// declaration-only offsets can't place them under Logix alignment rules. Whole-UDT
// grouping is itself gated behind EnableDeclarationOnlyUdtGrouping — Studio 5000 may
// reorder UDT members vs declaration order, so the fast path is opt-in only (see
// Driver.AbCip-003 / AbCipUdtMemberLayout remarks).
var plan = AbCipUdtReadPlanner.Build(
fullReferences, _tagsByName, _options.EnableDeclarationOnlyUdtGrouping);
foreach (var group in plan.Groups)
await ReadGroupAsync(group, results, now, cancellationToken).ConfigureAwait(false);
@@ -351,6 +428,15 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
results[fb.OriginalIndex] = new DataValueSnapshot(null, AbCipStatusMapper.BadNodeIdUnknown, null, now);
return;
}
// Driver.AbCip-005: a Structure tag whose Members are declared is a container —
// its bare name is readable via the whole-UDT grouping path (ReadGroupAsync), not the
// per-tag path. Reading it here returns BadNotSupported rather than Good/null so the
// caller knows to address individual member paths (e.g. "Motor.Speed").
if (def.DataType == AbCipDataType.Structure && def.Members is { Count: > 0 })
{
results[fb.OriginalIndex] = new DataValueSnapshot(null, AbCipStatusMapper.BadNotSupported, null, now);
return;
}
if (!_devices.TryGetValue(def.DeviceHostAddress, out var device))
{
results[fb.OriginalIndex] = new DataValueSnapshot(null, AbCipStatusMapper.BadNodeIdUnknown, null, now);
@@ -365,6 +451,11 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
var status = runtime.GetStatus();
if (status != 0)
{
// Evict the stale handle so the next call re-creates it (Driver.AbCip-010).
// A non-zero status can mean the controller dropped the connection or the tag
// handle became permanently invalid (e.g. after a PLC download). Evicting
// mirrors the probe loop's recreate-on-failure behaviour.
EvictRuntime(device, def.Name);
results[fb.OriginalIndex] = new DataValueSnapshot(null,
AbCipStatusMapper.MapLibplctagStatus(status), null, now);
_health = new DriverHealth(DriverState.Degraded, _health.LastSuccessfulRead,
@@ -384,6 +475,8 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
}
catch (Exception ex)
{
// Transport exception — evict so the next read creates a fresh handle.
EvictRuntime(device, def.Name);
results[fb.OriginalIndex] = new DataValueSnapshot(null,
AbCipStatusMapper.BadCommunicationError, null, now);
_health = new DriverHealth(DriverState.Degraded, _health.LastSuccessfulRead, ex.Message);
@@ -416,6 +509,7 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
var status = runtime.GetStatus();
if (status != 0)
{
EvictRuntime(device, parent.Name); // Driver.AbCip-010
var mapped = AbCipStatusMapper.MapLibplctagStatus(status);
StampGroupStatus(group, results, now, mapped);
_health = new DriverHealth(DriverState.Degraded, _health.LastSuccessfulRead,
@@ -436,6 +530,7 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
}
catch (Exception ex)
{
EvictRuntime(device, parent.Name); // Driver.AbCip-010
StampGroupStatus(group, results, now, AbCipStatusMapper.BadCommunicationError);
_health = new DriverHealth(DriverState.Degraded, _health.LastSuccessfulRead, ex.Message);
}
@@ -506,10 +601,16 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
await runtime.WriteAsync(cancellationToken).ConfigureAwait(false);
var status = runtime.GetStatus();
results[i] = new WriteResult(status == 0
? AbCipStatusMapper.Good
: AbCipStatusMapper.MapLibplctagStatus(status));
if (status == 0) _health = new DriverHealth(DriverState.Healthy, now, null);
if (status != 0)
{
EvictRuntime(device, def.Name); // Driver.AbCip-010
results[i] = new WriteResult(AbCipStatusMapper.MapLibplctagStatus(status));
}
else
{
results[i] = new WriteResult(AbCipStatusMapper.Good);
_health = new DriverHealth(DriverState.Healthy, now, null);
}
}
catch (OperationCanceledException)
{
@@ -517,11 +618,13 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
}
catch (NotSupportedException nse)
{
// Type/protocol error — not a transport fault; don't evict the handle.
results[i] = new WriteResult(AbCipStatusMapper.BadNotSupported);
_health = new DriverHealth(DriverState.Degraded, _health.LastSuccessfulRead, nse.Message);
}
catch (FormatException fe)
{
// Value conversion error — not a transport fault; don't evict.
results[i] = new WriteResult(AbCipStatusMapper.BadTypeMismatch);
_health = new DriverHealth(DriverState.Degraded, _health.LastSuccessfulRead, fe.Message);
}
@@ -537,6 +640,8 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
}
catch (Exception ex)
{
// Transport / wire error — evict so the next write creates a fresh handle.
EvictRuntime(device, def.Name); // Driver.AbCip-010
results[i] = new WriteResult(AbCipStatusMapper.BadCommunicationError);
_health = new DriverHealth(DriverState.Degraded, _health.LastSuccessfulRead, ex.Message);
}
@@ -609,8 +714,12 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
runtime.Dispose();
throw;
}
device.ParentRuntimes[parentTagName] = runtime;
return runtime;
// Two concurrent callers can both miss the cache + both initialize a runtime; only the
// first TryAdd wins. Dispose the loser so it doesn't leak a native tag handle.
if (device.ParentRuntimes.TryAdd(parentTagName, runtime))
return runtime;
runtime.Dispose();
return device.ParentRuntimes[parentTagName];
}
/// <summary>
@@ -643,8 +752,27 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
runtime.Dispose();
throw;
}
device.Runtimes[def.Name] = runtime;
return runtime;
// Two concurrent callers can both miss the cache + both initialize a runtime; only the
// first TryAdd wins. Dispose the loser so it doesn't leak a native tag handle.
if (device.Runtimes.TryAdd(def.Name, runtime))
return runtime;
runtime.Dispose();
return device.Runtimes[def.Name];
}
/// <summary>
/// Evict the runtime for <paramref name="tagName"/> from the device's cache and dispose
/// it so the next read/write call re-creates and re-initializes a fresh handle.
/// Called from <see cref="ReadSingleAsync"/>, <see cref="ReadGroupAsync"/>, and
/// <see cref="WriteAsync"/> after a non-zero libplctag status or transport exception —
/// mirroring the probe loop's recreate-on-failure behaviour (Driver.AbCip-010).
/// </summary>
private static void EvictRuntime(DeviceState device, string tagName)
{
if (device.Runtimes.TryRemove(tagName, out var stale))
{
try { stale.Dispose(); } catch { }
}
}
public DriverHealth GetHealth() => _health;
@@ -785,8 +913,10 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
/// <summary>
/// Per-device runtime state. Holds the parsed host address, family profile, and the
/// live <see cref="PlcTagHandle"/> cache keyed by tag path. PRs 3-8 populate + consume
/// this dict via libplctag.
/// live libplctag.NET <see cref="IAbCipTagRuntime"/> instances keyed by tag name.
/// Native tag lifetime is owned by the <c>Tag.Dispose()</c> inside each
/// <see cref="LibplctagTagRuntime"/>; libplctag.NET's own finalizer covers GC-collected
/// instances so no separate SafeHandle wrapper is needed here (Driver.AbCip-006).
/// </summary>
internal sealed class DeviceState(
AbCipHostAddress parsedAddress,
@@ -803,14 +933,23 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
public CancellationTokenSource? ProbeCts { get; set; }
public bool ProbeInitialized { get; set; }
public Dictionary<string, PlcTagHandle> TagHandles { get; } =
new(StringComparer.OrdinalIgnoreCase);
/// <summary>
/// The fire-and-forget probe loop's <see cref="Task"/>. Stored so
/// <see cref="AbCipDriver.ShutdownAsync"/> can await the loop's clean exit after
/// cancelling <see cref="ProbeCts"/> and BEFORE disposing the CTS or the runtime
/// handles — otherwise the still-running loop can touch a disposed CTS or a cleared
/// runtime dictionary (Driver.AbCip-008).
/// </summary>
public Task? ProbeTask { get; set; }
/// <summary>
/// Per-tag runtime handles owned by this device. One entry per configured tag is
/// created lazily on first read (see <see cref="AbCipDriver.EnsureTagRuntimeAsync"/>).
/// <see cref="System.Collections.Concurrent.ConcurrentDictionary{TKey,TValue}"/>
/// because <c>ReadAsync</c> is invoked concurrently by the server read path, every
/// polled subscription loop, and the alarm projection loop.
/// </summary>
public Dictionary<string, IAbCipTagRuntime> Runtimes { get; } =
public System.Collections.Concurrent.ConcurrentDictionary<string, IAbCipTagRuntime> Runtimes { get; } =
new(StringComparer.OrdinalIgnoreCase);
/// <summary>
@@ -819,7 +958,7 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
/// bit-selector tag name ("Motor.Flags.3") needs a distinct handle from the DINT
/// parent ("Motor.Flags") used to do the read + write.
/// </summary>
public Dictionary<string, IAbCipTagRuntime> ParentRuntimes { get; } =
public System.Collections.Concurrent.ConcurrentDictionary<string, IAbCipTagRuntime> ParentRuntimes { get; } =
new(StringComparer.OrdinalIgnoreCase);
private readonly System.Collections.Concurrent.ConcurrentDictionary<string, SemaphoreSlim> _rmwLocks = new();
@@ -829,8 +968,6 @@ public sealed class AbCipDriver : IDriver, IReadable, IWritable, ITagDiscovery,
public void DisposeHandles()
{
foreach (var h in TagHandles.Values) h.Dispose();
TagHandles.Clear();
foreach (var r in Runtimes.Values) r.Dispose();
Runtimes.Clear();
foreach (var r in ParentRuntimes.Values) r.Dispose();
@@ -21,6 +21,20 @@ public static class AbCipDriverFactoryExtensions
}
internal static AbCipDriver CreateInstance(string driverInstanceId, string driverConfigJson)
{
ArgumentException.ThrowIfNullOrWhiteSpace(driverInstanceId);
var options = ParseOptions(driverInstanceId, driverConfigJson);
return new AbCipDriver(options, driverInstanceId);
}
/// <summary>
/// Deserialise an AB CIP driver-config JSON document into <see cref="AbCipDriverOptions"/>.
/// Shared by <see cref="CreateInstance"/> (first construction) and
/// <see cref="AbCipDriver.InitializeAsync"/> / <see cref="AbCipDriver.ReinitializeAsync"/>
/// so a reinitialize with a changed config JSON (new device, new tag, changed timeout)
/// actually takes effect rather than being silently discarded.
/// </summary>
internal static AbCipDriverOptions ParseOptions(string driverInstanceId, string driverConfigJson)
{
ArgumentException.ThrowIfNullOrWhiteSpace(driverInstanceId);
ArgumentException.ThrowIfNullOrWhiteSpace(driverConfigJson);
@@ -29,7 +43,7 @@ public static class AbCipDriverFactoryExtensions
?? throw new InvalidOperationException(
$"AB CIP driver config for '{driverInstanceId}' deserialised to null");
var options = new AbCipDriverOptions
return new AbCipDriverOptions
{
Devices = dto.Devices is { Count: > 0 }
? [.. dto.Devices.Select(d => new AbCipDeviceOptions(
@@ -53,9 +67,8 @@ public static class AbCipDriverFactoryExtensions
EnableControllerBrowse = dto.EnableControllerBrowse ?? false,
EnableAlarmProjection = dto.EnableAlarmProjection ?? false,
AlarmPollInterval = TimeSpan.FromMilliseconds(dto.AlarmPollIntervalMs ?? 1_000),
EnableDeclarationOnlyUdtGrouping = dto.EnableDeclarationOnlyUdtGrouping ?? false,
};
return new AbCipDriver(options, driverInstanceId);
}
private static AbCipTagDefinition BuildTag(AbCipTagDto t, string driverInstanceId) =>
@@ -108,6 +121,7 @@ public static class AbCipDriverFactoryExtensions
public int? TimeoutMs { get; init; }
public bool? EnableControllerBrowse { get; init; }
public bool? EnableAlarmProjection { get; init; }
public bool? EnableDeclarationOnlyUdtGrouping { get; init; }
public int? AlarmPollIntervalMs { get; init; }
public List<AbCipDeviceDto>? Devices { get; init; }
public List<AbCipTagDto>? Tags { get; init; }
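Review note: `ParseOptions` above turns a DTO with nullable properties into an options object, applying defaults with `??`. The sketch below shows that pattern in isolation. It assumes `System.Text.Json` with case-insensitive property matching, which the diff does not confirm, and the `SketchDto`, `SketchOptions`, and `SketchOptionsParser` names are hypothetical.

```csharp
using System;
using System.Text.Json;

// Hypothetical DTO: nullable members distinguish "absent" from an explicit value.
public sealed class SketchDto
{
    public int? AlarmPollIntervalMs { get; init; }
    public bool? EnableDeclarationOnlyUdtGrouping { get; init; }
}

// Hypothetical options type with non-nullable, defaulted values.
public sealed record SketchOptions(TimeSpan AlarmPollInterval, bool EnableDeclarationOnlyUdtGrouping);

public static class SketchOptionsParser
{
    // Assumption: the config JSON may use camelCase property names.
    private static readonly JsonSerializerOptions JsonOptions =
        new() { PropertyNameCaseInsensitive = true };

    public static SketchOptions Parse(string json)
    {
        ArgumentException.ThrowIfNullOrWhiteSpace(json);
        var dto = JsonSerializer.Deserialize<SketchDto>(json, JsonOptions)
            ?? throw new InvalidOperationException("config deserialised to null");

        // Defaults mirror the ones visible in the diff: 1 000 ms and false.
        return new SketchOptions(
            TimeSpan.FromMilliseconds(dto.AlarmPollIntervalMs ?? 1_000),
            dto.EnableDeclarationOnlyUdtGrouping ?? false);
    }
}
```

Because parsing is a pure function of the JSON text, the same method can serve both first construction and a later reinitialize, which is the point of the refactor in this hunk.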
@@ -56,6 +56,20 @@ public sealed class AbCipDriverOptions
/// 1 second — matches typical SCADA alarm-refresh conventions.
/// </summary>
public TimeSpan AlarmPollInterval { get; init; } = TimeSpan.FromSeconds(1);
/// <summary>
/// Opt-in for the declaration-only whole-UDT read fast path. When <c>false</c> (the
/// default) a batch of UDT members is always read per-member, because the byte offsets
/// computed by <see cref="AbCipUdtMemberLayout"/> assume the controller lays members
/// out in declaration order — and the Studio 5000 compiler does NOT guarantee that
/// (it reorders for largest-first packing, BOOL host bytes, nested-struct padding).
/// Decoding at declaration-order offsets against a reordered controller layout yields
/// silently-plausible wrong numbers. Set <c>true</c> only when the operator has
/// hand-verified that every configured UDT's member declaration order matches the
/// controller's compiled layout; in that case whole-UDT grouping collapses N member
/// reads into one. The richer CIP Template Object path remains the long-term fix.
/// </summary>
public bool EnableDeclarationOnlyUdtGrouping { get; init; }
}
/// <summary>
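Review note: the `EnableDeclarationOnlyUdtGrouping` doc comment above warns that decoding at declaration-order offsets against a reordered layout yields plausible but wrong numbers. The sketch below makes that concrete with an invented two-member UDT. The layout, the offsets, and the `SketchUdtLayout` type are all hypothetical; they do not describe any real Studio 5000 output.

```csharp
using System;
using System.Buffers.Binary;

public static class SketchUdtLayout
{
    // Declared order: INT Count, then DINT Total.
    // Hypothetical compiled layout: DINT Total at offset 0, INT Count at offset 4.
    public static byte[] BuildControllerBuffer(short count, int total)
    {
        var buffer = new byte[8];
        BinaryPrimitives.WriteInt32LittleEndian(buffer.AsSpan(0, 4), total);
        BinaryPrimitives.WriteInt16LittleEndian(buffer.AsSpan(4, 2), count);
        return buffer;
    }

    // Decodes as if members sat in declaration order with natural alignment:
    // Count at offset 0, Total at offset 4.
    public static (short Count, int Total) DecodeDeclarationOrder(byte[] buffer) => (
        BinaryPrimitives.ReadInt16LittleEndian(buffer.AsSpan(0, 2)),
        BinaryPrimitives.ReadInt32LittleEndian(buffer.AsSpan(4, 4)));
}
```

With `Count = 7` and `Total = 100000` written in the reordered layout, the declaration-order decode returns `Count = -31072` and `Total = 7`. Neither value raises an error, which is why the option defaults to the per-member read path.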
@@ -1,3 +1,5 @@
using libplctag;
namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip;
/// <summary>
@@ -24,8 +26,10 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip;
/// writes during download / test-mode transitions).</item>
/// <item>0x16 object does not exist — <c>BadNodeIdUnknown</c>.</item>
/// <item>0x1E embedded service error — unwrap to the extended status when possible.</item>
/// <item>any libplctag <c>PLCTAG_STATUS_*</c> below zero — wrapped as
/// <c>BadCommunicationError</c> until fine-grained mapping lands (PR 3).</item>
/// <item>libplctag.NET <see cref="Status"/> errors — mapped per-member by
/// <see cref="MapLibplctagStatus(Status)"/>: timeout, not-found, not-allowed, and
/// out-of-bounds get their specific OPC UA codes; the remaining transport errors
/// fold into <c>BadCommunicationError</c>.</item>
/// </list>
/// </remarks>
public static class AbCipStatusMapper
@@ -58,22 +62,34 @@ public static class AbCipStatusMapper
};
/// <summary>
/// Map a libplctag return/status code (<c>PLCTAG_STATUS_*</c>) to an OPC UA StatusCode.
/// libplctag uses <c>0 = PLCTAG_STATUS_OK</c>, positive values for pending, negative
/// values for errors.
/// Map a libplctag return/status code to an OPC UA StatusCode. The integer passed here
/// is <c>(int)Tag.GetStatus()</c> — i.e. the underlying value of the libplctag.NET
/// <see cref="Status"/> enum, NOT a raw native <c>PLCTAG_ERR_*</c> constant. The wrapper
/// renumbers the native codes into a contiguous enum, so this method switches on the
/// <see cref="Status"/> members directly to stay correct if the wrapper renumbers again.
/// <see cref="Status.Ok"/> is success; <see cref="Status.Pending"/> is an in-flight
/// operation; every other (negative) member is an error.
/// </summary>
public static uint MapLibplctagStatus(int status)
public static uint MapLibplctagStatus(int status) => MapLibplctagStatus((Status)status);
/// <summary>
/// Map a libplctag.NET <see cref="Status"/> enum value to an OPC UA StatusCode. This is
/// the strongly-typed core of the mapper; the <c>int</c> overload exists only for the
/// <see cref="IAbCipTagRuntime.GetStatus"/> seam, which returns the boxed-as-int value.
/// </summary>
public static uint MapLibplctagStatus(Status status) => status switch
{
if (status == 0) return Good;
if (status > 0) return GoodMoreData; // PLCTAG_STATUS_PENDING
return status switch
{
-5 => BadTimeout, // PLCTAG_ERR_TIMEOUT
-7 => BadCommunicationError, // PLCTAG_ERR_BAD_CONNECTION
-14 => BadNodeIdUnknown, // PLCTAG_ERR_NOT_FOUND
-16 => BadNotWritable, // PLCTAG_ERR_NOT_ALLOWED / read-only tag
-17 => BadOutOfRange, // PLCTAG_ERR_OUT_OF_BOUNDS
_ => BadCommunicationError,
};
}
Status.Ok => Good,
Status.Pending => GoodMoreData,
Status.ErrorTimeout => BadTimeout,
Status.ErrorNotFound or Status.ErrorNoMatch or Status.ErrorBadDevice => BadNodeIdUnknown,
Status.ErrorNotAllowed => BadNotWritable,
Status.ErrorOutOfBounds or Status.ErrorTooLarge or Status.ErrorTooSmall => BadOutOfRange,
Status.ErrorUnsupported or Status.ErrorNotImplemented => BadNotSupported,
Status.ErrorBadConnection or Status.ErrorBadGateway or Status.ErrorBadReply
or Status.ErrorWinsock or Status.ErrorOpen or Status.ErrorClose
or Status.ErrorRead or Status.ErrorWrite or Status.ErrorRemoteErr
or Status.ErrorPartial or Status.ErrorAbort => BadCommunicationError,
_ => BadCommunicationError,
};
}
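Review note: the mapper change above replaces a switch on raw negative integers with a switch on enum members, keeping an `int` overload that casts through. The sketch below shows that shape with a stand-in enum. `SketchStatus` and its numeric values are invented for illustration and are not the libplctag.NET `Status` values; the `uint` constants are the standard OPC UA numeric status codes for the names shown.

```csharp
using System;

// Stand-in for the wrapper's status enum; names and numbers are illustrative only.
public enum SketchStatus
{
    Ok = 0,
    Pending = 1,
    ErrorTimeout = -1,
    ErrorNotFound = -2,
    ErrorNotAllowed = -3,
    ErrorBadConnection = -4,
}

public static class SketchStatusMapper
{
    // OPC UA status codes (numeric values from the OPC UA specification).
    public const uint Good = 0x00000000;
    public const uint GoodMoreData = 0x00A60000;
    public const uint BadCommunicationError = 0x80050000;
    public const uint BadTimeout = 0x800A0000;
    public const uint BadNodeIdUnknown = 0x80340000;
    public const uint BadNotWritable = 0x803B0000;

    // Strongly-typed core: switching on members survives enum renumbering.
    public static uint Map(SketchStatus status) => status switch
    {
        SketchStatus.Ok => Good,
        SketchStatus.Pending => GoodMoreData,
        SketchStatus.ErrorTimeout => BadTimeout,
        SketchStatus.ErrorNotFound => BadNodeIdUnknown,
        SketchStatus.ErrorNotAllowed => BadNotWritable,
        // Unknown or unlisted values fall back to a generic transport error.
        _ => BadCommunicationError,
    };

    // Thin int overload for callers that only have the underlying value.
    public static uint Map(int status) => Map((SketchStatus)status);
}
```

Casting an undefined integer such as `-99` to the enum is legal in C#, so the discard arm is what keeps unexpected codes from throwing; they surface as `BadCommunicationError` instead.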
@@ -8,17 +8,27 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip;
/// list that <see cref="AbCipDriver.ReadAsync"/> runs through its existing read path.
/// Pure function — the planner never touches the runtime + never reads the PLC.
/// </summary>
/// <remarks>
/// The grouped offsets come from <see cref="AbCipUdtMemberLayout"/>, which assumes the
/// controller lays members out in declaration order. Studio 5000 does not guarantee that,
/// so grouping is gated behind <see cref="AbCipDriverOptions.EnableDeclarationOnlyUdtGrouping"/>:
/// when grouping is disabled every member falls back to its own per-tag read.
/// </remarks>
public static class AbCipUdtReadPlanner
{
/// <summary>
/// Split <paramref name="requests"/> into whole-UDT groups + per-tag leftovers.
/// <paramref name="tagsByName"/> is the driver's <c>_tagsByName</c> map — both parent
/// UDT rows and their fanned-out member rows live there. Lookup is OrdinalIgnoreCase
/// to match the driver's dictionary semantics.
/// to match the driver's dictionary semantics. When
/// <paramref name="enableDeclarationOnlyGrouping"/> is <c>false</c> no groups are
/// formed — every reference goes to the per-tag fallback path so member decoding never
/// relies on declaration-order offsets that may not match the controller layout.
/// </summary>
public static AbCipUdtReadPlan Build(
IReadOnlyList<string> requests,
IReadOnlyDictionary<string, AbCipTagDefinition> tagsByName)
IReadOnlyDictionary<string, AbCipTagDefinition> tagsByName,
bool enableDeclarationOnlyGrouping = false)
{
ArgumentNullException.ThrowIfNull(requests);
ArgumentNullException.ThrowIfNull(tagsByName);
@@ -26,6 +36,13 @@ public static class AbCipUdtReadPlanner
var fallback = new List<AbCipUdtReadFallback>(requests.Count);
var byParent = new Dictionary<string, List<AbCipUdtReadMember>>(StringComparer.OrdinalIgnoreCase);
if (!enableDeclarationOnlyGrouping)
{
for (var i = 0; i < requests.Count; i++)
fallback.Add(new AbCipUdtReadFallback(i, requests[i]));
return new AbCipUdtReadPlan([], fallback);
}
for (var i = 0; i < requests.Count; i++)
{
var name = requests[i];
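Review note: the early-return added to `Build` above sends every request to the per-tag fallback path when grouping is disabled. The sketch below reproduces that gate on a simplified planner. The `SketchPlan` and `SketchFallback` types and the "text before the last dot is the parent" grouping rule are assumptions for illustration; the real planner resolves parents through `tagsByName`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified plan types; not the driver's AbCipUdtReadPlan.
public sealed record SketchFallback(int OriginalIndex, string Reference);
public sealed record SketchPlan(IReadOnlyList<string> Groups, IReadOnlyList<SketchFallback> Fallbacks);

public static class SketchPlanner
{
    public static SketchPlan Build(IReadOnlyList<string> requests, bool enableGrouping = false)
    {
        ArgumentNullException.ThrowIfNull(requests);
        var fallback = new List<SketchFallback>(requests.Count);

        // Gate: with grouping off, nothing is grouped and order is preserved.
        if (!enableGrouping)
        {
            for (var i = 0; i < requests.Count; i++)
                fallback.Add(new SketchFallback(i, requests[i]));
            return new SketchPlan(Array.Empty<string>(), fallback);
        }

        // Assumed rule: the text before the last '.' names the parent UDT.
        var byParent = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
        for (var i = 0; i < requests.Count; i++)
        {
            var dot = requests[i].LastIndexOf('.');
            if (dot <= 0) { fallback.Add(new SketchFallback(i, requests[i])); continue; }
            var parent = requests[i][..dot];
            if (!byParent.TryGetValue(parent, out var members))
                byParent[parent] = members = new List<int>();
            members.Add(i);
        }

        // Only parents with two or more requested members are worth one grouped read.
        var groups = new List<string>();
        foreach (var (parent, members) in byParent)
        {
            if (members.Count >= 2) groups.Add(parent);
            else fallback.Add(new SketchFallback(members[0], requests[members[0]]));
        }
        return new SketchPlan(groups, fallback);
    }
}
```

Defaulting the flag to `false` means existing callers that do not pass it get the safe per-member behaviour without any change on their side.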

Some files were not shown because too many files have changed in this diff