Compare commits

...

209 Commits

Author SHA1 Message Date
Joseph Doherty
5506b43ddc Doc refresh (task #204) — operational docs for multi-process multi-driver OtOpcUa
Five operational docs rewritten for v2 (multi-process, multi-driver, Config-DB authoritative):

- docs/Configuration.md — replaced appsettings-only story with the two-layer model.
  appsettings.json is bootstrap only (Node identity, Config DB connection string,
  transport security, LDAP bind, logging). Authoritative config (clusters, namespaces,
  UNS, equipment, tags, driver instances, ACLs, role grants, poll groups) lives in
  the Config DB accessed via OtOpcUaConfigDbContext and edited through the Admin UI
  draft/publish workflow. Added v1-to-v2 migration index so operators can locate where
  each old section moved. Cross-links to docs/v2/config-db-schema.md + docs/v2/admin-ui.md.

- docs/Redundancy.md — Phase 6.3 rewrite. Named every class under
  src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/: RedundancyCoordinator, RedundancyTopology,
  ApplyLeaseRegistry (publish fencing), PeerReachabilityTracker, RecoveryStateManager,
  ServiceLevelCalculator (pure function), RedundancyStatePublisher. Documented the
  full 11-band ServiceLevel matrix (Maintenance=0 through AuthoritativePrimary=255)
  from ServiceLevelCalculator.cs and the per-ClusterNode fields (RedundancyRole,
  ServiceLevelBase, ApplicationUri). Covered metrics
  (otopcua.redundancy.role_transition counter + primary/secondary/stale_count gauges
  on meter ZB.MOM.WW.OtOpcUa.Redundancy) and SignalR RoleChanged push from
  FleetStatusPoller to RedundancyTab.razor.

- docs/security.md — preserved the transport-security section (still accurate) and
  added Phase 6.2 authorization. Four concerns now documented in one place:
  (1) transport security profiles, (2) OPC UA auth via LdapUserAuthenticator
  (note: task spec called this LdapAuthenticationProvider — actual class name is
  LdapUserAuthenticator in Server/Security/), (3) data-plane authorization via
  NodeAcl + PermissionTrie + AuthorizationGate — additive-only model per decision
  #129, ClusterId → Namespace → UnsArea → UnsLine → Equipment → Tag hierarchy,
  NodePermissions bundle, PermissionProbeService in Admin for "probe this permission",
  (4) control-plane authorization via LdapGroupRoleMapping + AdminRole
  (ConfigViewer / ConfigEditor / FleetAdmin, CanEdit / CanPublish policies) —
  deliberately independent of data-plane ACLs per decision #150. Documented the
  OTOPCUA0001 Roslyn analyzer (UnwrappedCapabilityCallAnalyzer) as the compile-time
  guard ensuring every driver-capability async call is wrapped by CapabilityInvoker.

- docs/ServiceHosting.md — three-process rewrite: OtOpcUa Server (net10 x64,
  BackgroundService + AddWindowsService, hosts OPC UA endpoint + all non-Galaxy
  drivers), OtOpcUa Admin (net10 x64, Blazor Server + SignalR + /metrics via
  OpenTelemetry Prometheus exporter), OtOpcUa Galaxy.Host (.NET Framework 4.8 x86,
  NSSM-wrapped, env-variable driven, STA thread + MXAccess COM). Pipe ACL
  denies-Admins detail + non-elevated shell requirement captured from feedback memory.
  Divergence from CLAUDE.md: task spec said "TopShelf is still the service-installer
  wrapper per CLAUDE.md note" but no csproj in the repo references TopShelf — decision
  #30 replaced it with the generic host's AddWindowsService wrapper (per the doc
  comment on OpcUaServerService). Reflected the actual state + flagged this divergence
  here so someone can update CLAUDE.md separately.

- docs/StatusDashboard.md — replaced the full v1 reference (dashboard endpoints,
  health check rules, StatusData DTO, etc.) with a short "superseded by Admin UI"
  pointer that preserves git-blame continuity + avoids broken links from other docs
  that reference it.

Class references verified by reading:
  src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/{RedundancyCoordinator, ServiceLevelCalculator,
      ApplyLeaseRegistry, RedundancyStatePublisher}.cs
  src/ZB.MOM.WW.OtOpcUa.Core/Authorization/{PermissionTrie, PermissionTrieBuilder,
      PermissionTrieCache, TriePermissionEvaluator, AuthorizationGate}.cs
  src/ZB.MOM.WW.OtOpcUa.Server/Security/{AuthorizationGate, LdapUserAuthenticator}.cs
  src/ZB.MOM.WW.OtOpcUa.Admin/{Program.cs, Services/AdminRoles.cs,
      Services/RedundancyMetrics.cs, Hubs/FleetStatusPoller.cs}
  src/ZB.MOM.WW.OtOpcUa.Server/Program.cs + appsettings.json
  src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/{Program.cs, Ipc/PipeServer.cs}
  src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/{ClusterNode, NodeAcl,
      LdapGroupRoleMapping}.cs
  src/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:34:25 -04:00
Joseph Doherty
71339307fa Doc refresh (task #203) — driver docs split + drivers index + IHistoryProvider-aware HistoricalDataAccess
Restructure the driver-facing docs to match the OtOpcUa v2 multi-driver
reality (Galaxy, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, OPC UA Client
— 8 drivers total; Galaxy ships as three projects) and the capability-interface
architecture where every driver opts into IDriver + whichever of IReadable /
IWritable / ITagDiscovery / ISubscribable / IHostConnectivityProbe /
IPerCallHostResolver / IAlarmSource / IHistoryProvider / IRediscoverable it
supports. Doc scope follows the code: one-driver-specific docs scoped to that
driver, cross-driver concerns live once at the top level, per-driver specs
cross-link to docs/v2/driver-specs.md rather than duplicate.

What changed per file:

- docs/MxAccessBridge.md -> docs/drivers/Galaxy.md (git mv + rewrite): retitled
  "Galaxy Driver", reframed as one of seven drivers. Added Project Split table
  (Shared .NET Standard 2.0 / Host .NET 4.8 x86 / Proxy .NET 10) and Why
  Out-of-Process section citing both the MXAccess bitness constraint and Tier C
  stability isolation per docs/v2/plan.md section 4. Added IPC Transport
  section covering pipe naming, MessagePack framing, DACL that denies Admins,
  shared-secret handshake, heartbeat, and CallAsync<TReq,TResp> dispatch.
  Moved file paths from src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/* to
  src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/MxAccess/* and added the
  Shared + Proxy key-file tables. Added CapabilityInvoker + OTOPCUA0001
  analyzer callout. Cross-linked to drivers/README.md, Galaxy-Repository.md,
  HistoricalDataAccess.md.

- docs/GalaxyRepository.md -> docs/drivers/Galaxy-Repository.md (git mv +
  rewrite): retitled "Galaxy Repository — Tag Discovery for the Galaxy
  Driver", opened with a comparison table showing how every driver's
  ITagDiscovery source is different (AB CIP @tags walker, TwinCAT
  SymbolLoaderFactory, FOCAS CNC queries, OPC UA Client Session.Browse, etc).
  Repositioned GalaxyRepositoryService as the Galaxy driver's
  ITagDiscovery.DiscoverAsync implementation. Updated paths to
  Driver.Galaxy.Host/Backend/GalaxyRepository/*. Added IRediscoverable section
  covering the on-change-redeploy IPC path.

- docs/drivers/README.md (new): index with ground-truth driver table —
  project path, stability tier, wire library, capability-interface list, and
  one notable quirk per driver. Verified against the driver csproj files and
  class declarations on focas-pr3-remaining-capabilities (the most recent
  branch containing every driver). Galaxy gets its own dedicated docs; the
  other seven drivers cross-link to docs/v2/driver-specs.md. Lists the full
  Core.Abstractions capability surface, DriverTypeRegistry, CapabilityInvoker,
  and OTOPCUA0001 analyzer.

- docs/HistoricalDataAccess.md (rewrite): reframed around IHistoryProvider as
  a per-driver optional capability interface. Replaced v1 HistorianPluginLoader
  / AvevaHistorianPluginEntry plugin architecture with the v2 story —
  Historian.Aveva was merged into Driver.Galaxy.Host/Backend/Historian/ and
  IPC-forwarded through GalaxyProxyDriver. Documented all four IHistoryProvider
  methods (ReadRawAsync / ReadProcessedAsync / ReadAtTimeAsync /
  ReadEventsAsync), CapabilityInvoker wrapping with DriverCapability.HistoryRead,
  and the per-driver coverage matrix (Galaxy + OPC UA Client implement; the
  six protocol drivers don't and return BadHistoryOperationUnsupported). Kept
  the cluster-failover + health-counter + quality-mapping detail for the
  Galaxy Historian implementation. Flagged one gap: Proxy forwards all four
  history message kinds but the Host-side HistoryAggregateType -> AnalogSummary
  column mapping may surface GalaxyIpcException{Code="not-implemented"} on a
  given branch until the Phase 2 Galaxy out-of-process gate lands.

Driver list built against ground truth (src on focas-pr3-remaining-capabilities):
  Driver.Galaxy.{Shared,Host,Proxy}, Driver.Modbus, Driver.S7, Driver.AbCip,
  Driver.AbLegacy, Driver.TwinCAT, Driver.FOCAS, Driver.OpcUaClient.
Capability interface lists verified against each *Driver.cs class declaration.
Aveva Historian ported to Driver.Galaxy.Host/Backend/Historian/; no separate
Historian.Aveva assembly on v2 branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:33:53 -04:00
Joseph Doherty
985b7aba26 Doc refresh (task #202) — core architecture docs for multi-driver OtOpcUa
Rewrite seven core-architecture docs to match the shipped multi-driver platform.
The v1 single-driver LmxNodeManager framing is replaced with the Core +
capability-interface model — Galaxy is now one driver of seven, and each doc
points at the current class names + source paths.

What changed per file:
- OpcUaServer.md — OtOpcUaServer as StandardServer host; per-driver
  DriverNodeManager + CapabilityInvoker wiring; Config-DB-driven configuration
  (sp_PublishGeneration, DraftRevisionToken, Admin UI); Phase 6.2
  AuthorizationGate integration.
- AddressSpace.md — GenericDriverNodeManager.BuildAddressSpaceAsync walks
  ITagDiscovery.DiscoverAsync and streams DriverAttributeInfo through
  IAddressSpaceBuilder; CapturingBuilder registers alarm-condition sinks;
  per-driver NodeId schemes replace the fixed ns=1;s=ZB root.
- ReadWriteOperations.md — OnReadValue / OnWriteValue dispatch to
  IReadable.ReadAsync / IWritable.WriteAsync through CapabilityInvoker,
  honoring WriteIdempotentAttribute (#143); two-layer authorization
  (WriteAuthzPolicy + Phase 6.2 AuthorizationGate).
- Subscriptions.md — ISubscribable.SubscribeAsync/UnsubscribeAsync is the
  capability surface; STA-thread story is now Galaxy-specific (StaPump inside
  Driver.Galaxy.Host), other drivers are free-threaded.
- AlarmTracking.md — IAlarmSource is optional; AlarmSurfaceInvoker wraps
  Subscribe/Ack/Unsubscribe with fan-out by IPerCallHostResolver and the
  no-retry AlarmAcknowledge pipeline (#143); CapturingBuilder registers sinks
  at build time.
- DataTypeMapping.md — DriverDataType + SecurityClassification are the
  driver-agnostic enums; per-driver mappers (GalaxyProxyDriver inline,
  AbCipDataType, ModbusDriver, etc.); SecurityClassification is metadata only,
  ACL enforcement is at the server layer.
- IncrementalSync.md — IRediscoverable covers backend-change signals;
  sp_ComputeGenerationDiff + DiffViewer drive generation-level change
  detection; IDriver.ReinitializeAsync is the in-process recovery path.
2026-04-20 01:33:28 -04:00
Joseph Doherty
48970af416 Doc refresh (task #205) — requirements updated for multi-driver OtOpcUa three-process deploy
Per-file summary:

- docs/reqs/OpcUaServerReqs.md — rewritten driver-agnostic. OPC-001..OPC-013 re-scoped to multi-driver address-space composition + capability dispatch; OPC-014 AuthorizationGate + permission trie; OPC-015 dynamic ServiceLevel via RedundancyCoordinator; OPC-017 surgical generation-apply rebuild; OPC-012 capability dispatch via CapabilityInvoker (decision #143 idempotence-aware retry); OPC-013 per-host Polly isolation (decision #144); OPC-019 OpenTelemetry metrics. Transport-security profile matrix (OPC-010) + UserName/LDAP (OPC-011) preserved.

- docs/reqs/GalaxyRepositoryReqs.md — scope clarified as Galaxy-driver-only (not platform). GR-001..GR-004 tied to ITagDiscovery.DiscoverAsync + IRediscoverable; all SQL runs inside OtOpcUa.Galaxy.Host and streams to Proxy via named pipe. GR-008 capability wrapping via CapabilityInvoker added. Cross-links to docs/v2/driver-specs.md + docs/GalaxyRepository.md.

- docs/reqs/MxAccessClientReqs.md — scope clarified as Galaxy-Host-only. MXA-001..MXA-009 preserved (STA pump, register/unregister, subscription refcount, auto-reconnect, probe, COM cleanup, operation metrics, error translation). MXA-010 Proxy-side capability wrapping + MXA-011 pipe ACL + per-process shared secret (OTOPCUA_ALLOWED_SID / OTOPCUA_GALAXY_SECRET) added.

- docs/reqs/ServiceHostReqs.md — rewritten for three-process deployment. Shared section (SVC-SHARED-001/002) for Serilog + bootstrap-only appsettings. SRV-* for OtOpcUa.Server (net10 x64, Microsoft.Extensions.Hosting + AddWindowsService, in-process driver hosting, redundancy-node bootstrap). ADM-* for OtOpcUa.Admin (Blazor Server, cookie+LDAP auth, CanEdit/CanPublish policies, sole DB writer, Prometheus /metrics, audit logging). GHX-* for OtOpcUa.Galaxy.Host (TopShelf, net48 x86, named-pipe IPC bootstrap, STA backend lifecycle, crash handling tied to supervisor).

- docs/reqs/ClientRequirements.md — restructured as numbered, verifiable requirements. SHR-* for Client.Shared (single IOpcUaClientService, ConnectionSettings, failover, cross-platform certs, type-coercing write, UI-thread neutrality). CLI-001..CLI-011 cover connect/read/write/browse/subscribe/historyread/alarms/redundancy. UI-001..UI-008 cover connection panel, tree browser, each tab, connection-state reflection, cross-platform build. Reference design content (IOpcUaClientService shape, models, view-model map, mock layout) preserved.

- docs/reqs/StatusDashboardReqs.md — retired cleanly. Replaced with a pointer to docs/v2/admin-ui.md + HLR-015 / HLR-016 / HLR-017 / ADM-*. Mapping table shows each retired DASH-001..DASH-009 requirement's replacement (live cluster-node view via SignalR, Prometheus metrics, driver-instance detail views, etc.). Note that a formal AdminUiReqs.md can be written later if needed for cert compliance.

HighLevelReqs.md was already at the target shape (HLR-001..HLR-018 with Revision header noting retired HLR-009) as of commit f217636; verified identical and no additional edit required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:31:58 -04:00
Joseph Doherty
f217636467 Doc refresh (task #206) — Client.CLI + Client.UI brand flip + new top-level docs/README.md index. Client.CLI.md: replaced stale LmxOpcUa-OPC UA-server references with OtOpcUa throughout the overview + sample output + applicationUri examples (opc.tcp://localhost:4840/OtOpcUa, urn:localhost:OtOpcUa:instanceN); confirmed against src/ZB.MOM.WW.OtOpcUa.Server/Program.cs:69-71 which sets the live endpoint url + application uri to those exact values. Added a driver-agnostic note in the overview — the CLI is reachable against every shipped driver surface because the OPC UA endpoint abstracts them all. Kept the lmxopcua-cli executable name + the {LocalAppData}/LmxOpcUaClient/pki/ PKI folder name AS-IS because those are real filesystem-level residuals the code still uses (Program.cs SetExecutableName + OpcUaClientService.cs:428) — flipping them requires migration shims so existing dev boxes don't lose their trusted-cert store; added explicit doc text explaining the residual + why it persists so future readers aren't confused. Fixed the sample connect-output "Server: LmxOpcUa" to "Server: OtOpcUa Server" matching the live ApplicationName in OpcUaServerOptions.cs:39. Client.UI.md: replaced the 4 LmxOpcUa references — overview one-liner, status-bar mock (now reads "OtOpcUa Server" matching the server's reported ApplicationName), endpoint-url example, settings persistence path. Same residual-explanation note added under the LmxOpcUaClient/settings.json path pointing at the Client.Shared session-factory literal at OpcUaClientService.cs:428. docs/README.md is new — a top-level index distinguishing the two documentation tiers (current reference at docs/*.md vs implementation history + design notes at docs/v2/*.md). Every current-reference doc gets a one-line role description in a section table (Architecture + data-path / Drivers / Operational / Client tooling / Requirements) so a new reader picking up the repo finds their way in without having to grep file names. Cross-link calls out that load-bearing references from top-level docs (plan.md decisions, admin-ui.md, acl-design.md, config-db-schema.md, driver-specs.md, dev-environment.md, test-data-sources.md) live under v2/. Notes up front that the project was renamed LmxOpcUa → OtOpcUa and that any remaining LmxOpcUa-string in paths is a deliberate residual with a migration follow-up, so readers don't chase phantom bugs. Four parallel doc-refresh agents currently working on the rest of docs/*.md (task #202 core architecture, #203 driver docs split, #204 operational, #205 requirements) — those commits will land on separate worktree branches + get folded in together once complete; this index already lists the docs they'll produce (drivers/README.md, drivers/Galaxy.md, drivers/Galaxy-Repository.md) so the final merge just has the content showing up where the index already points.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:25:18 -04:00
5f26fff4f1 Merge pull request (#148) - Roslyn analyzer OTOPCUA0001 2026-04-20 00:54:39 -04:00
Joseph Doherty
5c0d3154c1 Roslyn analyzer — detect unwrapped driver-capability calls (OTOPCUA0001). Closes task #200. New netstandard2.0 analyzer project src/ZB.MOM.WW.OtOpcUa.Analyzers registered as an <Analyzer>-item ProjectReference from the Server csproj so the warning fires at every Server compile. First (and only so far) rule OTOPCUA0001 — "Driver capability call must be wrapped in CapabilityInvoker" — walks every InvocationOperation in the AST + trips when (a) the target method implements one of the seven guarded capability interfaces (IReadable / IWritable / ITagDiscovery / ISubscribable / IHostConnectivityProbe / IAlarmSource / IHistoryProvider) AND (b) the method's return type is Task, Task<T>, ValueTask, or ValueTask<T> — the async-wire-call constraint narrows the rule to the surfaces the Phase 6.1 pipeline actually wraps + sidesteps pure in-memory accessors like IHostConnectivityProbe.GetHostStatuses() which would trigger false positives AND (c) the call does NOT sit inside a lambda argument passed to CapabilityInvoker.ExecuteAsync / ExecuteWriteAsync / AlarmSurfaceInvoker.*. The wrapper detection walks up the syntax tree from the call site, finds any enclosing InvocationExpressionSyntax whose method's containing type is one of the wrapper classes, + verifies the call lives transitively inside that invocation's AnonymousFunctionExpressionSyntax argument — a sibling "result = await driver.ReadAsync(...)" followed by a separate invoker.ExecuteAsync(...) call does NOT satisfy the wrapping rule + the analyzer flags it (regression guard in the 5th test). Five xunit-v3 + Shouldly tests at tests/ZB.MOM.WW.OtOpcUa.Analyzers.Tests: direct ReadAsync in server namespace trips; wrapped ReadAsync inside CapabilityInvoker.ExecuteAsync lambda passes; direct WriteAsync trips; direct DiscoverAsync trips; sneaky pattern — read outside the lambda + ExecuteAsync with unrelated lambda nearby — still trips. Hand-rolled test harness compiles a stub-plus-user snippet via CSharpCompilation.WithAnalyzers + runs GetAnalyzerDiagnosticsAsync directly, deliberately avoiding Microsoft.CodeAnalysis.CSharp.Analyzer.Testing.XUnit because that package pins to xunit v2 + this repo is on xunit.v3 everywhere else. RS2008 release-tracking noise suppressed by adding AnalyzerReleases.Shipped.md + AnalyzerReleases.Unshipped.md as AdditionalFiles, which is the canonical Roslyn-analyzer hygiene path. Analyzer DLL referenced from Server.csproj via ProjectReference with OutputItemType=Analyzer + ReferenceOutputAssembly=false — the DLL ships as a compiler plugin, not a runtime dependency. Server build validates clean: the analyzer activates on every Server file but finds zero violations, which confirms the Phase 6.1 wrapping work done in prior PRs is complete + the analyzer is now the regression guard preventing the next new capability surface from being added raw. slnx updated with both the src + tests project entries. Full solution build clean, analyzer suite 5/5 passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:52:40 -04:00
74067e7d7e Merge pull request (#147) - OTel Prometheus exporter 2026-04-20 00:43:15 -04:00
Joseph Doherty
ef53553e9d OTel Prometheus exporter wiring — RedundancyMetrics meter now scraped at /metrics. Closes task #201. Picked Prometheus over OTLP per the earlier recommendation (pull-based means no OTel Collector deployment required for the common K8s/containers case; the endpoint ASP.NET-hosts inside the Admin app already, so one less moving part). Adds two NuGet refs to the Admin csproj: OpenTelemetry.Extensions.Hosting 1.15.2 (stable) + OpenTelemetry.Exporter.Prometheus.AspNetCore 1.15.2-beta.1 (the exporter has historically been beta-only; rest of the OTel ecosystem treats it as production-acceptable + it's what the upstream OTel docs themselves recommend for AspNetCore hosts). Program.cs gains a Metrics:Prometheus:Enabled toggle (defaults true; setting to false disables both the MeterProvider registration + the scrape endpoint entirely for locked-down deployments). When enabled, AddOpenTelemetry().WithMetrics() registers a MeterProvider that subscribes to the "ZB.MOM.WW.OtOpcUa.Redundancy" meter (the exact MeterName constant on RedundancyMetrics) + wires AddPrometheusExporter. MapPrometheusScrapingEndpoint() appends a /metrics handler producing the Prometheus text-format output; deliberately NOT authenticated because scrape jobs typically run on a trusted network + operators who need auth wrap the endpoint behind a reverse-proxy basic-auth gate per fleet-ops convention. appsettings.json declares the toggle with Enabled: true so the default deploy gets metrics automatically — turning off is the explicit action. Future meters (resilience tracker + host status + auth probe) just AddMeter("Name") alongside the existing call to start flowing through the same endpoint without more infrastructure. Admin project builds 0 errors; Admin.Tests 92/92 passing (unchanged — the OTel pipeline runs at request time, not test time). Still-pending work that was NOT part of #201's scope: an equivalent setup for the Server project (different MeterNames — the Polly pipeline builder's tracker + host-status publisher) + a metrics cheat-sheet in docs/observability.md documenting each meter's tag set + expected alerting thresholds. Those are natural follow-ups when fleet-ops starts building dashboards.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:41:16 -04:00
d1e50db304 Merge pull request (#146) - DiffViewer ACL section 2026-04-20 00:39:11 -04:00
Joseph Doherty
df0d7c2d84 DiffViewer ACL section — extend sp_ComputeGenerationDiff with NodeAcl rows. Closes the final slice of task #196 (draft-diff ACL section). The DiffViewer already rendered a placeholder "NodeAcl" card from the task #156 refactor; it stayed empty because the stored proc didn't emit NodeAcl rows. This PR lights the card up by adding a fifth UNION to the proc. Logical id for NodeAcl is the composite LdapGroup + ScopeKind + ScopeId triple — format "cn=group|Cluster|scope-id" or "cn=group|Cluster|(cluster)" when ScopeId is null (Cluster-wide rows). That shape means a permission-only change (same group + same scope, PermissionFlags shifted) appears as a single Modified row with the full triple as its identifier, whereas a scope move (same group, new ScopeId) correctly surfaces as Added + Removed of two different logical ids. CHECKSUM signature covers ClusterId + PermissionFlags + Notes so both operator-visible changes (permission bitmask) and audit-tier changes (notes) round-trip through the diff. New migration 20260420000001_ExtendComputeGenerationDiffWithNodeAcl.cs ships both Up (install V2 proc) + Down (restore the exact V1 proc text shipped in 20260417215224_StoredProcedures so the migration is reversible). Row-id column widens from nvarchar(64) to nvarchar(128) in V2 since the composite key (group DN + scope + scope-id) exceeds 64 chars comfortably — narrow column would silently truncate in prod. Designer .cs cloned from the prior migration since the EF model is unchanged; DiffViewer.razor section description updated to drop the "(proc-extension pending)" note it carried since task #156 — the card will now populate live. Admin + Core full-solution build clean. No unit-test changes needed — the existing StoredProceduresTests cover the proc-exec path + would immediately catch any SQL syntax regression on next SQL Server integration run. Task #196 fully closed now — Probe-this-permission (slice 1, PR 144), SignalR invalidation (slice 2, PR 145), draft-diff ACL section (this PR).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:37:05 -04:00
16f4b4acad Merge pull request (#145) - ACL + role-grant SignalR invalidation 2026-04-20 00:34:24 -04:00
Joseph Doherty
ac63c2cfb2 ACL + role-grant SignalR invalidation — #196 slice 2. Adds the live-push layer so an operator editing permissions in one Admin session sees the change in peer sessions without a manual reload. Covers both axes of task #196's invalidation requirement: cluster-scoped NodeAcl mutations push NodeAclChanged to that cluster's subscribers; fleet-wide LdapGroupRoleMapping CRUD pushes RoleGrantsChanged to every Admin session on the fleet group. New AclChangeNotifier service wraps IHubContext<FleetStatusHub> with two methods: NotifyNodeAclChangedAsync(clusterId, generationId) + NotifyRoleGrantsChangedAsync(). Both are fire-and-forget — a failed hub send logs a warning + returns; the authoritative DB write already committed, so worst-case peers see stale data until their next poll (AclsTab has no polling today; on-parameter-set reload + this signal covers the practical refresh cases). Catching OperationCanceledException separately so request-teardown doesn't log a false-positive hub-failure. NodeAclService constructor gains an optional AclChangeNotifier param (defaults to null so the existing unit tests that pass only a DbContext keep compiling). GrantAsync + RevokeAsync both emit NodeAclChanged after the SaveChanges completes — the Revoke path uses the loaded row's ClusterId + GenerationId for accurate routing since the caller passes only the surrogate rowId. RoleGrants.razor consumes the notifier after every Create + Delete + opens a fleet-scoped HubConnection on first render that reloads the grant list on RoleGrantsChanged. AclsTab.razor opens a cluster-scoped connection on first render and reloads only when the incoming NodeAclChanged message matches both the current ClusterId + GenerationId (so a peer editing a different draft doesn't trigger spurious reloads). Both pages IAsyncDisposable the connection on navigation away. AclChangeNotifier is DI-registered alongside PermissionProbeService. Two new message records in AclChangeNotifier.cs: NodeAclChangedMessage(ClusterId, GenerationId, ObservedAtUtc) + RoleGrantsChangedMessage(ObservedAtUtc). Admin.Tests 92/92 passing (unchanged — the notifier is fire-and-forget + tested at hub level in existing FleetStatusPoller suite). Admin builds 0 errors. One slice of #196 remains: the draft-diff ACL section (extend sp_ComputeGenerationDiff to emit NodeAcl rows + wire the DiffViewer NodeAcl card from the empty placeholder it currently shows). Next PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:32:28 -04:00
d93dc73978 Merge pull request (#144) - AclsTab Probe-this-permission 2026-04-20 00:30:15 -04:00
Joseph Doherty
ecc2389ca8 AclsTab Probe-this-permission — first of three #196 slices. New /clusters/{ClusterId}/draft/{GenerationId} ACLs-tab gains a probe card above the grant table so operators can ask the trie "if cn=X asks for permission Y on node Z, would it be granted, and which rows contributed?" without shell-ing into the DB. Service thinly wraps the same PermissionTrieBuilder + PermissionTrie.CollectMatches call path the Server's dispatch layer uses at request time, so a probe answer is by construction identical to what the live server would decide. New PermissionProbeService.ProbeAsync(generationId, ldapGroup, NodeScope, requiredFlags) — loads the target generation's NodeAcl rows filtered to the cluster (critical: without the cluster filter, cross-cluster grants leak into the probe which tested false-positive in the unit suite), builds a trie, CollectMatches against the supplied scope + [ldapGroup], ORs the matched-grant flags into Effective, compares to Required. Returns PermissionProbeResult(Granted, Required, Effective, Matches) — Matches carries LdapGroup + Scope + PermissionFlags per matched row so the UI can render the contribution chain. Zero side effects + no audit rows — a failing probe is a question, not a denial. AclsTab.razor gains the probe card at the top (before the New-grant form + grant table): six inputs for ldap group + every NodeScope level (NamespaceId → UnsAreaId → UnsLineId → EquipmentId → TagId — blank fields become null so the trie walks only as deep as the operator specified), a NodePermissions dropdown filtered to skip None, Probe button, green Granted / red Denied badge + Required/Effective bitmask display, and (when matches exist) a small table showing which LdapGroup matched at which level with which flags. Admin csproj adds ProjectReference to Core — the trie + NodeScope live there + were previously Server-only. Five new PermissionProbeServiceTests covering: cluster-level row grants a namespace-level read; no-group-match denies with empty Effective; matching group but insufficient flags (Browse+Read vs WriteOperate required) denies with correct Effective bitmask; cross-cluster grants stay isolated (c2's WriteOperate does NOT leak into c1's probe); generation isolation (gen1's Read-only does NOT let gen2's WriteOperate-requiring probe pass). Admin.Tests 92/92 passing (was 87, +5). Admin builds 0 errors. Remaining #196 slices — SignalR invalidation + draft-diff ACL section — ship in follow-up PRs so the review surface per PR stays tight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:28:17 -04:00
852c710013 Merge pull request (#143) - Pin ab_server to libplctag v2.6.16 2026-04-20 00:06:29 -04:00
Joseph Doherty
8ce5791f49 Pin libplctag ab_server to v2.6.16 — real release tag + SHA256 hashes for all three Windows arches. Closes the "pick a current version + pin" deferral left by the #180 PR docs stub. Verified the release lands ab_server.exe inside libplctag_2.6.16_windows_<arch>_tools.zip alongside plctag.dll + list_tags_* helpers by downloading each tools zip + unzip -l'ing to confirm ab_server.exe is present at 331264 bytes. New ci/ab-server.lock.json is the single source of truth — one file the CI YAML reads via ConvertFrom-Json instead of duplicating the hash across the workflow + the docs. Structure: repo (libplctag/libplctag) + tag (v2.6.16) + published date (2026-03-29) + assets keyed by platform (windows-x64 / windows-x86 / windows-arm64) each carrying filename + sha256. docs/v2/test-data-sources.md §2.CI updated — replaces the prior placeholder (ver = '<pinned libplctag release tag>', expected = '<pinned sha256>') with the real v2.6.16 + 9b78a3de... hashes pinned table, and replaces the hardcoded URL with a lockfile-driven pwsh step that picks windows-x64 by default but swaps to x86/arm64 by changing one line for non-x64 CI runners. Hash-mismatch path throws with both the expected + actual values so on the first drift the CI log tells the maintainer exactly what to update in the lockfile. Two verification notes from the release fetch: (1) libplctag v2.6.16 tools zips ship ab_server.exe + plctag.dll together — tests don't need a separate libplctag NuGet download for the integration path, the extracted tools dir covers both the simulator + the driver's native dependency; (2) the three Windows arches all carry ab_server.exe, so ARM64 Windows GitHub runners (when they arrive) can run the integration suite without changes beyond swapping the asset key. No code changes in this PR — purely docs + the new lockfile. Admin tests + Core tests unchanged + passing per the prior commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:04:35 -04:00
05ddea307b Merge pull request (#142) - ab_server per-family profiles 2026-04-19 23:59:20 -04:00
Joseph Doherty
32dff7f1d6 ab_server integration fixture — per-family profiles + documented CI-fetch contract. Closes task #180 (AB CIP follow-up — ab_server CI fixture). Replaces the prior hardcoded single-family fixture with a parametric AbServerProfile abstraction covering ControlLogix / CompactLogix / Micro800 / GuardLogix. Prebuilt-Windows-binary fetch is documented as a CI YAML step rather than fabricated C#-side, because SHA-pinned binary distribution is a CI workflow concern (libplctag owns releases, we pin a version + verify hash) not a test-framework concern. New AbServerProfile record + KnownProfiles static class at tests/.../AbServerProfile.cs. Four profiles: ControlLogix (widest coverage — DINT/REAL/BOOL/SINT/STRING atomic + DINT[16] array so the driver's @tags Symbol-Object decoder + array-bound path both get end-to-end coverage), CompactLogix (atomic subset — driver-side ConnectionSize quirk from PR 10 still applies since ab_server doesn't enforce the narrower limit), Micro800 (ab_server has no dedicated --plc micro800 mode — falls back to controllogix while driver-side path enforces empty routing + unconnected-only per PR 11; real Micro800 coverage requires a 2080 lab rig), GuardLogix (ab_server has no safety subsystem — profile emulates the _S-suffixed naming contract the driver's safety-ViewOnly classification reads in PR 12; real safety-lock behavior requires a 1756-L8xS physical rig). Each profile composes --plc + --tag args via BuildCliArgs(port) — pure string formatter so the composition logic is unit-testable without launching the simulator. AbServerFixture gains a ctor overload taking AbServerProfile + port (defaults back to ControlLogix on parameterless ctor so existing test suites keep compiling). Fixture's InitializeAsync hands the profile's CLI args to ProcessStartInfo.Arguments. New AbServerTheoryAttribute mirrors AbServerFactAttribute but extends TheoryAttribute so a single test can MemberData over KnownProfiles.All + cover all four families. AbCipReadSmokeTests converted from single-fact to theory parametrized over KnownProfiles.All — one row per family reads TestDINT + asserts Good status + Healthy driver state. Fixture lifecycle is explicit try/finally rather than await using because IAsyncLifetime.DisposeAsync returns ValueTask + xUnit's concrete IAsyncDisposable shim depends on xunit version; explicit beats implicit here. Eight new unit tests in AbServerProfileTests.cs (runs without the simulator so CI green even when the binary is absent): BuildCliArgs composes port + plc + tag flags in the documented order; empty seed-tag list still emits port + plc; SeedTag.ToCliSpec handles both 2-segment scalar + 3-segment array; KnownProfiles.ForFamily returns expected --plc arg for every family (verifies Micro800 + GuardLogix both fall back to controllogix); KnownProfiles.All covers every AbCipPlcFamily enum value (regression guard — adding a new family without a profile fails this test); ControlLogix seeds every atomic type the driver supports; GuardLogix seeds at least one _S-suffixed safety tag. Integration tests still skip cleanly when ab_server isn't on PATH. 11/11 unit tests passing in this project (8 new + 3 prior). Full Admin solution builds 0 errors. docs/v2/test-data-sources.md gets a new "CI fixture" subsection under §2.Gotchas with the exact GitHub Actions YAML step — fetch the pinned libplctag release, SHA256-verify against a pinned hash recorded in the repo's CI lockfile (drift = fail closed), extract, append to PATH. The C# harness stays PATH-driven so dev-box installs (cmake + make from source) work identically to CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:57:24 -04:00
42649ca7b0 Merge pull request (#141) - Redundancy OTel + SignalR 2026-04-19 23:18:04 -04:00
Joseph Doherty
1f3343e61f OpenTelemetry redundancy metrics + RoleChanged SignalR push. Closes instrumentation + live-push slices of task #198; the exporter wiring (OTLP vs Prometheus package decision) is split to new task #201 because the collector/scrape-endpoint choice is a fleet-ops decision that deserves its own PR rather than hardcoded here. New RedundancyMetrics class (Singleton-registered in DI) owning a System.Diagnostics.Metrics.Meter("ZB.MOM.WW.OtOpcUa.Redundancy", "1.0.0"). Three ObservableGauge instruments — otopcua.redundancy.primary_count / secondary_count / stale_count — all tagged by cluster.id, populated by SetClusterCounts(clusterId, primary, secondary, stale) which the poller calls at the tail of every tick; ObservableGauge callbacks snapshot the last value set under a lock so the reader (OTel collector, dotnet-counters) sees consistent tuples. One Counter — otopcua.redundancy.role_transition — tagged cluster.id, node.id, from_role, to_role; ideal for tracking "how often does Cluster-X failover" + "which node transitions most" aggregate queries. In-box Metrics API means zero NuGet dep here — the exporter PR adds OpenTelemetry.Extensions.Hosting + OpenTelemetry.Exporter.OpenTelemetryProtocol or OpenTelemetry.Exporter.Prometheus.AspNetCore to actually ship the data somewhere. FleetStatusPoller extended with role-change detection. Its PollOnceAsync now pulls ClusterNode rows alongside the existing ClusterNodeGenerationState scan, and a new PollRolesAsync walks every node comparing RedundancyRole to the _lastRole cache. On change: records the transition to RedundancyMetrics + emits a RoleChanged SignalR message to both FleetStatusHub.GroupName(cluster) + FleetStatusHub.FleetGroup so cluster-scoped + fleet-wide subscribers both see it. First observation per node is a bootstrap (cache fill) + NOT a transition — avoids spurious churn on service startup or pod restart. UpdateClusterGauges groups nodes by cluster + sets the three gauge values, using ClusterNodeService.StaleThreshold (shared 30s convention) for staleness so the /hosts page + the gauge agree. RoleChangedMessage record lives alongside NodeStateChangedMessage in FleetStatusPoller.cs. RedundancyTab.razor subscribes to the fleet-status hub on first parameters-set, filters RoleChanged events to the current cluster, reloads the node list + paints a blue info banner ("Role changed on node-a: Primary → Secondary at HH:mm:ss UTC") so operators see the transition without needing to poll-refresh the page. IAsyncDisposable closes the connection on tab swap-away. Two new RedundancyMetricsTests covering RecordRoleTransition tag emission (cluster.id + node.id + from_role + to_role all flow through the MeterListener callback) + ObservableGauge snapshot for two clusters (assert primary_count=1 for c1, stale_count=1 for c2). Existing FleetStatusPollerTests ctor-line updated to pass a RedundancyMetrics instance; all tests still pass. Full Admin.Tests suite 87/87 passing (was 85, +2). Admin project builds 0 errors. Task #201 captures the exporter-wiring follow-up — OpenTelemetry.Extensions.Hosting + OTLP vs Prometheus + /metrics endpoint decision, driven by fleet-ops infra direction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:16:09 -04:00
251f567b98 Merge pull request (#140) - AlarmSurfaceInvoker 2026-04-19 23:09:35 -04:00
Joseph Doherty
404bfbe7e4 AlarmSurfaceInvoker — wraps IAlarmSource.Subscribe/Unsubscribe/Acknowledge through CapabilityInvoker with multi-host fan-out. Closes alarm-surface slice of task #161 (Phase 6.1 Stream A); the Roslyn invoker-coverage analyzer is split into new task #200 because a DiagnosticAnalyzer project is genuinely its own scaffolding PR (Microsoft.CodeAnalysis.CSharp.Workspaces dep, netstandard2.0 target, Microsoft.CodeAnalysis.Testing harness, ProjectReference OutputItemType=Analyzer wiring, and four corner-case rules I want tests for before shipping). Ship this PR as the runtime guardrail + callable API; the analyzer lands next as the compile-time guardrail. New AlarmSurfaceInvoker class in Core.Resilience. Three methods mirror IAlarmSource's three mutating surfaces: SubscribeAsync (fan-out: group sourceNodeIds by IPerCallHostResolver.ResolveHost, one CapabilityInvoker.ExecuteAsync per host with DriverCapability.AlarmSubscribe so AlarmSubscribe's retry policy kicks in + returns one IAlarmSubscriptionHandle per host); UnsubscribeAsync (single-host, defaultHost); AcknowledgeAsync (fan-out: group AlarmAcknowledgeRequests by resolver-mapped host, run each host's batch through DriverCapability.AlarmAcknowledge which does NOT retry per decision #143 — alarm-ack is a write-shaped op that's not idempotent at the plant-floor level). Drivers without IPerCallHostResolver (Galaxy single MXAccess endpoint, OpcUaClient against one remote, etc.) fall back to defaultHost = DriverInstanceId so breaker + bulkhead keying still happens; drivers with it get one-dead-PLC-doesn't-poison-siblings isolation per decision #144. Single-host single-subscribe returns [handle] with length 1; empty sourceNodeIds fast-paths to [] without a driver call. Five new AlarmSurfaceInvokerTests covering: (a) empty list short-circuits — driver method never called; (b) single-host sub routes via default host — one driver call with full id list; (c) multi-host sub fans out to 2 distinct hosts for 3 src ids mapping to 2 plcs — one driver call per host; (d) Acknowledge does not retry on failure — call count stays at 1 even with exception; (e) Subscribe retries transient failures — call count reaches 3 with a 2-failures-then-success fake. Core.Tests resilience-builder suite 19/19 passing (was 14, +5); Core.Tests whole suite still green. Core project builds 0 errors. Task #200 captures the compile-time guardrail: Roslyn DiagnosticAnalyzer at src/ZB.MOM.WW.OtOpcUa.Analyzers that flags direct invocations of the eleven capability-interface methods inside the Server namespace when the call is NOT inside a CapabilityInvoker.ExecuteAsync/ExecuteWriteAsync/AlarmSurfaceInvoker.*Async lambda. That analyzer is the reason we keep paying the wrapping-class overhead for every new capability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:07:37 -04:00
006af636a0 Merge pull request (#139) - ExternalIdReservation merge in FinaliseBatch 2026-04-19 23:04:25 -04:00
Joseph Doherty
c0751fdda5 ExternalIdReservation merge inside FinaliseBatchAsync. Closes task #197. The FinaliseBatch docstring called this out as a narrower follow-up pending a concurrent-insert test matrix, and the CSV import UI PR (#163) noted that operators would see raw DbUpdate UNIQUE-constraint messages on ZTag/SAPID collision until this landed. Now every finalised-batch row reserves ZTag + SAPID in the same EF transaction as the Equipment inserts, so either both commit atomically or neither does. New MergeReservation helper handles the four outcomes per (Kind, Value) pair: (1) value empty/whitespace → skip the reservation entirely (operator left the optional identifier blank); (2) active reservation exists for same EquipmentUuid → bump LastPublishedAt + reuse (re-finalising a batch against the same equipment must be idempotent, e.g. a retry after a transient DB blip); (3) active reservation exists for a DIFFERENT EquipmentUuid → throw ExternalIdReservationConflictException with the conflicting UUID + originating cluster + first-published timestamp so operator sees exactly who owns the value + where to resolve it (release via sp_ReleaseExternalIdReservation or pick a new ZTag); (4) no active reservation → create a fresh row with FirstPublishedBy = batch.CreatedBy + FirstPublishedAt = transaction time. Pre-commit overlap scan uses one round-trip (WHERE Kind+Value IN the batch's distinct sets, filtered to ReleasedAt IS NULL so explicitly-released values can be re-issued per decision #124) + caches the results in a Dictionary keyed on (Kind, value.ToLowerInvariant()) for O(1) lookup during the row loop. Race-safety catch: if another finalise commits between our cache-load + our SaveChanges, SQL Server surfaces a 2601/2627 unique-index violation against UX_ExternalIdReservation_KindValue_Active — IsReservationUniquenessViolation walks the inner-exception chain for that specific signature + rethrows as ExternalIdReservationConflictException so the UI shows a clean message instead of a raw DbUpdateException. The index-name match means unrelated filtered-unique violations (future indices) don't get mis-classified. Test-fixture Row() helper updated to generate unique SAPID per row (sap-{ZTag}) — the prior shared SAPID="sap" worked only because reservations didn't exist; two rows sharing a SAPID under different EquipmentUuids now collide as intended by decision #124's fleet-wide uniqueness rule. Four new tests: (a) finalise creates both ZTag + SAPID reservations with expected Kind + Value; (b) re-finalising same EquipmentUuid's ZTag from a different batch does not create a duplicate (LastPublishedAt refresh only); (c) different EquipmentUuid claiming the same ZTag throws ExternalIdReservationConflictException with the ZTag value in the message + Equipment row for the second batch is NOT inserted (transaction rolled back cleanly); (d) row with empty ZTag + empty SAPID skips reservation entirely. Full Admin.Tests suite 85/85 passing (was 81 before this PR, +4). Admin project builds 0 errors. Note: the InMemory EF provider doesn't enforce filtered-unique indices, so the IsReservationUniquenessViolation catch is exercised only in the SQL Server integration path — the in-memory tests cover the cache-level conflict detection in MergeReservation instead, which is the first line of defence + catches the same-batch + published-vs-staged cases. The DbUpdate catch protects only the last-second race where two concurrent transactions both passed the cache check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 23:02:31 -04:00
80e080ecec Merge pull request (#138) - UnsTab drag/drop + 409 conflict modal 2026-04-19 22:32:45 -04:00
Joseph Doherty
5ee510dc1a UnsTab native HTML5 drag/drop + 409 concurrent-edit modal + optimistic-concurrency commit path. Closes UI slice of task #153 (Phase 6.4 Stream A UI follow-up). Playwright E2E smoke is split into new task #199 — Playwright install + WebApplicationFactory + seeded-DB harness is genuinely its own infra-setup PR. Native HTML5 attributes (draggable, @ondragstart, @ondragover, @ondragleave, @ondrop) deliberately over MudBlazor per the task title — no MudBlazor ever joins this project. Two new service methods on UnsService land the data layer the existing UnsImpactAnalyzer assumed but which didn't actually exist: (1) LoadSnapshotAsync(generationId) — walks UnsAreas + UnsLines + per-line equipment counts + builds a UnsTreeSnapshot including a 16-char SHA-256 revision token computed deterministically over the sorted (kind, id, parent, name, notes) tuple-set so it's stable across processes + changes whenever any row is added / modified / deleted; (2) MoveLineAsync(generationId, expectedToken, lineId, targetAreaId) — re-parents one line inside the same draft under an EF transaction, recomputes the current revision token from freshly-loaded rows, and throws DraftRevisionConflictException when the caller-supplied token no longer matches. Token mismatch means another operator mutated the draft between preview + commit + the move rolls back rather than clobbering their work. No-op same-area drop is a silent return. Cross-generation move is prevented by the generationId filter on the transaction reads. UnsTab.razor gains draggable="true" on every line row with @ondragstart capturing the LineId into _dragLineId, and every area row is a drop target (@ondragover with :preventDefault so the browser accepts drops, @ondrop kicking off OnLineDroppedAsync). Drop path loads a fresh snapshot, builds a UnsMoveOperation(Kind=LineMove, source/target cluster matching because cross-cluster is decision-#82 rejected), runs UnsImpactAnalyzer.Analyze + shows a Bootstrap modal rendered inline in the component — modal shows HumanReadableSummary + equipment/tag counts + any CascadeWarnings list. Confirm button calls MoveLineAsync with the snapshot's RevisionToken; DraftRevisionConflictException surfaces a separate red-header "Draft changed — refresh required" modal with a Reload button that re-fetches areas + lines from the DB. New DraftRevisionConflictException in UnsService.cs, co-located with the service that throws it. Five new UnsServiceMoveTests covering LoadSnapshotAsync (areas + lines + equipment counts), RevisionToken stability between two reads, RevisionToken changes on AddLineAsync, MoveLineAsync happy path reparents the line in the DB, MoveLineAsync with stale token throws DraftRevisionConflictException + leaves the DB unchanged. Admin suite 81/81 passing (was 76, +5). Admin project builds 0 errors. Task #199 captures the deferred Playwright E2E smoke — drag a line onto a different area in a real browser, assert preview modal contents, click Confirm, assert the line row shows the new area. That PR stands up a new tests/ZB.MOM.WW.OtOpcUa.Admin.E2ETests project with Playwright + WebApplicationFactory + seeded InMemory DbContext.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:30:48 -04:00
543665dedd Merge pull request (#137) - DiffViewer refactor 2026-04-19 22:25:24 -04:00
Joseph Doherty
c8a38bc57b DiffViewer refactor — 6-section plugin pattern + 1000-row cap. Closes task #156 (Phase 6.4 Stream C). Replaces the flat single-table rendering that mixed Namespace/DriverInstance/Equipment/Tag rows into one untyped list with a per-section-card layout that makes draft review actually scannable on non-trivial diffs. New DiffSection.razor reusable component encapsulates the per-section rendering — card header shows Title + Description + a three-badge summary (+added / −removed / ~modified plus a "no changes" grey badge when the section is empty) so operators can glance at a six-card page and see what areas of the draft actually shifted before drilling into any one table. Hard row-cap at DefaultRowCap=1000 per section lives inside the component so a pathological draft (e.g. 20k tags churned by a block rebuild) can't freeze the browser on render — excess rows are silently dropped with a yellow warning banner that surfaces "Showing the first 1000 of N rows" + a pointer to run sp_ComputeGenerationDiff directly for the full set. Body max-height: 400px + overflow-y: auto gives each section its own scroll region so one big section doesn't push the others off screen. DiffViewer.razor refactored to a static Sections table driving a single foreach that instantiates one DiffSection per known TableName. Sections listed in author-order (Namespace → DriverInstance → Equipment → Tag → UnsLine → NodeAcl) — six entries matching the task acceptance criterion. The first four correspond to what sp_ComputeGenerationDiff currently emits; the last two (UnsLine + NodeAcl) render as empty "no changes" cards today + will light up when the proc is extended (tracked in task #196 for NodeAcl; UnsLine proc extension is a natural follow-up since UnsImpactAnalyzer already tracks UNS moves). RowsFor(tableName) replaces the prior flat table — each section filters the overall DiffRow list by its TableName so the proc output format stays stable. Header-bar summary at the top of the page now reads "N rows across M of 6 sections" so operators see overall change weight at a glance before scanning. Two Razor-specific fixes landed along the way: loop variable renamed from section to sec because @section collides with the Razor section directive + trips RZ2005; helper method renamed from Group to RowsFor because the Razor generator gets confused by a parameter-flowing method whose name clashes with LINQ's Group extension (the source-gen output referenced TypeCheck<T> with no argument). Admin project builds 0 errors; Admin.Tests suite 76/76 (unchanged — the refactor is structural + no service-layer logic changed, so the existing DraftValidator + EquipmentService + AdminServicesIntegrationTests cover the consuming paths). No bUnit in this project so the cap behavior isn't unit-tested at the component level; DiffSection.OnParametersSet is small + deterministic (int counts + Take(RowCap)) + reviewed before ship.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:23:22 -04:00
cecb84fa5d Merge pull request (#136) - Admin RedundancyTab 2026-04-19 22:16:20 -04:00
Joseph Doherty
13d5a7968b Admin RedundancyTab — per-cluster read-only topology view. Closes the UI slice of task #149 (Phase 6.3 Stream E — Admin UI RedundancyTab + OpenTelemetry metrics + SignalR); the OpenTelemetry metrics + RoleChanged SignalR push are split into new follow-up task #198 because each is a structural add that deserves its own test matrix + NuGet-dep decision rather than riding this UI PR. New /clusters/{ClusterId} Redundancy tab slotted between ACLs and Audit in the existing ClusterDetail tab bar. Shows each ClusterNode row in the cluster with columns Node / Role (Primary green, Secondary blue, Standalone primary-blue badge) / Host / OPC UA port / ServiceLevel base / ApplicationUri (text-break so the long urn: doesn't blow out the table) / Enabled badge / Last seen (relative age via the same FormatAge helper as Hosts.razor, with a yellow "Stale" chip once LastSeenAt crosses the 30s threshold shared with HostStatusService.StaleThreshold — a missed heartbeat plus clock-skew buffer). Four summary cards above the table — total Nodes, Primary count, Secondary count, Stale count. Two guard-rail alerts: (a) red "No Primary or Standalone" when the cluster has no authoritative write target (all rows are Secondaries — read-only until one is promoted by the server-side RedundancyCoordinator apply-lease flow); (b) red "Split-brain" when >1 Primary exists — apply-lease enforcement at the coordinator level should have made this impossible, so the alert implies a hand-edited DB row + an investigation. New ClusterNodeService with ListByClusterAsync (ordered by ServiceLevelBase descending so Primary rows with higher base float to the top) + a static IsStale predicate matching HostStatusService's 30s convention. DI-registered alongside the existing scoped services in Program.cs. Writes (role swap, enable/disable) are deliberately absent from the service — they go through the RedundancyCoordinator apply-lease flow on the server side + direct DB mutation from Admin would race with it. New ClusterNodeServiceTests covering IsStale across null/recent/old LastSeenAt + ListByClusterAsync ordering + cluster filter. 4/4 new tests passing; full Admin.Tests suite 76/76 (was 72 before this PR, +4). Admin project builds 0 errors. Task #198 captures the deferred work: (1) OpenTelemetry Meter for primary/secondary/stale counts + role_transition counter with from/to/node tags + OTLP exporter config; (2) RoleChanged SignalR push — extend FleetStatusPoller to detect RedundancyRole changes on ClusterNode rows + emit a RoleChanged hub message so the RedundancyTab refreshes instantly instead of on-page-load polling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:14:25 -04:00
d1686ed82d Merge pull request (#135) - Equipment CSV import UI 2026-04-19 22:02:36 -04:00
Joseph Doherty
ac69a1c39d Equipment CSV import UI — Stream B.3/B.5 operator page + EquipmentTab "Import CSV" button. Closes the UI slice of task #163 (Phase 6.4 Stream B.3/B.5); the ExternalIdReservation merge follow-up inside FinaliseBatchAsync is split into new task #197 so it gets a proper concurrent-insert test matrix rather than riding this UI PR. New /clusters/{ClusterId}/draft/{GenerationId}/import-equipment page driving the full staged-import flow end-to-end. Operator selects a driver instance + UNS line (both scoped to the draft generation via DriverInstanceService.ListAsync + UnsService.ListLinesAsync dropdowns), pastes or uploads a CSV (InputFile with 5 MiB cap so pathological files can't OOM the server), clicks Parse — EquipmentCsvImporter.Parse runs + shows two side-by-side cards (accepted rows in green with ZTag/Machine/Name/Line columns, rejected rows in red with line-number + reason). Click Stage + Finalise and the page calls CreateBatchAsync → StageRowsAsync → FinaliseBatchAsync in sequence using the authenticated user's identity as CreatedBy; on success, 600ms banner then NavigateTo back to the draft editor so operator sees the newly-imported rows in EquipmentTab without a manual refresh. Parse errors (missing version marker, bad header, malformed CSV) surface InvalidCsvFormatException.Message inline alongside the Parse button — no page reload needed to retry. Finalise errors surface the service-layer exception message (ImportBatchNotFoundException / ImportBatchAlreadyFinalisedException / any DbUpdate* exception from the atomic transaction) so operator sees exactly why the finalise rejected before the tx rolled back. EquipmentTab gains an "Import CSV…" button next to "Add equipment" that NavigateTo's the new page; it needs a ClusterId parameter to build the URL so the @code block adds [Parameter] string ClusterId, and DraftEditor now passes ClusterId="@ClusterId" alongside the existing GenerationId. EquipmentImportBatchService was already implemented in Phase 6.4 Stream B.4 but missing from the Admin DI container — this PR adds AddScoped so the @inject resolves. The FinaliseBatch docstring explicitly defers ExternalIdReservation merge as a narrower follow-up with a concurrent-insert test matrix — task #197 captures that work. For now the finalise may surface a DB-level UNIQUE-constraint violation if a ZTag conflict exists at commit time; the UI shows the raw message + the batch + staged rows are still in the DB for re-use once the conflict is resolved. Admin project builds 0 errors; Admin.Tests 72/72 passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:00:40 -04:00
30714831fa Merge pull request (#134) - Admin RoleGrants page 2026-04-19 21:48:14 -04:00
Joseph Doherty
44d4448b37 Admin RoleGrants page — LDAP-group → Admin-role mapping CRUD. Closes the RoleGrantsTab slice of task #144 (Phase 6.2 Stream D follow-up); the remaining three sub-items (Probe-this-permission on AclsTab, SignalR invalidation on role/ACL changes, draft-diff ACL section) are split into new follow-up task #196 so each can ship independently. The permission-trie evaluator + ILdapGroupRoleMappingService already exist from Phase 6.2 Streams A + B — this PR adds the consuming UI + the DI registration that was missing. New /role-grants page at Components/Pages/RoleGrants.razor registered in MainLayout's sidebar next to Certificates. Lists every LdapGroupRoleMapping row with columns LDAP group / Role / Scope (Fleet-wide or Cluster:X) / Created / Notes / Revoke. Add-grant form takes LDAP group DN + AdminRole dropdown (ConfigViewer, ConfigEditor, FleetAdmin) + Fleet-wide checkbox + Cluster dropdown (disabled when Fleet-wide checked) + optional Notes. Service-layer invariants — IsSystemWide=true + ClusterId=null, or IsSystemWide=false + ClusterId populated — enforced in ValidateInvariants; UI catches InvalidLdapGroupRoleMappingException and displays the message in a red alert. ILdapGroupRoleMappingService was present in the Configuration project from Stream A but never registered in the Admin DI container — this PR adds the AddScoped registration so the injection can resolve. Control-plane/data-plane separation note rendered in an info banner at the top of the page per decision #150 (these grants do NOT govern OPC UA data-path authorization; NodeAcl rows are read directly by the permission-trie evaluator without consulting role mappings). Admin project builds 0 errors; Admin.Tests 72/72 passing. Task #196 created to track: (1) AclsTab Probe-this-permission form that takes (ldap group, node path, permission flag) and runs it through the permission trie, showing which row granted it + the actual resolved grant; (2) SignalR invalidation — push a RoleGrantsChanged event when rows are created/deleted so connected Admin sessions reload without polling, ditto NodeAclChanged on ACL writes; (3) DiffViewer ACL section — show NodeAcl + LdapGroupRoleMapping deltas between draft + published alongside equipment/uns diffs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:46:21 -04:00
572f8887e4 Merge pull request (#133) - IdentificationFields editor + edit mode 2026-04-19 21:43:18 -04:00
Joseph Doherty
2acea08ced Admin Equipment editor — IdentificationFields component + edit mode + three missing OPC 40010 fields. Closes the UI-editor slice of task #159 (Phase 6.4 Stream D remaining); the DriverNodeManager wire-in + ACL integration test are split into a new follow-up task #195 because they're blocked on a prerequisite that hasn't shipped — the DriverNodeManager does not currently materialize Equipment nodes at all (NodeScopeResolver has an explicit "A future resolver will..." TODO in its decomposition docstring). Shipping the IdentificationFolderBuilder call before the parent walker exists would wire a call that no code path hits, so the wire-in is deferred until the Equipment node walker lands first. New IdentificationFields.razor reusable component renders the 9-field decision #139 grid in a Bootstrap 3-column layout — Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction (InputNumber), AssetLocation, ManufacturerUri (placeholder https://…), DeviceManualUri (placeholder https://…). Takes a required Equipment parameter + 2-way binds every field; no state of its own. Three fields that were missing from the old inline form — AssetLocation, ManufacturerUri, DeviceManualUri — now present, matching IdentificationFolderBuilder.FieldNames exactly. EquipmentTab.razor refactored to consume the component in both create + edit flows. Each table row gains an Edit button next to Remove. StartEdit clones the row into _draft so Cancel doesn't mutate the displayed list row with in-flight edits; on Save, UpdateAsync persists through EquipmentService's existing update path which already handles all 9 Identification fields. SaveAsync branches on _editMode — create still derives EquipmentId from a fresh Uuid via DraftValidator per decision #125, edit keeps the original EquipmentId + EquipmentUuid (immutable once set). FormName renamed equipment-form (was new-equipment) to work for both flows. Admin project builds 0 errors; Admin.Tests 72/72 passing. No new tests shipped — this PR is strictly a Razor-component refactor + two new bound fields + an Edit branch; the existing EquipmentService tests cover both the create + update paths. Task #195 created to track the blocked server-side work: call IdentificationFolderBuilder.Build from DriverNodeManager once the Equipment walker exists, plus an integration test browsing Equipment/Identification as an unauthorized user asserting BadUserAccessDenied per the builder's cross-reference note in docs/v2/acl-design.md §Identification.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:41:13 -04:00
49f6c9484e Merge pull request (#132) - Admin /hosts red-badge + Polly telemetry observer 2026-04-19 21:38:11 -04:00
Joseph Doherty
d06cc01a48 Admin /hosts red-badge + resilience columns + Polly telemetry observer. Closes task #164 (the remaining slice of Phase 6.1 Stream E.3 after the earlier publisher + hub PR). Three cooperating pieces wired together so the operator-facing /hosts table actually reflects the live Polly counters that the pipeline builder is producing. DriverResiliencePipelineBuilder gains an optional DriverResilienceStatusTracker ctor param — when non-null, every built pipeline wires Polly's OnRetry/OnOpened/OnClosed strategy-options callbacks into the tracker. OnRetry → tracker.RecordFailure (so ConsecutiveFailures climbs per retry), OnOpened → tracker.RecordBreakerOpen (stamps LastCircuitBreakerOpenUtc), OnClosed → tracker.RecordSuccess (resets the failure counter once the target recovers). Absent tracker = silent, preserving the unit-test constructor path + any deployment that doesn't care about resilience observability. Cancellation stays excluded from the failure count via the existing ShouldHandle predicate. HostStatusService.HostStatusRow extends with four new fields — ConsecutiveFailures, LastCircuitBreakerOpenUtc, CurrentBulkheadDepth, LastRecycleUtc — populated via a second LEFT JOIN onto DriverInstanceResilienceStatuses keyed on (DriverInstanceId, HostName). LEFT JOIN because brand-new hosts haven't been sampled yet; a missing row means zero failures + never-opened breaker, which is the correct default. New FailureFlagThreshold constant (=3, matches plan decision #143's conservative half-of-breaker convention) + IsFlagged predicate so the UI can pre-warn before the breaker actually trips. Hosts.razor paints three new columns between State and Last-transition — Fail# (bold red when flagged), In-flight (bulkhead-depth proxy), Breaker-opened (relative age). Per-row "Flagged" red badge alongside State when IsFlagged is true. Above the first cluster table, a red alert banner summarises the flagged-host count when ≥1 host is flagged, so operators see the problem before scanning rows. Three new tests in DriverResiliencePipelineBuilderTests — Tracker_RecordsFailure_OnEveryRetry verifies ConsecutiveFailures reaches RetryCount after a transient-forever operation, Tracker_StampsBreakerOpen_WhenBreakerTrips verifies LastBreakerOpenUtc is set after threshold failures on a Write pipeline, Tracker_IsolatesCounters_PerHost verifies one dead host does not leak failure counts into a healthy sibling. Full suite — Core.Tests 14/14 resilience-builder tests passing (11 existing + 3 new), Admin.Tests 72/72 passing, Admin project builds 0 errors. SignalR live push of status changes + browser visual review are deliberately left to a follow-up — this PR keeps the structural change minimal (polling refresh already exists in the page's 10s timer; SignalR would be a structural add that touches hub registration + client subscription).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:35:54 -04:00
5536e96b46 Merge pull request (#131) - AbCip UDT Template reader 2026-04-19 21:23:34 -04:00
Joseph Doherty
ece530d133 AB CIP UDT Template Object shape reader. Closes the shape-reader half of task #179. CipTemplateObjectDecoder (pure-managed) parses the Read Template blob per Rockwell CIP Vol 1 + libplctag ab/cip.c handle_read_template_reply — 12-byte header (u16 member_count + u16 struct_handle + u32 instance_size + u32 member_def_size) followed by memberCount × 8-byte member blocks (u16 info with bit-15 struct flag + lower-12-bit type code matching the Symbol Object encoding, u16 array_size, u32 struct_offset) followed by semicolon-terminated strings (UDT name first, then one per member). ParseSemicolonTerminatedStrings handles the observed firmware variations — name;\0 vs name; delimiters, optional null/space padding after the semicolon, trailing-name-without-semicolon corner case. Struct-flag members decode as AbCipDataType.Structure; unknown atomic codes fall back to Structure so the shape remains valid even with unrecognised members. Zero member count + short buffer both return null; missing member names yield <member_N> placeholders. IAbCipTemplateReader + IAbCipTemplateReaderFactory abstraction — one call per template instance id returning the raw blob. LibplctagTemplateReader is the production implementation creating a libplctag Tag with name @udt/{templateId} + handing the buffer to the decoder. AbCipDriver ctor gains optional templateReaderFactory parameter (defaults to LibplctagTemplateReaderFactory) + new internal FetchUdtShapeAsync that — checks AbCipTemplateCache first, misses call the reader + decode + cache, template-read exceptions + decode failures return null so callers can fall back to declaration-driven fan-out without the whole discovery blowing up. OperationCanceledException rethrows for shutdown propagation. Unknown device host returns null without attempting a fetch. FlushOptionalCachesAsync empties the cache so a subsequent fetch re-reads. 16 new decoder tests — simple two-member UDT, struct-member flag → Structure, array member ArrayLength, 6-member mixed-type with correct offsets, unknown type code → Structure, zero member count → null, short buffer → null, missing member name → placeholder, ParseSemicolonTerminatedStrings theory across 5 shapes. 6 new AbCipFetchUdtShapeTests exercising the driver integration via reflection (method is internal) — happy-path decode + cache, different template ids get separate fetches, unknown device → null without reader creation, decode failure returns null + doesn't cache (next call retries), reader exception returns null, FlushOptionalCachesAsync clears the cache. Total AbCip unit tests now 211/211 passing (+19 from the @tags merge's 192); full solution builds 0 errors; other drivers untouched. Whole-UDT read optimization (single libplctag call returning the packed buffer + client-side member decode using the template offsets) is left as a follow-up — requires rethinking the per-tag read path + careful hardware validation; current per-member fan-out still works correctly, just with N round-trips instead of 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:21:42 -04:00
b55cef5f8b Merge pull request (#130) - AbCip @tags walker 2026-04-19 21:15:16 -04:00
Joseph Doherty
088c4817fe AB CIP @tags walker — CIP Symbol Object decoder + LibplctagTagEnumerator. Closes task #178. CipSymbolObjectDecoder (pure-managed, no libplctag dep) parses the raw Symbol Object (class 0x6B) blob returned by reading the @tags pseudo-tag into an enumerable sequence of AbCipDiscoveredTag records. Entry layout per Rockwell CIP Vol 1 + Logix 5000 CIP Programming Manual 1756-PM019, cross-checked against libplctag's ab/cip.c handle_listed_tags_reply — u32 instance-id + u16 symbol-type + u16 element-length + 3×u32 array-dims + u16 name-length + name[len] + even-pad. Symbol-type lower 12 bits carry the CIP type code (0xC1 BOOL, 0xC2 SINT, …, 0xD0 STRING), bit 12 is the system-tag flag, bit 15 is the struct flag (when set lower 12 bits become the template instance id). Truncated tails stop decoding gracefully — caller keeps whatever parsed cleanly rather than getting an exception mid-walk. Program:-scope names (Program:MainProgram.StepIndex) are split via SplitProgramScope so the enumerator surfaces scope + simple name separately. 12 atomic type codes mapped (BOOL/SINT/INT/DINT/LINT/USINT/UINT/UDINT/ULINT/REAL/LREAL/STRING + DT/DATE_AND_TIME under Dt); unknown codes return null so the caller treats them as opaque Structure. LibplctagTagEnumerator is the real production walker — creates a libplctag Tag with name=@tags against the device's gateway/port/path, InitializeAsync + ReadAsync + GetBuffer, hands bytes to the decoder. Factory LibplctagTagEnumeratorFactory replaces EmptyAbCipTagEnumeratorFactory as the AbCipDriver default. AbCipDriverOptions gains EnableControllerBrowse (default false) matching the TwinCAT pattern — keeps the strict-config path for deployments where only declared tags should appear. When true, DiscoverAsync walks each device's @tags + emits surviving symbols under Discovered/ sub-folder. System-tag filter (AbCipSystemTagFilter shipped in PR 5) runs alongside the wire-layer system-flag hint. Tests — 18 new CipSymbolObjectDecoderTests with crafted byte arrays matching the documented layout — single-entry DInt, theory across 12 atomic type codes, unknown→null, struct flag override, system flag surface, Program:-scope split, multi-entry wire-order with even-pad, truncated-buffer graceful stop, empty buffer, SplitProgramScope theory across 6 shapes. 4 pre-existing AbCipDriverDiscoveryTests that tested controller-enumeration behavior updated with EnableControllerBrowse=true so they continue exercising the walker path (behavior unchanged from their perspective). Total AbCip unit tests now 192/192 passing (+26 from the RMW merge's 166); full solution builds 0 errors; other drivers untouched. Field validation note — the decoder layout matches published Rockwell docs + libplctag C source, but actual @tags responses vary slightly by controller firmware (some ship an older entry format with u16 array dims instead of u32). Any layout drift surfaces as gibberish names in the Discovered/ folder; field testing will flag that for a decoder patch if it occurs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:13:20 -04:00
91e6153b5d Merge pull request (#129) - Bit RMW pass 2 (AbCip+AbLegacy) 2026-04-19 20:36:21 -04:00
Joseph Doherty
00a428c444 RMW pass 2 — AbCip BOOL-within-DINT + AbLegacy bit-within-word. Closes task #181. AbCip — AbCipDriver.WriteAsync now detects BOOL writes with a bit index + routes them through WriteBitInDIntAsync: strip the .N suffix to form the parent DINT tag path (via AbCipTagPath with BitIndex=null + ToLibplctagName), get/create a cached parent IAbCipTagRuntime via EnsureParentRuntimeAsync (distinct from the bit-selector tag runtime so read + write target the DINT directly), acquire a per-parent-name SemaphoreSlim, Read → Convert.ToInt32 the current DINT → (current | 1<<bit) or (current & ~(1<<bit)) → Write via EncodeValue(DInt, updated). Per-parent lock prevents concurrent writers to the same DINT from losing updates — parallels Modbus + FOCAS pass 1. DeviceState gains ParentRuntimes dict + GetRmwLock helper + _rmwLocks ConcurrentDictionary. DisposeHandles now walks ParentRuntimes too. LibplctagTagRuntime.EncodeValue's BOOL-with-bitIndex branch stays as a defensive throw (message updated to point at the new driver-level dispatch) so an accidental bypass fails loudly rather than silently clobbering the whole DINT. AbLegacy — identical pattern for PCCC N-file bit writes. AbLegacyDriver.WriteAsync detects Bit with bitIndex + PMC letter not in {B, I, O} (B-file + I/O use their own bit-addressable semantics so don't RMW at N-file word level), routes through WriteBitInWordAsync which uses Int16 for the parent word, creates + caches a parent runtime with the suffix-stripped N7:0 address, acquires per-parent lock, RMW. DeviceState extended the same way as AbCip (ParentRuntimes + GetRmwLock). LibplctagLegacyTagRuntime.EncodeValue Bit-with-bitIndex branch points at the driver dispatch. Tests — 5 new AbCipBoolInDIntRmwTests (bit set ORs + preserves, bit clear ANDs + preserves, 8-way concurrent writes to same parent compose to 0xFF, different-parent writes get separate runtimes, repeat bit writes reuse the parent runtime init-count 1 + write-count 2), 4 new AbLegacyBitRmwTests (bit set preserves, bit clear preserves 0xFFF7, 8-way concurrent 0xFF, repeat writes reuse parent). Two pre-existing tests flipped — AbCipDriverWriteTests.Bit_in_dint_write_returns_BadNotSupported + AbLegacyReadWriteTests.Bit_within_word_write_rejected_as_BadNotSupported both now assert Good instead of BadNotSupported, renamed to _now_succeeds_via_RMW. Total tests — AbCip 166/166, AbLegacy 96/96, full solution builds 0 errors; Modbus + FOCAS + TwinCAT + other drivers untouched. Task #181 done across all four libplctag-backed + non-libplctag drivers (Modbus BitInRegister + AbCip BOOL-in-DINT + AbLegacy N-file bit + FOCAS PMC Bit — all with per-parent-word serialisation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 20:34:29 -04:00
07fd105ffc Merge pull request (#128) - Bit RMW pass 1 (Modbus+FOCAS) 2026-04-19 20:27:17 -04:00
Joseph Doherty
8c309aebf3 RMW pass 1 — Modbus BitInRegister + FOCAS PMC Bit write paths. First half of task #181 — the two drivers where read-modify-write is a clean protocol-level insertion (Modbus FC03/FC06 round-trip + FOCAS pmc_rdpmcrng / pmc_wrpmcrng round-trip). Per-driver SemaphoreSlim registry keyed on the parent word address serialises concurrent bit writes so two writers targeting different bits in the same word don't lose one another's update. Modbus — ModbusDriver gains WriteBitInRegisterAsync + _rmwLocks ConcurrentDictionary. WriteOneAsync routes BitInRegister (HoldingRegisters region only) through RMW ahead of the normal encode path. Read uses FC03 Read Holding Registers for 1 register at tag.Address, bit-op on the returned ushort via (current | 1<<bit) for set / (current & ~(1<<bit)) for clear, write back via FC06 Write Single Register. Per-address lock prevents concurrent bit writes to the same register from racing. Rejects out-of-range bits (0-15) with InvalidOperationException. EncodeRegister's BitInRegister branch repurposed as a defensive guard — if a non-RMW caller ever reaches it, throw so an unintended bypass stays loud rather than silently clobbering. FOCAS — FwlibFocasClient gains WritePmcBitAsync + _rmwLocks keyed on {addrType}:{byteAddr}. Driver-layer WriteAsync routes Bit writes with a bitIndex through the new path; other Pmc writes still hit the direct pmc_wrpmcrng path. RMW uses cnc_rdpmcrng + Byte dataType to grab the parent byte, bit-op with (current | 1<<bit) or (current & ~(1<<bit)), cnc_wrpmcrng to write back. Rejects out-of-range bits (0-7, FOCAS PMC bytes are 8-bit) with InvalidOperationException. EncodePmcValue's Bit branch now treats a no-bitIndex case as whole-byte boolean (non-zero / zero); bitIndex-present writes never hit this path because they dispatch to WritePmcBitAsync upstream. Tests — 5 new ModbusBitRmwTests + 4 new FocasPmcBitRmwTests + 1 renamed pre-existing test each covering — bit set preserves other bits, bit clear preserves other bits, concurrent bit writes to same word/byte compose correctly (8-parallel stress), bit writes on different parent words proceed without contention (4-parallel), sequential bit sets compose into 0xFF after all 8. Fake PmcRmwFake in FOCAS tests simulates the PMC byte storage + surfaces it through the IFocasClient contract so the test asserts driver-level behavior without needing Fwlib32.dll. FwlibNativeHelperTests.EncodePmcValue_Bit_throws_NotSupported_for_RMW_gap replaced with EncodePmcValue_Bit_without_bit_index_writes_byte_boolean reflecting the new behavior. ModbusDataTypeTests.BitInRegister_write_is_not_supported_in_PR24 renamed to BitInRegister_EncodeRegister_still_rejects_direct_calls; the message assertion updated to match the new defensive message. Modbus tests now 182/182, FOCAS tests now 119/119; full solution builds 0 errors; AbCip/AbLegacy/TwinCAT untouched (those get their RMW pass in a follow-up since libplctag bit access may need a parallel parent-word handle). Task #181 stays pending until that second pass lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 20:25:27 -04:00
d1ca0817e9 Merge pull request (#127) - TwinCAT symbol browser 2026-04-19 20:15:25 -04:00
Joseph Doherty
c95228391d TwinCAT follow-up — Symbol browser via AdsClient + SymbolLoaderFactory. Closes task #188. Adds ITwinCATClient.BrowseSymbolsAsync — IAsyncEnumerable yielding TwinCATDiscoveredSymbol (InstancePath + mapped TwinCATDataType + ReadOnly flag) from the target's flat symbol table. AdsTwinCATClient implementation uses SymbolLoaderFactory.Create(_client, new SymbolLoaderSettings(SymbolsLoadMode.Flat)) + iterates loader.Symbols, maps IEC 61131-3 type names (BOOL/SINT/INT/DINT/LINT/REAL/LREAL/STRING/WSTRING/TIME/DATE/DT/TOD + BYTE/WORD/DWORD/LWORD unsigned-word aliases) through MapSymbolTypeName, checks SymbolAccessRights.Write bit for writable vs read-only. Unsupported types (UDTs / function blocks / arrays / pointers) surface with DataType=null so callers can skip or recurse. TwinCATDriverOptions.EnableControllerBrowse — new bool, default false to preserve the strict-config path. When true, DiscoverAsync iterates each device's BrowseSymbolsAsync, filters via TwinCATSystemSymbolFilter (rejects TwinCAT_*, Constants.*, Mc_*, __*, Global_Version* prefixes + anything empty), skips null-DataType symbols, emits surviving symbols under a per-device Discovered/ sub-folder with InstancePath as both FullName + BrowseName + ReadOnly→ViewOnly/writable→Operate. Pre-declared tags from TwinCATDriverOptions.Tags always emit regardless. Browse failure is non-fatal — exception caught + swallowed, pre-declared tags stay in the address space, operators see the failure in driver health on next read. TwinCATSystemSymbolFilter static class mirrors AbCipSystemTagFilter's shape with TwinCAT-specific prefixes. Fake client updated — BrowseResults list for test setup + FireNotification-style single-invocation on each subscribe, ThrowOnBrowse flag for failure testing. 8 new unit tests — strict path emits only pre-declared when EnableControllerBrowse=false, browse enabled adds Discovered/ folder, filter rejects system prefixes, null-DataType symbols skipped, ReadOnly symbols surface ViewOnly, browse failure leaves pre-declared intact, SystemSymbolFilter theory (10 cases). Total TwinCAT unit tests now 110/110 passing (+17 from the native-notification merge's 93); full solution builds 0 errors; other drivers untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 20:13:33 -04:00
9ca80fd450 Merge pull request (#126) - FOCAS capabilities 2026-04-19 20:01:28 -04:00
Joseph Doherty
1d6015bc87 FOCAS PR 3 — ITagDiscovery + ISubscribable + IHostConnectivityProbe + IPerCallHostResolver. Completes the FOCAS driver — 7-interface capability set matching AbCip/AbLegacy/TwinCAT (minus IAlarmSource — Fanuc CNC alarms live in a different API surface, tracked as a future-phase concern). ITagDiscovery emits pre-declared tags under a FOCAS root + per-device sub-folder keyed on the canonical focas://host:port string with DeviceName fallback. Writable → Operate, non-writable → ViewOnly. No native FOCAS symbol browsing — CNCs don't expose a tag catalogue the way Logix or TwinCAT do; operators declare addresses explicitly. ISubscribable consumes the shared PollGroupEngine — 5th consumer of the engine after Modbus + AbCip + AbLegacy + TwinCAT-poll-mode. 100ms interval floor inherited. FOCAS has no native notification/subscription protocol (unlike TwinCAT ADS), so polling is the only option — every subscribed tag round-trips through cnc_rdpmcrng / cnc_rdparam / cnc_rdmacro on each tick. IHostConnectivityProbe uses the existing IFocasClient.ProbeAsync which in the real FwlibFocasClient calls cnc_statinfo (cheap handshake returning ODBST with tmmode/aut/run/motion/alarm state). Probe loop runs when Enabled=true, catches OperationCanceledException during shutdown, falls through to Stopped on exceptions, emits Running/Stopped transitions via OnHostStatusChanged with the canonical focas://host:port as the host-name key. Same-state spurious-event guard under per-device lock. IPerCallHostResolver maps tag full-ref to DeviceHostAddress for Phase 6.1 bulkhead/breaker keying per plan decision #144 — unknown refs fall back to first device, no devices → DriverInstanceId. ShutdownAsync now disposes PollGroupEngine + cancels/disposes per-device probe CTS + disposes cached clients. DeviceState gains ProbeLock / HostState / HostStateChangedUtc / ProbeCts matching the shape used by AbCip/AbLegacy/TwinCAT. 9 new unit tests in FocasCapabilityTests — discovery tag emission with correct SecurityClassification, subscription initial poll raises OnDataChange, shutdown cancels subscriptions, GetHostStatuses entry-per-device, probe Running / Stopped transitions, ResolveHost for known / unknown / no-devices paths. FocasScaffoldingTests updated with Probe.Enabled=false where the default factory would otherwise try to load Fwlib32.dll during the probe-loop spinup. Total FOCAS unit tests now 115/115 passing (+9 from PR 2's 106); full solution builds 0 errors; Modbus / AbCip / AbLegacy / TwinCAT / other drivers untouched. FOCAS driver is real-wire-capable end-to-end — read / write / discover / subscribe / probe / host-resolve for Fanuc FS 0i/16i/18i/21i/30i/31i/32i/Series 35i/Power Mate i controllers once deployment drops Fwlib32.dll beside the server. Closes task #120 subtask FOCAS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:59:37 -04:00
5cfb0fc6d0 Merge pull request (#125) - FOCAS R/W + real P/Invoke 2026-04-19 19:57:31 -04:00
Joseph Doherty
a2c7fda5f5 FOCAS PR 2 — IReadable + IWritable + real FwlibFocasClient P/Invoke. Closes task #193 early now that strangesast/fwlib provides the licensed DLL references. Skips shipping with the Unimplemented stub as the default — FwlibFocasClientFactory is now the production default, UnimplementedFocasClientFactory stays as an opt-in for tests/deployments without FWLIB access. FwlibNative — narrow P/Invoke surface for the 7 calls the driver actually makes: cnc_allclibhndl3 (open Ethernet handle), cnc_freelibhndl (close), pmc_rdpmcrng + pmc_wrpmcrng (PMC range I/O), cnc_rdparam + cnc_wrparam (CNC parameters), cnc_rdmacro + cnc_wrmacro (macro variables), cnc_statinfo (probe). DllImport targets Fwlib32.dll; deployment places it next to the executable or on PATH. IODBPMC/IODBPSD/ODBM/ODBST marshaled with LayoutKind.Sequential + Pack=1 + fixed byte-array unions (avoids LayoutKind.Explicit complexity; managed-side BitConverter extracts typed values from the byte buffer). Internal helpers FocasPmcAddrType.FromLetter (G=0/F=1/Y=2/X=3/A=4/R=5/T=6/K=7/C=8/D=9/E=10 per Fanuc FOCAS/2 spec) + FocasPmcDataType.FromFocasDataType (Byte=0 / Word=1 / Long=2 / Float=4 / Double=5) exposed for testing without the DLL loaded. FwlibFocasClient is the concrete IFocasClient backed by P/Invoke. Construction is licence-safe — .NET P/Invoke is lazy so instantiating the class does NOT load Fwlib32.dll; DLL loads on first wire call (Connect/Read/Write/Probe). When missing, calls throw DllNotFoundException which the driver surfaces as BadCommunicationError via the normal exception path. Session-scoped handle from cnc_allclibhndl3; Dispose calls cnc_freelibhndl. Dispatch on FocasAreaKind — Pmc reads use pmc_rdpmcrng with the right ADR_* + data-type codes + parses the union via BinaryPrimitives LittleEndian, Parameter reads use cnc_rdparam + IODBPSD, Macro reads use cnc_rdmacro + compute scaled double as McrVal / 10^DecVal. Write paths mirror reads. PMC Bit writes throw NotSupportedException pointing at task #181 (read-modify-write gap — same as Modbus / AbCip / AbLegacy / TwinCAT). Macro writes accept int + pass decimal-point count 0 (decimal precision writes are a future enhancement). Probe calls cnc_statinfo with ODBST result. Driver wiring — FocasDriver now IDriver + IReadable + IWritable. Per-device connection caching via EnsureConnectedAsync + DeviceState.Client. ReadAsync/WriteAsync dispatch through the injected IFocasClient — ordered snapshots preserve per-tag status, OperationCanceledException rethrows, FormatException/InvalidCastException → BadTypeMismatch, OverflowException → BadOutOfRange, NotSupportedException → BadNotSupported, anything else → BadCommunicationError + Degraded health. Connect-failure disposes the half-open client. ShutdownAsync disposes every cached client. Default factory switched — constructor now defaults to FwlibFocasClientFactory (backed by real Fwlib32.dll) rather than UnimplementedFocasClientFactory. UnimplementedFocasClientFactory stays as an opt-in. 41 new tests — 14 in FocasReadWriteTests (ordered unknown-ref handling, successful PMC/Parameter/Macro reads routing through correct FocasAreaKind, repeat-read reuses connection, FOCAS error mapping, exception paths, batched order across areas, non-writable rejection, successful write logging, status mapping, batch ordering, cancellation, shutdown disposes), 27 in FwlibNativeHelperTests (12 letter-mapping cases + 3 unknown rejections + 6 data-type mapping + 4 encode helpers + Bit-write NotSupported). Total FOCAS unit tests now 106/106 passing (+41 from PR 1's 65); full solution builds 0 errors; Modbus / AbCip / AbLegacy / TwinCAT / other drivers untouched. FOCAS driver is real-wire-capable from day one — deployment drops Fwlib32.dll beside the server + driver talks to live FS 0i/16i/18i/21i/30i/31i/32i controllers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:55:37 -04:00
c13fe8f587 Merge pull request (#124) - FOCAS scaffolding 2026-04-19 19:49:47 -04:00
Joseph Doherty
285799a954 FOCAS PR 1 — Scaffolding + Core (FocasDriver skeleton + address parser + stub client). New Driver.FOCAS project for Fanuc CNC controllers (FS 0i/16i/18i/21i/30i/31i/32i/Series 35i/Power Mate i) talking via the Fanuc FOCAS/2 protocol. No NuGet reference to a FOCAS library — FWLIB (Fwlib32.dll) is Fanuc-proprietary + per-customer licensed + cannot be legally redistributed, so the driver is designed from the start to accept an IFocasClient supplied by the deployment side. Default IFocasClientFactory is UnimplementedFocasClientFactory which throws with a clear deployment-docs pointer at Create time so misconfigured servers fail fast rather than mysteriously hanging. Matches the pattern other drivers use for swappable wire layers (Modbus IModbusTransport, AbCip IAbCipTagFactory, TwinCAT ITwinCATClientFactory) — but uniquely, FOCAS ships without a production factory because of licensing. FocasHostAddress parses focas://{host}[:{port}] canonical form with default port 8193 (Fanuc-reserved FOCAS Ethernet port). Default-port stripping on ToString for roundtrip stability. Case-insensitive scheme. Rejects wrong scheme, empty body, invalid port, non-numeric port. FocasAddress handles the three addressing spaces a FOCAS driver touches — PMC (letter + byte + optional bit, X/Y for IO, F/G for PMC-CNC signals, R for internal relay, D for data table, C for counter, K for keep relay, A for message display, E for extended relay, T for timer, with .N bit syntax 0-7), CNC parameters (PARAM:n for a parameter number, PARAM:n/N for bit 0-31 of a parameter), macro variables (MACRO:n). Rejects unknown PMC letters, negative numbers, out-of-range bits (PMC 0-7, parameter 0-31), non-numeric fragments. FocasDataType — Bit / Byte / Int16 / Int32 / Float32 / Float64 / String covering the atomic types PMC reads + CNC parameters + macro variables return. ToDriverDataType widens to the Int32/Float32/Float64/Boolean/String surface. FocasStatusMapper covers the FWLIB EW_* return-code family documented in the FOCAS/1 + FOCAS/2 references — EW_OK=0, EW_FUNC=1 → BadNotSupported, EW_OVRFLOW=2/EW_NUMBER=3/EW_LENGTH=4 → BadOutOfRange, EW_PROT=5/EW_PASSWD=11 → BadNotWritable, EW_NOOPT=6/EW_VERSION=-9 → BadNotSupported, EW_ATTRIB=7 → BadTypeMismatch, EW_DATA=8 → BadNodeIdUnknown, EW_PARITY=9 → BadCommunicationError, EW_BUSY=-1 → BadDeviceFailure, EW_HANDLE=-8 → BadInternalError, EW_UNEXP=-10/EW_SOCKET=-16 → BadCommunicationError. IFocasClient + IFocasClientFactory abstraction — ConnectAsync, IsConnected, ReadAsync returning (value, status) tuple, WriteAsync returning status, ProbeAsync for IHostConnectivityProbe. Deployment supplies the real factory; driver assembly stays licence-clean. FocasDriverOptions + FocasDeviceOptions + FocasTagDefinition + FocasProbeOptions — one instance supports N CNCs, tags cross-key by HostAddress + use canonical FocasAddress strings. FocasDriver implements IDriver only (PRs 2-3 add read/write/discover/subscribe/probe/resolver). InitializeAsync parses each device HostAddress + fails fast on malformed strings → Faulted health. 65 new unit tests in FocasScaffoldingTests covering — 5 valid host forms + 8 invalid + default-port-strip ToString, 12 valid PMC addresses across all 11 canonical letters + 3 parameter forms with + without bit + 2 macro forms, 10 invalid address shapes, canonical roundtrip theory, data-type mapping theory, FWLIB EW_* status mapping theory (9 codes + unknown → generic), DriverType, multi-device Initialize + address parsing, malformed-address fault, shutdown, default factory throws NotSupportedException with deployment pointer + Fwlib32.dll mention. Total project count 31 src + 20 tests; full solution builds 0 errors. Other drivers untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:47:52 -04:00
9da578d5a5 Merge pull request (#123) - TwinCAT native notifications 2026-04-19 18:51:39 -04:00
Joseph Doherty
6c5b202910 TwinCAT follow-up — Native ADS notifications for ISubscribable. Closes task #189 — upgrades TwinCATDriver's subscription path from polling (shared PollGroupEngine) to native AdsClient.AddDeviceNotificationExAsync so the PLC pushes changes on its own cycle rather than the driver polling. Strictly better for latency + CPU — TC2 and TC3 runtimes notify on value change with sub-millisecond latency from the PLC cycle. ITwinCATClient gains AddNotificationAsync — takes symbolPath + TwinCATDataType + optional bitIndex + cycleTime + onChange callback + CancellationToken; returns an ITwinCATNotificationHandle whose Dispose tears the notification down on the wire. Bit-within-word reads supported — the parent word value arrives via the notification, driver extracts the bit before invoking the callback (same ExtractBit path as the read surface from PR 2). AdsTwinCATClient — subscribes to AdsClient.AdsNotificationEx in the ctor, maintains a ConcurrentDictionary<uint, NotificationRegistration> keyed on the server-side notification handle. AddDeviceNotificationExAsync returns Task<ResultHandle> with Handle + ErrorCode; non-NoError throws InvalidOperationException so the driver can catch + retry. Notification event args carry Handle + Value + DataType; lookup in _notifications dict routes the value through any bit-extraction + calls the consumer callback. Consumer-side exceptions are swallowed so a misbehaving callback can't crash the ADS notification thread. Dispose unsubscribes from AdsNotificationEx + clears the dict + disposes AdsClient. NotificationRegistration is ITwinCATNotificationHandle — Dispose fires DeleteDeviceNotificationAsync as fire-and-forget with CancellationToken.None (caller has already committed to teardown; blocking would slow shutdown). TwinCATDriverOptions.UseNativeNotifications — new bool, default true. When true the driver uses native notifications; when false it falls through to the shared PollGroupEngine (same semantics as other libplctag-backed drivers, also a safety valve for targets with notification limits). TwinCATDriver.SubscribeAsync dual-path — if UseNativeNotifications false delegate into _poll.Subscribe (unchanged behavior from PR 3). If true, iterate fullReferences, resolve each to its device's client via EnsureConnectedAsync (reuses PR 2's per-device connection cache), parse the SymbolPath via TwinCATSymbolPath (preserves bit-in-word support), call ITwinCATClient.AddNotificationAsync with a closure over the FullReference (not the ADS symbol — OPC UA subscribers addressed the driver-side name). Per-registration callback bridges (_, value) → OnDataChange event with a fresh DataValueSnapshot (Good status, current UtcNow timestamps). Any mid-registration failure triggers a try/catch that disposes every already-registered handle before rethrowing, keeping the driver in a clean never-existed state rather than half-registered. UnsubscribeAsync dispatches on handle type — NativeSubscriptionHandle disposes all its cached ITwinCATNotificationHandles; anything else delegates to _poll.Unsubscribe for the poll fallback. ShutdownAsync tears down native subs first (so AdsClient-level cleanup happens before the client itself disposes), then PollGroupEngine, then per-device probe CTS + client. NativeSubscriptionHandle DiagnosticId prefixes with twincat-native-sub- so Admin UI + logs can distinguish the paths. 9 new unit tests in TwinCATNativeNotificationTests — native subscribe registers one notification per tag, pushed value via FireNotification fires OnDataChange with the right FullReference (driver-side, not ADS symbol), unsubscribe disposes all notifications, unsubscribe halts future notifications, partial-failure cleanup via FailAfterNAddsFake (first succeeds, second throws → first gets torn down + Notifications count returns to 0 + AddCallCount=2 proving the test actually exercised both calls), shutdown disposes subscriptions, poll fallback works when UseNativeNotifications=false (no native handles created + initial-data push still fires), handle DiagnosticId distinguishes native vs poll. Existing poll-mode ISubscribable tests in TwinCATCapabilityTests updated with UseNativeNotifications=false so they continue testing the poll path specifically — both poll + native paths have test coverage now. TwinCATDriverTests got Probe.Enabled=false added because the default factory creates a real AdsClient which was flakily affected by parallel test execution sharing AMS router state. Total TwinCAT unit tests now 93/93 passing (+8 from PR 3's 85 counting the new native tests + 2 existing tests that got options tweaks). Full solution builds 0 errors; Modbus / AbCip / AbLegacy / other drivers untouched. TwinCAT driver is now feature-complete end-to-end — read / write / discover / native-subscribe / probe / host-resolve, with poll-mode as a safety valve. Unblocks closing task #120 for TwinCAT; remaining sub-task: FOCAS + task #188 (symbol-browsing — lower priority than FOCAS since real config flows still use pre-declared tags).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 18:49:48 -04:00
a0112ddb43 Merge pull request (#122) - TwinCAT capabilities 2026-04-19 18:38:44 -04:00
Joseph Doherty
aeb28cc8e7 TwinCAT PR 3 — ITagDiscovery + ISubscribable + IHostConnectivityProbe + IPerCallHostResolver. Completes the TwinCAT driver — 7-interface capability set matching AbCip / AbLegacy (minus IAlarmSource, same deferral). ITagDiscovery emits pre-declared tags under TwinCAT/device-host folder with DeviceName fallback to HostAddress; Writable→Operate / non-writable→ViewOnly. Symbol-browsing via AdsClient.ReadSymbolsAsync / ReadSymbolInfoAsync deferred to a follow-up (same shape as the @tags deferral for AbCip — needs careful traversal of the TwinCAT symbol table + type graph which the ReadSymbolsAsync API does expose but adds enough scope to warrant its own PR). ISubscribable consumes the shared PollGroupEngine — 4th consumer after Modbus + AbCip + AbLegacy. TwinCAT supports native ADS notifications (AddDeviceNotification) which would be strictly superior to polling, but plumbing through OPC UA semantics + the PollGroupEngine abstraction would require a parallel sampling path; poll-first matches the cross-driver pattern + gets the driver shippable. Follow-up task for native-notification upgrade tracked after merge. IHostConnectivityProbe — per-device probe loop using ITwinCATClient.ProbeAsync which wraps AdsClient.ReadStateAsync (cheap handshake that returns the target's AdsState, succeeds when router + target both respond). Success transitions to Running, any exception or probe-false to Stopped. Same lazy-connect + dispose-on-failure pattern as the read/write path — device state reconnects cleanly after a transient. IPerCallHostResolver maps tag full-ref to DeviceHostAddress for Phase 6.1 (DriverInstanceId, ResolvedHostName) bulkhead/breaker keying per plan decision #144; unknown refs fall back to first device, no devices → DriverInstanceId. ShutdownAsync disposes PollGroupEngine + cancels/disposes every probe CTS + disposes every cached client. DeviceState extended with ProbeLock / HostState / HostStateChangedUtc / ProbeCts matching AbCip/AbLegacy shape. 10 new tests in TwinCATCapabilityTests — discovery tag emission with correct SecurityClassification, subscription initial poll raises OnDataChange, shutdown cancels subscriptions, GetHostStatuses entry-per-device, probe Running transition on ProbeResult=true, probe Stopped on ProbeResult=false, probe disabled when Enabled=false, ResolveHost for known/unknown/no-devices paths. Total TwinCAT unit tests now 85/85 passing (+10 from PR 2's 75); full solution builds 0 errors; other drivers untouched. TwinCAT driver complete end-to-end — any TC2/TC3 AMS target reachable through a router is now shippable with read/write/discover/subscribe/probe/host-resolve, feature-parity with AbCip/AbLegacy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 18:36:55 -04:00
2d5aaf1eda Merge pull request (#121) - TwinCAT R/W 2026-04-19 18:34:52 -04:00
Joseph Doherty
28e3470300 TwinCAT PR 2 — IReadable + IWritable. ITwinCATClient + ITwinCATClientFactory abstraction — one client per AMS target, reused across reads/writes/probes. Shape differs from AbCip/AbLegacy where libplctag handles are per-tag — TwinCAT's AdsClient is a single connection with symbolic reads/writes issued against it, so the abstraction is coarser. AdsTwinCATClient is the default implementation wrapping Beckhoff.TwinCAT.Ads's AdsClient — ConnectAsync calls AdsClient.Connect(AmsNetId.Parse(netId), port) after setting Timeout in ms; ReadValueAsync dispatches TwinCATDataType to the CLR Type via MapToClrType (bool/sbyte/byte/short/ushort/int/uint/long/ulong/float/double/string/uint for time types) and calls AdsClient.ReadValueAsync(symbol, type, ct) which returns ResultAnyValue; unwraps .Value + .ErrorCode and maps non-NoError codes via TwinCATStatusMapper.MapAdsError. BOOL-within-word reads extract the bit after the underlying word read using ExtractBit over short/ushort/int/uint/long/ulong. WriteValueAsync converts the boxed value via ConvertForWrite (Convert.ToXxx per type) then calls AdsClient.WriteValueAsync returning ResultWrite; checks .ErrorCode for status mapping. BOOL-within-word writes throw NotSupportedException with a pointer to task #181 — same RMW gap as Modbus BitInRegister / AbCip BOOL-in-DINT / AbLegacy bit-within-N-file. ProbeAsync calls AdsClient.ReadStateAsync + checks AdsErrorCode.NoError. TwinCATDriver implements IReadable + IWritable — per-device ITwinCATClient cached in DeviceState.Client, lazy-connected on first read/write via EnsureConnectedAsync, connect-failure path disposes + clears the client so next call re-attempts cleanly. ReadAsync ordered-snapshot pattern matching AbCip/AbLegacy: unknown ref → BadNodeIdUnknown, unknown device → BadNodeIdUnknown, OperationCanceledException rethrow, any other exception → BadCommunicationError + Degraded health. WriteAsync similar — non-Writable tag → BadNotWritable upfront, NotSupportedException → BadNotSupported, FormatException/InvalidCastException (guard pattern) → BadTypeMismatch, OverflowException → BadOutOfRange, generic → BadCommunicationError. Symbol name resolution goes through TwinCATSymbolPath.TryParse(def.SymbolPath) with fallback to the raw def.SymbolPath if the path doesn't parse — the Beckhoff AdsClient handles the final validation at wire time. ShutdownAsync disposes each device's client. 14 new unit tests in TwinCATReadWriteTests using FakeTwinCATClient + FakeTwinCATClientFactory — unknown ref → BadNodeIdUnknown, successful DInt read with Good status + captured value + IsConnected=true after EnsureConnectedAsync, repeat reads reuse the connection (one Connect + multiple reads), ADS error code mapping via FakeTwinCATClient.ReadStatuses, read exception → BadCommunicationError + Degraded health, connect exception disposes the client, batched reads preserve order across DInt/Real/String types, non-Writable rejection, successful write logs symbol+type+value+bit for test inspection, write status-code mapping, write exception → BadCommunicationError, batch preserves order across success/non-writable/unknown, cancellation propagation, ShutdownAsync disposes the client. Total TwinCAT unit tests now 75/75 passing (+14 from PR 1's 61); full solution builds 0 errors; Modbus / AbCip / AbLegacy / other drivers untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 18:33:03 -04:00
bffac4db65 Merge pull request (#120) - TwinCAT scaffolding 2026-04-19 18:28:19 -04:00
Joseph Doherty
cd2c0bcadd TwinCAT PR 1 — Scaffolding + Core (TwinCATDriver + AMS address + symbolic path). New Driver.TwinCAT project referencing Beckhoff.TwinCAT.Ads 7.0.172 (the official Beckhoff .NET client — 1.6M+ downloads, actively maintained by Beckhoff + community). Package compiles without a local AMS router; wire calls need a running router (TwinCAT XAR on dev Windows, or the standalone Beckhoff.TwinCAT.Ads.TcpRouter embedded package for headless/CI). Same Core.Abstractions-only project shape as Modbus / AbCip / AbLegacy. TwinCATAmsAddress parses ads://{netId}:{port} canonical form — NetId is 6 dot-separated octets (NOT an IP; AMS router translates), port defaults to 851 (TC3 PLC runtime 1). Validates octet range 0-255 and port 1-65535. Case-insensitive scheme. Default-port stripping in canonical form for roundtrip stability. Rejects wrong scheme, missing //, 5-or-7-octet NetId, out-of-range octets/ports, non-numeric fragments. TwinCATSymbolPath handles IEC 61131-3 symbolic names — single-segment (Counter), POU.variable (MAIN.bStart), GVL.variable (GVL.Counter), structured member access (Motor1.Status.Running), array subscripts (Data[5]), multi-dim arrays (Matrix[1,2]), bit-access (Flags.3, GVL.Status.7), combined scope/member/subscript/bit (MAIN.Motors[0].Status.5). Roundtrip-safe ToAdsSymbolName produces the exact string AdsClient.ReadValue consumes. Rejects leading/trailing dots, space in idents, digit-prefix idents, empty/negative/non-numeric subscripts, unbalanced brackets. Underscore-prefix idents accepted per IEC. TwinCATDataType — BOOL / SINT / USINT / INT / UINT / DINT / UDINT / LINT / ULINT / REAL / LREAL / STRING / WSTRING (UTF-16) / TIME / DATE / DateTime (DT) / TimeOfDay (TOD) / Structure. Wider than Logix's surface — IEC adds WSTRING + TIME/DATE/DT/TOD variants. ToDriverDataType widens unsigned + 64-bit to Int32 matching the Modbus/AbCip/AbLegacy Int64-gap convention. TwinCATStatusMapper — Good / BadInternalError / BadNodeIdUnknown / BadNotWritable / BadOutOfRange / BadNotSupported / BadDeviceFailure / BadCommunicationError / BadTimeout / BadTypeMismatch. MapAdsError covers the ADS error codes a driver actually encounters — 6/7 port unreachable, 1792 service not supported, 1793/1794 invalid index group/offset, 1798 symbol not found (→ BadNodeIdUnknown), 1807 invalid state, 1808 access denied (→ BadNotWritable), 1811/1812 size mismatch (→ BadOutOfRange), 1861 sync timeout, unknown → BadCommunicationError. TwinCATDriverOptions + TwinCATDeviceOptions + TwinCATTagDefinition + TwinCATProbeOptions — one instance supports N AMS targets, Tags cross-key by HostAddress, Probe defaults to 5s interval (unlike AbLegacy there's no default probe address — ADS probe reads AmsRouterState not a user tag, so probe address is implicit). TwinCATDriver IDriver skeleton — InitializeAsync parses each device HostAddress + fails fast on malformed strings → Faulted. 61 new unit tests across 3 files — TwinCATAmsAddressTests (6 valid shapes + 12 invalid shapes + 2 ToString canonicalisation + roundtrip stability), TwinCATSymbolPathTests (9 valid shapes + 12 invalid shapes + underscore prefix + 8-case roundtrip), TwinCATDriverTests (DriverType + multi-device init + malformed-address fault + shutdown + reinit + data-type mapping theory + ADS error-code theory). Total project count 30 src + 19 tests; full solution builds 0 errors; Modbus / AbCip / AbLegacy / other drivers untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 18:26:29 -04:00
7fdf4e5618 Merge pull request (#119) - AbLegacy capabilities 2026-04-19 18:04:42 -04:00
Joseph Doherty
400fc6242c AB Legacy PR 3 — ITagDiscovery + ISubscribable + IHostConnectivityProbe + IPerCallHostResolver. Fills out the AbLegacy capability surface — the driver now implements the same 7-interface set as AbCip (IDriver + IReadable + IWritable + ITagDiscovery + ISubscribable + IHostConnectivityProbe + IPerCallHostResolver). ITagDiscovery emits pre-declared tags under an AbLegacy root folder with a per-device sub-folder keyed on HostAddress (DeviceName fallback to HostAddress when null). Writable tags surface as SecurityClassification.Operate, non-writable as ViewOnly. No controller-side enumeration — PCCC has no @tags equivalent on SLC / MicroLogix / PLC-5 (symbol table isn't exposed the way Logix exposes it), so the pre-declared path is the only discovery mechanism. ISubscribable consumes the shared PollGroupEngine extracted in AB CIP PR 1 — reader delegate points at ReadAsync (already handles lazy runtime init + caching), onChange bridges into the driver's OnDataChange event. 100ms interval floor. Initial-data push on first poll. Makes AbLegacy the third consumer of PollGroupEngine (after Modbus and AbCip). IHostConnectivityProbe — per-device probe loop when ProbeOptions.Enabled + ProbeAddress configured (defaults to S:0 status file word 0). Lazy-init on first tick, re-init on wire failure (destroyed native handle gets recreated rather than silently staying broken). Success transitions device to Running, exception to Stopped, same-state spurious event guard under per-device lock. GetHostStatuses returns one entry per device with current state + last-change timestamp for Admin /hosts surfacing. IPerCallHostResolver maps tag full-ref → DeviceHostAddress for the Phase 6.1 (DriverInstanceId, ResolvedHostName) bulkhead/breaker keying per plan decision #144. Unknown refs fall back to first device's address (invoker handles at capability level as BadNodeIdUnknown); no devices → DriverInstanceId. ShutdownAsync cancels + disposes each probe CTS, disposes PollGroupEngine cancelling active subscriptions, disposes every cached runtime. DeviceState gains ProbeLock / HostState / HostStateChangedUtc / ProbeCts / ProbeInitialized matching AbCip's DeviceState shape. 10 new unit tests in AbLegacyCapabilityTests covering — pre-declared tags emit under AbLegacy/device folder with correct SecurityClassification, subscription initial poll raises OnDataChange with correct value, unsubscribe halts polling (value change post-unsub produces no further events), GetHostStatuses returns one entry per device, probe Running transition on successful read, probe Stopped transition on read exception, probe disabled when ProbeAddress null, ResolveHost returns declared device for known tag, falls back to first device for unknown, falls back to DriverInstanceId when no devices. Total AbLegacy unit tests now 92/92 passing (+10 from PR 2's 82); full solution builds 0 errors; AbCip + Modbus + other drivers untouched. AB Legacy driver now complete end-to-end — SLC 500 / MicroLogix / PLC-5 / LogixPccc all shippable with read / write / discovery / subscribe / probe / host-resolve, feature-parity with AbCip minus IAlarmSource (same deferral per plan).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 18:02:52 -04:00
4438fdd7b1 Merge pull request (#118) - AbLegacy R/W 2026-04-19 18:00:27 -04:00
Joseph Doherty
b2424a0616 AB Legacy PR 2 — IReadable + IWritable. IAbLegacyTagRuntime + IAbLegacyTagFactory abstraction mirrors IAbCipTagRuntime from AbCip PR 3. LibplctagLegacyTagRuntime default implementation wraps libplctag.Tag with Protocol=ab_eip + PlcType dispatched from the profile's libplctag attribute (Slc500/MicroLogix/Plc5/LogixPccc) — libplctag routes PCCC-over-EIP internally based on PlcType, so our layer just forwards the atomic type to Get/Set calls. DecodeValue handles Bit (GetBit when bitIndex is set, else GetInt8!=0), Int/AnalogInt (GetInt16 widened to int), Long (GetInt32), Float (GetFloat32), String (GetString), TimerElement/CounterElement/ControlElement (GetInt32 — sub-element selection is in the libplctag tag name like T4:0.ACC, PLC-side decode picks the right slot). EncodeValue handles the same types; bit-within-word writes throw NotSupportedException pointing at follow-up task #181 (same read-modify-write gap as Modbus BitInRegister). AbLegacyDriver implements IReadable + IWritable with the exact same shape as AbCip PR 3-4 — per-tag lazy runtime init via EnsureTagRuntimeAsync cached in DeviceState.Runtimes dict, ordered-snapshot results, health surface updates. Exception table — OperationCanceledException rethrows, NotSupportedException → BadNotSupported, FormatException/InvalidCastException → BadTypeMismatch (guard pattern C# 11 syntax), OverflowException → BadOutOfRange, anything else → BadCommunicationError. ShutdownAsync disposes every cached runtime so the native tag handles get released. 14 new unit tests in AbLegacyReadWriteTests covering unknown ref → BadNodeIdUnknown, successful N-file read with Good status + captured value, repeat-read reuses cached runtime (init count 1 across 2 reads), libplctag non-zero status mapping (-14 → BadNodeIdUnknown), read exception → BadCommunicationError + Degraded health, batched reads preserve order across N/F/ST types, TagCreateParams composition (gateway/port/path/slc500 attribute/tag-name), non-writable tag → BadNotWritable, successful write encodes + flushes, bit-within-word → BadNotSupported (RmwThrowingFake mirrors LibplctagLegacyTagRuntime's runtime check), write exception → BadCommunicationError, batch preserves order across success+fail+unknown, cancellation propagates, ShutdownAsync disposes runtimes. Total AbLegacy unit tests now 82/82 passing (+14 from PR 1's 68). Full solution builds 0 errors; Modbus + AbCip + other drivers untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:58:38 -04:00
59c99190c6 Merge pull request (#117) - AbLegacy scaffolding 2026-04-19 17:56:15 -04:00
Joseph Doherty
fc575e8dae AB Legacy PR 1 — Scaffolding + Core (AbLegacyDriver + PCCC address parser). New Driver.AbLegacy project with the libplctag 1.5.2 reference + the same Core.Abstractions-only project shape AbCip uses. AbLegacyHostAddress duplicates the ab://gateway[:port]/cip-path parser from AbCip since PCCC-over-EIP uses the same gateway routing convention (SLC 500 direct-wired with empty path, PLC-5 bridged through a ControlLogix chassis with full CIP path). Parser is 30 lines; copy was cheaper than introducing a shared Ab* project just to avoid duplication. AbLegacyAddress handles PCCC file addressing — file-letter + optional file-number + colon + word-number + optional sub-element (.ACC / .PRE / .EN / .DN / .CU / .CD / .LEN / .POS / .ER) + optional /bit-index. Handles the full shape variety — N7:0 (integer file 7 word 0), F8:5 (float file 8 word 5), B3:0/0 (bit file 3 word 0 bit 0), ST9:0 (string file 9 string 0), L9:3 (long file SLC 5/05+), T4:0.ACC (timer accumulator), C5:2.CU (counter count-up bit), R6:0.LEN (control length), I:0/0 (input file bit — no file number for I/O/S), O:1/2 (output file bit), S:1 (status file word), N7:0/3 (bit within integer file). Validates file letters against the canonical SLC/ML/PLC-5 set (N/F/B/L/ST/T/C/R/I/O/S/A). ToLibplctagName roundtrips so the parsed value can be handed straight to libplctag's name= attribute. AbLegacyDataType — Bit / Int (N-file, 16-bit signed) / Long (L-file, 32-bit, SLC 5/05+ only) / Float (F-file, 32-bit IEEE-754) / AnalogInt (A-file) / String (ST-file, 82-byte fixed + length word) / TimerElement / CounterElement / ControlElement. ToDriverDataType widens Long to Int32 matching the Modbus/AbCip Int64-gap convention. AbLegacyStatusMapper shares the OPC UA status constants with AbCip (same numeric values, different namespace). MapLibplctagStatus mirrors AbCip — 0 success, positive pending, negative error code families. MapPcccStatus handles PCCC STS bytes — 0x00 success, 0x10 illegal command, 0x20 bad address, 0x30 protected, 0x40/0x50 busy, 0xF0 extended status. AbLegacyDriverOptions + AbLegacyDeviceOptions + AbLegacyTagDefinition + AbLegacyProbeOptions mirror AbCip shapes — one instance supports N devices via Devices list, Tags list references devices by HostAddress cross-key, Probe uses S:0 by default as the cheap probe address. AbLegacyPlcFamilyProfile for four families — Slc500 (slc500 attribute, 1,0 default path, supports L + ST files, 240B max PCCC packet), MicroLogix (micrologix attribute, empty path for direct EIP, supports ST but not L), Plc5 (plc5 attribute, 1,0 default path, supports ST but predates L), LogixPccc (logixpccc attribute, full Logix ConnectionSize + L file support via the PCCC compatibility layer on ControlLogix). AbLegacyDriver implements IDriver only — InitializeAsync parses each device's HostAddress and selects its profile (fails fast on malformed strings → Faulted health), per-device state with parsed address + options + profile + empty placeholder for PRs 2-3. ShutdownAsync clears the device dict. 68 new unit tests across 3 files — AbLegacyAddressTests (15 valid shapes + 10 invalid shapes + 7 ToLibplctagName roundtrip), AbLegacyHostAndStatusTests (4 valid host + 5 invalid host + 8 PCCC STS + 7 libplctag status), AbLegacyDriverTests (IDriver lifecycle + multi-device init with per-family profile selection + malformed-address fault + shutdown + family profile defaults + ForFamily theory + data-type mapping). Total project count 29 src + 18 tests; full solution builds 0 errors; Modbus + AbCip + other drivers untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:54:25 -04:00
70f5f2cad1 Merge pull request (#116) - AbCip family profiles 2026-04-19 17:20:43 -04:00
Joseph Doherty
60b8d6f2d0 AB CIP PR 9-12 — Per-PLC-family profile tests + GuardLogix safety-tag support. Consolidates PRs 9/10/11/12 from the plan (ControlLogix / CompactLogix / Micro800 / GuardLogix integration suites) into a single PR because the per-family work that actually ships without a live ab_server binary is profile-metadata assertion + unit-level driver-option binding. Per-family integration tests that require a running simulator are deferred to the ab_server-CI follow-up already tracked from PR 3 (download prebuilt Windows binary as GitHub release asset). ControlLogix — baseline profile asserted (controllogix attribute, 4002 LFO ConnectionSize, 1,0 default path, request-packing + connected-messaging, 4000B max fragment). CompactLogix — narrower 504 ConnectionSize for 5069-L3x safety, 500B max fragment, lib attribute compactlogix which libplctag maps to the ControlLogix family internally but via our profile chain we surface it as a distinct knob so future quirk handling (5069 narrow-window regression cases) hangs off the compactlogix attribute. Micro800 — empty CIP path for no-backplane routing, 488B ConnectionSize, 484B fragment cap, request packing + connected messaging both disabled (most models reject Forward_Open), micro800 lib attribute. Test asserts the driver correctly parses an ab://192.168.1.20/ host address with empty path + forwards the empty path through AbCipTagCreateParams so libplctag sees the unconnected-only configuration. GuardLogix — wire protocol identical to ControlLogix (safety partition is a per-tag concern, not a wire-layer distinction) so profile defaults match ControlLogix. New AbCipTagDefinition.SafetyTag field — when true, the driver forces SecurityClassification.ViewOnly in discovery regardless of the Writable flag, and IWritable rejects the write upfront with BadNotWritable. Matches the Rockwell safety-partition isolation model where non-safety-task writes to safety tags would be rejected by the PLC anyway — surfacing the intent at the driver surface prevents wasted wire round-trips + gives Admin UI users a correct ViewOnly rendering. 14 new unit tests in AbCipPlcFamilyTests covering — ControlLogix profile defaults + correct profile selection at Initialize, CompactLogix narrower-than-ControlLogix ConnectionSize + fragment cap, Micro800 empty path parses + SupportsConnectedMessaging=false + SupportsRequestPacking=false + read forwards empty path + micro800 attribute through to libplctag, GuardLogix wire-protocol parity with ControlLogix, GuardLogix safety tag surfaces as ViewOnly in discovery even when Writable=true, GuardLogix safety-tag write rejected with BadNotWritable even when Writable=true, ForFamily theory (4 families → correct libplctag attribute). Total AbCip unit tests now 161/161 passing (+14 from PR 8's 147). Modbus + other drivers untouched; full solution builds 0 errors. PR 13 (IAlarmSource via tag-projected ALMA/ALMD blocks) remains deferred per the plan — feature-flagged pattern not needed before go-live.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:18:51 -04:00
30f971599e Merge pull request (#115) - AbCip host probe + resolver 2026-04-19 17:16:59 -04:00
Joseph Doherty
ac14ba9664 AB CIP PR 8 — IHostConnectivityProbe + IPerCallHostResolver. Per-device probe loop — when AbCipProbeOptions.Enabled + ProbeTagPath are configured, InitializeAsync kicks off one probe task per device that periodically reads the probe tag (lazy-init on first attempt, re-init on wire failure so destroyed native handles get recreated rather than silently staying broken), transitions Running on status==0 or Stopped on non-zero status / exception, raises OnHostStatusChanged with the device HostAddress as the host-name key. TransitionDeviceState guards against spurious same-state events under a per-device lock. ShutdownAsync cancels + disposes each probe's CTS + its captured runtime. DeviceState record gains ProbeLock / HostState / HostStateChangedUtc / ProbeCts / ProbeInitialized fields. IHostConnectivityProbe.GetHostStatuses returns one HostConnectivityStatus per device with the current state + last-change timestamp, surfaced to Admin /hosts per plan decision #144. IPerCallHostResolver.ResolveHost maps a tag full-reference to its DeviceHostAddress via the _tagsByName dict populated at Initialize time, which means UDT member full-references (Motor1.Speed synthesised by PR 6) resolve to the parent UDT's device without extra bookkeeping. Unknown references fall back to the first configured device's host address (invoker handles the actual mislookup at read time as BadNodeIdUnknown), and when no devices are configured resolver returns DriverInstanceId so the single-host fallback pipeline still works. Matches the plan decision #144 contract — Phase 6.1 resilience keys its bulkhead + breaker on (DriverInstanceId, ResolvedHostName) so a dead PLC trips only its own breaker, healthy siblings keep serving. 10 new unit tests in AbCipHostProbeTests covering GetHostStatuses returning one entry per device, probe success transitioning Unknown → Running, probe exception transitioning to Stopped, Enabled=false skipping the loop (no events + state stays Unknown), null ProbeTagPath skipping the loop, multi-device independent probe behavior (one Running + one Stopped simultaneously), ResolveHost for known tags returning the declared DeviceHostAddress, ResolveHost for unknown ref falling back to first device, ResolveHost falling back to DriverInstanceId when no devices, ResolveHost for UDT member walking to the synthesised member definition. Total AbCip unit tests now 147/147 passing (+10 from PR 7's 137). Full solution builds 0 errors; Modbus + other drivers untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:15:10 -04:00
5978ea002d Merge pull request (#114) - AbCip ISubscribable 2026-04-19 17:13:41 -04:00
Joseph Doherty
33780eb64c AB CIP PR 7 — ISubscribable via shared PollGroupEngine. AbCipDriver now implements ISubscribable — Subscribe delegates into the PollGroupEngine extracted in PR 1, Unsubscribe releases the subscription, ShutdownAsync disposes the engine cancelling every active subscription. OnDataChange event wired through the engine's on-change callback so external subscribers see the driver as sender. The engine's reader delegate points at the driver's ReadAsync (already handles lazy runtime init + caching via EnsureTagRuntimeAsync) — each poll tick batch-reads every subscribed tag in one IReadable call. 100ms interval floor inherited from PollGroupEngine.DefaultMinInterval matches Modbus convention. Initial-data push on first poll preserved via forceRaise=true. Exception-tolerant loop preserved — individual read failures show up as DataValueSnapshot with non-Good StatusCode via the status-code mapping PR 3 established. 7 new unit tests in AbCipSubscriptionTests covering initial-poll raising per tag, unchanged value raising only once, value change between polls triggering a new event, Unsubscribe halting the loop, 100ms floor keeping a 5ms request from generating extra events against a stable value, ShutdownAsync cancelling active subscriptions, UDT member subscription routing through the synthesised Motor1.Speed full-reference (proving PR 6's fan-out composes correctly with PR 7's subscription path). Total AbCip unit tests now 137/137 passing (+7 from PR 6's 130). Validates that the shared PollGroupEngine from PR 1 works correctly for a second driver, closing the original motivation for the extraction. Full solution builds 0 errors; Modbus + other drivers untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:11:51 -04:00
521bcb2f68 Merge pull request (#113) - AbCip UDT members 2026-04-19 17:11:08 -04:00
Joseph Doherty
b06a1ba607 AB CIP PR 6 — UDT member-declaration support. Declaration-driven UDT member fan-out — users declare a UDT-typed tag once with an explicit Members list and the driver (1) expands member-addressable tags synthetically at Initialize time so Read/Write/Subscribe hit individual native tags per member, (2) emits a folder + one Variable per member in DiscoverAsync instead of a single opaque Structure Variable. Matches the Logix 5000 addressing convention where members are reached via dotted syntax (Motor1.Speed, Motor1.Running) — AbCipTagPath already parsed this shape in PR 2, so PR 6 just had to wire config→TagPath composition. New AbCipStructureMember record — Name / DataType / Writable / WriteIdempotent — plus optional Members list on AbCipTagDefinition that's ignored for atomic types and optional for Structure types. When Structure has null or empty Members the driver falls back to emitting a single opaque Variable so downstream config can address members manually (the "black box" path documented in AbCipTagDefinition's docstring). AbCipDriver.InitializeAsync now iterates tags + for every Structure tag with non-empty Members synthesises a child AbCipTagDefinition per member (composed full-reference Parent.Member + composed TagPath parent.member passed through to libplctag as a normal symbolic read). Per-member Writable/WriteIdempotent metadata propagates so IWritable correctly rejects writes to members flagged non-writable even when the parent tag is writable — each member stands alone from the resilience + authz perspective. DiscoverAsync gains a matching branch — Structure with Members emits an intermediate folder named after the parent tag + one Variable per member under it (browse name = member.Name, FullName = Parent.Member). Members with Writable=false surface SecurityClassification.ViewOnly, WriteIdempotent flag passes through to the DriverAttributeInfo. Structure without Members falls through to the normal single-Variable path. Whole-UDT read optimization (one libplctag call returns the packed buffer + client-side member decode) is deferred — needs the CIP Template Object class 0x6C reader which is blocked on the same libplctag 1.5.2 TagInfoPlcMapper gap that deferred the real @tags walker in PR 5. AbCipTemplateCache shipped in PR 5 is the drop-in point when that reader lands. Per-member reads today are N native round-trips; whole-UDT optimisation is a perf win, not a correctness gap. 7 new unit tests in AbCipUdtMemberTests — UDT fan-out to Variable children under folder with correct SecurityClassification + WriteIdempotent propagation, member reads via synthesised full-reference with correct per-member values, member writes routing to correct TagPath, member Writable=false flag correctly blocking IWritable, Structure without Members falls back to single Variable, empty Members list treated identically to null, UDT tags coexist with flat tags in the discovery output. Total AbCip unit tests now 130/130 passing (+7 from PR 5's 123). Modbus + other drivers untouched; full solution builds 0 errors. Unblocks PR 7 (ISubscribable) — the poll engine already works with member-level full references.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:09:06 -04:00
dd1389a8e7 Merge pull request (#112) - AbCip ITagDiscovery 2026-04-19 17:07:03 -04:00
Joseph Doherty
447086892e AB CIP PR 5 — ITagDiscovery (pre-declared emission + controller-enumeration scaffolding). DiscoverAsync streams tags to IAddressSpaceBuilder with the same shape the Modbus driver uses, keyed by device host address so one driver instance exposing N PLCs produces N device folders under a shared "AbCip" root. Pre-declared tags from AbCipDriverOptions.Tags emit first, filtered through AbCipSystemTagFilter so __DEFVAL_* / __DEFAULT_* / Routine: / Task: / Local:N:X / Map: / Axis: / Cam: / MotionGroup: infrastructure names never reach the address space. Writable tags map to SecurityClassification.Operate, non-writable to ViewOnly. Controller enumeration (walking the Logix Symbol Object via @tags) is wired up through a new IAbCipTagEnumerator + IAbCipTagEnumeratorFactory abstraction — default EmptyAbCipTagEnumeratorFactory returns an empty sequence so the driver stays production-safe without a real decoder. Tests inject FakeEnumeratorFactory to exercise the discovered-tag path: discovered tags land under a Discovered/ sub-folder, program-scope produces Program:P.Name full references, the IsSystemTag hint + the AbCipSystemTagFilter both act as gates, ReadOnly surfaces SecurityClassification.ViewOnly. The real @tags walker is a follow-up because libplctag 1.5.2 (latest stable on NuGet) does not expose TagInfoPlcMapper / UdtInfoMapper — the DataTypes namespace only ships IPlcMapper<T>, so enumerating the Symbol Object requires either implementing a custom IPlcMapper for the CIP byte layout or raw-buffer decoding via plc_tag_get_raw — both non-trivial enough to warrant their own PR. Code comment on EmptyAbCipTagEnumerator documents the gap + points to the follow-up. AbCipTemplateCache placeholder ships with a ConcurrentDictionary<(device, templateInstanceId), AbCipUdtShape> + Put / TryGet / Clear / Count — the Template Object reader (CIP class 0x6C) populates it in PR 6 and FlushOptionalCachesAsync now clears it. AbCipUdtShape + AbCipUdtMember records describe UDT layout — type name + total size + ordered members with offset / type / array length. AbCipDriver ctor gains optional enumeratorFactory parameter matching the tagFactory pattern from PR 3. TemplateCache exposed internally for PR 6's reader to write into. 25 new unit tests in AbCipDriverDiscoveryTests covering — pre-declared emission under device folder, DeviceName fallback to host address, system-tag filter rejecting pre-declared infrastructure names, cross-device tag filtering (tags for a device this driver does not own are ignored), controller enumeration adds tags under Discovered/, system-tag hint + filter both enforced, ReadOnly → ViewOnly, AbCipTagCreateParams composition (gateway / port / CIP path / libplctag attribute / tag name "@tags" / timeout), default enumerator factory used when not injected, 13 Theory cases covering every AbCipSystemTagFilter pattern, template cache roundtrip + clear, FlushOptionalCachesAsync clears the cache. Total AbCip unit tests now 123/123 passing (+25 from PR 4's 98). Modbus + other existing tests untouched; full solution builds 0 errors. Unblocks PR 6 (UDT structured read/write) + PR 7 (subscriptions consuming PollGroupEngine from PR 1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:05:02 -04:00
cee52a9134 Merge pull request (#111) - AbCip IWritable 2026-04-19 16:59:58 -04:00
Joseph Doherty
257f4fd3f5 AB CIP PR 4 — IWritable implementation. LibplctagTagRuntime.EncodeValue fills in the switch for every atomic Logix type the driver currently surfaces — Bool (standalone BOOL via SetInt8 0/1), SInt/USInt (SetInt8/SetUInt8), Int/UInt (SetInt16/SetUInt16), DInt/UDInt (SetInt32/SetUInt32), LInt/ULInt (SetInt64/SetUInt64), Real (SetFloat32), LReal (SetFloat64), String (SetString 0), Dt (epoch DINT via SetInt32). BOOL-within-DINT writes throw NotSupportedException with a code comment matching the Modbus BitInRegister pattern at ModbusDriver.cs line 640 — the read-modify-write logic + lock-per-DINT discipline is a follow-up PR rather than squeezing it into the initial wire plumbing. Structure writes throw NotSupportedException pointing at PR 6 when UDT support lands. AbCipDriver now implements IWritable. WriteAsync iterates writes preserving order, short-circuits on unknown reference → BadNodeIdUnknown, on non-writable tag definition → BadNotWritable, on unknown device → BadNodeIdUnknown. Happy path materialises the cached runtime via EnsureTagRuntimeAsync (shares PR 3's lazy-init path so read+write on the same tag hits one native handle), EncodeValue into the tag's buffer, WriteAsync flushes, GetStatus confirms the wire status, maps libplctag error codes via AbCipStatusMapper.MapLibplctagStatus, sets health Healthy on success. Per plan decisions #44, #45, #143 the driver does NOT auto-retry writes — that's a resilience-layer concern (Polly pipeline sitting above) keyed on the tag's WriteIdempotent flag. Exception-mapping table — OperationCanceledException rethrows (honors cancellation), NotSupportedException → BadNotSupported (bit-in-DINT, Structure, future unsupported types), FormatException → BadTypeMismatch (Convert.ToInt32 of a non-numeric string), InvalidCastException → BadTypeMismatch (caller passed an object incompatible with the conversion target), OverflowException → BadOutOfRange (value exceeds target type range, e.g. Int16 write of 1_000_000), any other Exception → BadCommunicationError (wire drop, libplctag-internal failure). Health surface updates Degraded on every non-Cancellation exception path, Healthy on success. Introduces AbCipStatusMapper.BadTypeMismatch (0x80730000). 10 new unit tests in AbCipDriverWriteTests covering — unknown ref → BadNodeIdUnknown, non-writable tag → BadNotWritable, successful DInt write encodes + flushes the value + marks WriteCount=1, BOOL-in-DINT rejected as BadNotSupported (separate ThrowingBoolBitFake mirrors LibplctagTagRuntime's runtime check), non-zero libplctag status after write mapped via AbCipStatusMapper (timeout -5 → BadTimeout), FormatException from non-numeric-string write → BadTypeMismatch (RealConvertFake exercises real Convert.ToInt32), OverflowException from Int16 write of 1_000_000 → BadOutOfRange, generic exception during write → BadCommunicationError + health Degraded, batch with mixed success+failure preserves order across four request types, cancellation propagates as OperationCanceledException. FakeAbCipTag's test-fake base class methods made virtual so override hooks work correctly through the IAbCipTagRuntime interface (new-shadow was silently falling through to the base implementation). Total AbCip unit tests now 98/98 passing; Modbus + other existing tests untouched; full solution builds 0 errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:57:52 -04:00
be2379107d Merge pull request (#110) - AbCip IReadable 2026-04-19 16:41:02 -04:00
Joseph Doherty
cc35c77d64 AB CIP PR 3 — IReadable implementation against libplctag. Introduces IAbCipTagRuntime + IAbCipTagFactory abstraction matching the Modbus transport-factory pattern (ctor optional arg, default production impl injected) so the driver's read/status-mapping logic is unit-testable without a live PLC or the native libplctag binary. LibplctagTagRuntime is the default wire-backed implementation — wraps libplctag.Tag + translates our AbCipDataType enum into GetInt8/GetUInt8/GetInt16/GetUInt16/GetInt32/GetUInt32/GetInt64/GetUInt64/GetFloat32/GetFloat64/GetString/GetBit calls covering Bool (standalone + BOOL-in-DINT via .N bit selector), SInt/USInt, Int/UInt, DInt/UDInt, LInt/ULInt, Real, LReal, String, Dt (epoch DINT), with Structure deferred to PR 6. MapPlcType bridges our libplctag attribute strings (controllogix, compactlogix, micro800) to libplctag.PlcType enum; CompactLogix rolls under ControlLogix per libplctag's family grouping which matches the wire protocol reality. AbCipDriver now implements IReadable — ReadAsync iterates fullReferences preserving order, looks up each tag definition + its device, lazily materialises the tag runtime via EnsureTagRuntimeAsync on first touch (cached thereafter for the lifetime of the device), catches OperationCanceledException to honor cancellation, maps libplctag non-zero status via AbCipStatusMapper.MapLibplctagStatus, catches any other exception as BadCommunicationError. Health surface moves to Healthy on success + Degraded with the last error message on failure. Initialize-failure path disposes the half-created runtime before rethrowing so no native handles leak. DeviceState gains a Runtimes dict alongside the existing TagHandles collection; DisposeHandles walks both so ShutdownAsync + ReinitializeAsync cleanly destroy every native tag. 12 new unit tests in AbCipDriverReadTests using FakeAbCipTag / FakeAbCipTagFactory (test fake under tests/...AbCip.Tests/FakeAbCipTag.cs) covering unknown reference → BadNodeIdUnknown, unknown device → BadNodeIdUnknown, successful DInt read with correct Good status + captured value, lazy-init on first read with reuse across subsequent reads, non-zero libplctag status mapping via AbCipStatusMapper, exception during read surfacing as BadCommunicationError with health Degraded, batched reads preserving order + per-tag status, health Healthy after success, TagCreateParams composition from device + profile (gateway / port / CIP path / libplctag attribute / tag name wiring), cancellation propagation via OperationCanceledException, ShutdownAsync disposing every runtime, Initialize-failure disposing the aborted runtime. Total AbCip unit tests now 88/88 passing. Integration test project scaffolding — tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests with AbServerFixture (IAsyncLifetime that starts ab_server when the binary is on PATH, otherwise marks IsAvailable=false), AbServerFact attribute (Fact-equivalent that skips when ab_server is missing), one smoke test exercising DInt read end-to-end. Project runs cleanly — the single smoke test skips on boxes without ab_server (0 failed, 0 passed, 1 skipped) + runs on boxes with it. Follow-up work captured in comments — ab_server CI fixture (download prebuilt Windows x64 binary as GitHub release asset) + per-family JSON profiles + hand-rolled CIP stub for UDT fidelity ship in the PR 6/9-12 window. Solution file updated. Full solution builds 0 errors across all 28 projects. Modbus + other existing tests untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:38:54 -04:00
59b59b8ccd Merge pull request (#109) - AbCip scaffolding 2026-04-19 16:00:28 -04:00
Joseph Doherty
3e0452e8a4 AB CIP PR 2 — scaffolding + Core (AbCipDriver skeleton + libplctag binding + host / tag-path / data-type / status-code parsers + per-family profiles + SafeHandle wrapper + test harness). Ships everything needed to stand up the driver project as a compiling assembly with no wire calls yet — PR 3 adds IReadable against ab_server which is the first PR that actually touches the native library. Project reference shape matches Modbus / OpcUaClient / S7 (only Core.Abstractions, no Core / Configuration / Polly) so the driver stays lean and doesn't drag EF Core into every deployment that wants AB support. libplctag 1.5.2 pinned (1.6.x only exists as alpha — stable 1.5 series covers ControlLogix / CompactLogix / Micro800 / SLC500 / PLC-5 / MicroLogix which matches plan decision #11 family coverage). libplctag.NativeImport arrives transitively. AbCipHostAddress parses ab://gateway[:port]/cip-path canonical strings end-to-end: handles hostname or IP gateway, optional explicit port (default 44818 EtherNet-IP reserved), CIP path including bridged routes (1,2,2,10.0.0.10,1,0), empty path for Micro800 / MicroLogix without backplane routing, case-insensitive scheme, default-port stripping in canonical form for round-trip stability. Opaque string survives straight into libplctag's gateway / path attributes so no translation layer at wire time. AbCipTagPath handles the full Logix symbolic tag surface — controller-scope (Motor1_Speed), program-scope (Program:MainProgram.StepIndex), structured member access (Motor1.Speed.Setpoint), multi-dim array subscripts (Matrix[1,2,3]), bit-within-DINT via .N syntax (Flags.3, Motor.Status.12) with valid range 0-31 per Logix 5000 General Instructions Reference. Structural capture so PR 6 UDT work can walk the path against a cached template without reparsing. Rejects malformed shapes (empty scopes, ident starting with digit, spaces, empty/negative/non-numeric subscripts, unbalanced brackets, leading / trailing dots). Round-trips via ToLibplctagName producing the exact string libplctag's name attribute expects. AbCipDataType mirrors ModbusDataType shape — atomic Bool / SInt / Int / DInt / LInt / USInt / UInt / UDInt / ULInt / Real / LReal / String / Dt plus a Structure marker for UDT-typed tags (resolved via CIP Template Object at discovery time in PR 5/6). ToDriverDataType adapter follows the Modbus widening convention for unsigned + 64-bit until DriverDataType picks those up. AbCipStatusMapper covers the CIP general-status values an AB PLC actually returns during normal operation (0x00/0x04/0x05/0x06/0x08/0x0A/0x0B/0x0E/0x10/0x13/0x16) + libplctag PLCTAG_STATUS_* codes (0, >0 pending, negative error families). Mirrors ModbusDriver.MapModbusExceptionToStatus so Admin UI status displays stay uniform across drivers. PlcTagHandle is a SafeHandle around the int32 native tag ID with plc_tag_destroy slot wired as a no-op for PR 2 (P/Invoke DllImport arrives with PR 3 when the wire calls land). Lifetime guaranteed by the SafeHandle finalizer — every leaked handle gets cleaned up even when the owner is GC'd without explicit Dispose. IsInvalid when native ID <= 0 so destroying a negative (error) handle never happens. Critical because driver-specs.md §3 flags libplctag native heap as invisible to GetMemoryFootprint — leaked handles directly feed the Tier-B recycle trigger. AbCipDriverOptions captures the multi-device shape — one driver instance can talk to N PLCs via Devices[] (each with HostAddress + PlcFamily + optional DeviceName); Tags[] references devices by HostAddress as the cross-key; AbCipProbeOptions + driver-wide Timeout. AbCipDriver implements IDriver only — InitializeAsync parses every device's HostAddress and selects its PlcFamilyProfile (fails fast on malformed strings via InvalidOperationException → Faulted health), per-device state cached in a DeviceState record with parsed address + profile + empty TagHandles dict for later PRs. ReinitializeAsync is the Tier-B escape hatch — shuts down every device, disposes every PlcTagHandle via SafeHandle lifetime, reinitializes from options. ShutdownAsync clears the device dict and flips health to Unknown. PlcFamilies/AbCipPlcFamilyProfile gives four baseline profiles — ControlLogix (4002 ConnectionSize, path 1,0, Large Forward Open + request packing + connected messaging, FW20+ baseline), CompactLogix (narrower 504 default for 5069-L3x safety), Micro800 (488 cap, empty path, unconnected-only, no request packing), GuardLogix (shares ControlLogix wire protocol — safety partition is tag-level, surfaced as ViewOnly in PR 12). Tests — 76 new cases across 4 test classes — AbCipHostAddressTests (10 valid shapes, 10 invalid shapes, ToString canonicalization, round-trip stability), AbCipTagPathTests (18 cases including multi-scope / multi-member / multi-subscript / bit-in-DINT / rejected shapes / underscore idents / round-trip), AbCipStatusMapperTests (12 CIP + 8 libplctag codes), AbCipDriverTests (IDriver lifecycle + multi-device init + malformed-address fault + per-family profile lookup + PlcTagHandle invalid/dispose idempotency + AbCipDataType mapping). Full solution builds 0 errors; 254 warnings are pre-existing xUnit1051 CancellationToken hints outside this PR. Solution file updated to include both new projects. Unblocks PR 3 (IReadable against ab_server) which is the first PR to exercise the native library end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:58:15 -04:00
bff6651b4b Merge pull request (#108) - PollGroupEngine extraction 2026-04-19 15:51:11 -04:00
Joseph Doherty
4ab587707f AB CIP PR 1 — extract shared PollGroupEngine into Core.Abstractions so the AB CIP driver (and any other poll-based driver — S7, FOCAS, AB Legacy) can reuse the subscription loop instead of reimplementing it. Behaviour-preserving refactor of ModbusDriver: SubscriptionState + PollLoopAsync + PollOnceAsync + ModbusSubscriptionHandle lifted verbatim into a new PollGroupEngine class, ModbusDriver's ISubscribable surface now delegates Subscribe/Unsubscribe into the engine and ShutdownAsync calls engine DisposeAsync. Interval floor (100 ms default) becomes a PollGroupEngine constructor knob so per-driver tuning is possible without re-shipping the loop. Initial-data push semantics preserved via forceRaise=true on the first poll. Exception-tolerant loop preserved — reader throws are swallowed, loop continues, driver's health surface remains the single reporting path. Placement in Core.Abstractions (not Core) because driver projects only reference Core.Abstractions by convention (matches OpcUaClient / Modbus / S7 csproj shape); putting the engine in Core would drag EF Core + Serilog + Polly into every driver. Module has no new dependencies beyond System.Collections.Concurrent + System.Threading, so Core.Abstractions stays lightweight. Modbus ctor converted from primary to explicit so the engine field can capture this for the reader + on-change bridge. All 177 ModbusDriver.Tests pass unmodified (Modbus subscription suite, probe suite, cap suite, exception mapper, reconnect, TCP). 10 new direct engine tests in Core.Abstractions.Tests covering: initial force-raise, unchanged-value single-raise, change-between-polls, unsubscribe halts loop, interval-floor clamp, independent subscriptions, reader-exception tolerance, unknown-handle returns false, ActiveSubscriptionCount lifecycle, DisposeAsync cancels all. No changes to driver-specs.md nor to the server Hosting layer — engine is a pure internal building block at this stage. Unblocks AB CIP PR 7 (ISubscribable consumes the engine); also sets up S7 + FOCAS to drop their own poll loops when they re-base.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:34:44 -04:00
2172d49d2e Merge pull request (#107) - in-flight counter 2026-04-19 15:04:29 -04:00
Joseph Doherty
ae8f226e45 Phase 6.1 Stream E.3 partial — in-flight counter feeds CurrentBulkheadDepth
Closes the observer half of #162 that was flagged as "persisted as 0 today"
in PR #105. The Admin /hosts column refresh + FleetStatusHub SignalR push
+ red-badge visual still belong to the visual-compliance pass.

Core.Resilience:
- DriverResilienceStatusTracker gains RecordCallStart + RecordCallComplete
  + CurrentInFlight field on the snapshot record. Concurrent-safe via the
  same ConcurrentDictionary.AddOrUpdate pattern as the other recorder methods.
  Clamps to zero on over-decrement so a stray Complete-without-Start can't
  drive the counter negative.
- CapabilityInvoker gains an optional statusTracker ctor parameter. When
  wired, every ExecuteAsync / ExecuteAsync(void) wraps the pipeline call
  in try / finally that records start/complete — so the counter advances
  cleanly whether the call succeeds, cancels, or throws. Null tracker keeps
  the pre-Phase-6.1 Stream E.3 behaviour exactly.

Server.Hosting:
- ResilienceStatusPublisherHostedService persists CurrentInFlight as the
  DriverInstanceResilienceStatus.CurrentBulkheadDepth column (was 0 before
  this PR). One-line fix on both the insert + update branches.

The in-flight counter is a pragmatic proxy for Polly's internal bulkhead
depth — a future PR wiring Polly telemetry would replace it with the real
value. The shape of the column + the publisher + the Admin /hosts query
doesn't change, so the follow-up is invisible to consumers.

Tests (8 new InFlightCounterTests, all pass):
- Start+Complete nets to zero.
- Nested starts sum; Complete decrements.
- Complete-without-Start clamps to zero.
- Different hosts track independently.
- Concurrent starts (500 parallel) don't lose count.
- CapabilityInvoker observed-mid-call depth == 1 during a pending call.
- CapabilityInvoker exception path still decrements (try/finally).
- CapabilityInvoker without tracker doesn't throw.

Full solution dotnet test: 1243 passing (was 1235, +8). Pre-existing
Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:02:34 -04:00
e032045247 Merge pull request (#106) - Phase 6.4 Stream B staging tables 2026-04-19 14:57:39 -04:00
Joseph Doherty
ad131932d3 Phase 6.4 Stream B.2-B.4 server-side — EquipmentImportBatch staging + FinaliseBatch transaction
Closes the server-side/data-layer piece of Phase 6.4 Stream B.2-B.4. The
CSV-import preview + modal UI (Stream B.3/B.5) still belongs to the Admin
UI follow-up — this PR owns the staging tables + atomic finalise alone.

Configuration:
- New EquipmentImportBatch entity (Id, ClusterId, CreatedBy, CreatedAtUtc,
  RowsStaged/Accepted/Rejected, FinalisedAtUtc?). Composite index on
  (CreatedBy, FinalisedAtUtc) powers the Admin preview modal's "my open
  batches" query.
- New EquipmentImportRow entity — one row per CSV row, 8 required columns
  from decision #117 + 9 optional from decision #139 + IsAccepted flag +
  RejectReason. FK to EquipmentImportBatch with cascade delete so
  DropBatch collapses the whole tree.
- EF migration 20260419_..._AddEquipmentImportBatch.
- SchemaComplianceTests expected tables list gains the two new tables.

Admin.Services.EquipmentImportBatchService:
- CreateBatchAsync — new header row, caller-supplied ClusterId + CreatedBy.
- StageRowsAsync(batchId, acceptedRows, rejectedRows) — bulk-inserts the
  parsed CSV rows into staging. Rejected rows carry LineNumberInFile +
  RejectReason for the preview modal. Throws when the batch is finalised.
- DropBatchAsync — removes batch + cascaded rows. Throws when the batch
  was already finalised (rollback via staging is not a time machine).
- FinaliseBatchAsync(batchId, generationId, driverInstanceId, unsLineId) —
  atomic apply. Opens an EF transaction when the provider supports it
  (SQL Server in prod; InMemory in tests skips the tx), bulk-inserts
  every accepted staging row into Equipment, stamps
  EquipmentImportBatch.FinalisedAtUtc, commits. Failure rolls back so
  Equipment never partially mutates. Idempotent-under-double-call:
  second finalise throws ImportBatchAlreadyFinalisedException.
- ListByUserAsync(createdBy, includeFinalised) — the Admin preview modal's
  backing query. OrderByDescending on CreatedAtUtc so the most-recent
  batch shows first.
- Two exception types: ImportBatchNotFoundException +
  ImportBatchAlreadyFinalisedException.

ExternalIdReservation merging (ZTag + SAPID fleet-wide uniqueness) is NOT
done here — a narrower follow-up wires it once the concurrent-insert test
matrix is green.

Tests (10 new EquipmentImportBatchServiceTests, all pass):
- CreateBatch populates Id + CreatedAtUtc + zero-ed counters.
- StageRows accepted + rejected both persist; counters advance.
- DropBatch cascades row delete.
- DropBatch after finalise throws.
- Finalise translates accepted staging rows → Equipment under the target
  GenerationId + DriverInstanceId + UnsLineId.
- Finalise twice throws.
- Finalise of unknown batch throws.
- Stage after finalise throws.
- ListByUserAsync filters by creator + finalised flag.
- Drop of unknown batch is a no-op (idempotent rollback).

Full solution dotnet test: 1235 passing (was 1225, +10). Pre-existing
Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:55:39 -04:00
98b69ff4f9 Merge pull request (#105) - ResilienceStatusPublisherHostedService 2026-04-19 14:37:53 -04:00
Joseph Doherty
016122841b Phase 6.1 Stream E.2 partial — ResilienceStatusPublisherHostedService persists tracker snapshots to DB
Closes the HostedService half of Phase 6.1 Stream E.2 flagged as a follow-up
when the DriverResilienceStatusTracker shipped in PR #82. The Admin /hosts
column refresh + SignalR push + red-badge visual (Stream E.3) remain
deferred to the visual-compliance pass — this PR owns the persistence
story alone.

Server.Hosting:
- ResilienceStatusPublisherHostedService : BackgroundService. Samples the
  DriverResilienceStatusTracker every TickInterval (default 5 s) and upserts
  each (DriverInstanceId, HostName) counter pair into
  DriverInstanceResilienceStatus via EF. New rows on first sight; in-place
  updates on subsequent ticks.
- PersistOnceAsync extracted public so tests drive one tick directly —
  matches the ScheduledRecycleHostedService pattern for deterministic
  timing.
- Best-effort persistence: a DB outage logs a warning + continues; the next
  tick retries. Never crashes the app on sample failure. Cancellation
  propagates through cleanly.
- Tracks the bulkhead depth / recycle / footprint columns the entity was
  designed for. CurrentBulkheadDepth currently persisted as 0 — the tracker
  doesn't yet expose live bulkhead depth; a narrower follow-up wires the
  Polly bulkhead-depth observer into the tracker.

Tests (6 new in ResilienceStatusPublisherHostedServiceTests):
- Empty tracker → tick is a no-op, zero rows written.
- Single-host counters → upsert a new row with ConsecutiveFailures + breaker
  timestamp + sampled timestamp.
- Second tick updates the existing row in place (not a second insert).
- Multi-host pairs persist independently.
- Footprint counters (Baseline + Current) round-trip.
- TickCount advances on every PersistOnceAsync call.

Full solution dotnet test: 1225 passing (was 1219, +6). Pre-existing
Client.CLI Subscribe flake unchanged.

Production wiring (Program.cs) example:
  builder.Services.AddSingleton<DriverResilienceStatusTracker>();
  builder.Services.AddHostedService<ResilienceStatusPublisherHostedService>();
  // Tracker gets wired into CapabilityInvoker via OtOpcUaServer resolution
  // + the existing Phase 6.1 layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:36:00 -04:00
244a36e03e Merge pull request (#104) - IPerCallHostResolver + decision #144 wire-in 2026-04-19 12:33:23 -04:00
Joseph Doherty
4de94fab0d Phase 6.1 Stream A remaining — IPerCallHostResolver + DriverNodeManager per-call host dispatch (decision #144)
Closes the per-device isolation gap flagged at the Phase 6.1 Stream A wire-up
(PR #78 used driver.DriverInstanceId as the pipeline host for every call, so
multi-host drivers like Modbus with N PLCs shared one pipeline — one dead PLC
poisoned sibling breakers). Decision #144 requires per-device isolation; this
PR wires it without breaking single-host drivers.

Core.Abstractions:
- IPerCallHostResolver interface. Optional driver capability. Drivers with
  multi-host topology (Modbus across N PLCs, AB CIP across a rack, etc.)
  implement this; single-host drivers (Galaxy, S7 against one PLC, OpcUaClient
  against one remote server) leave it alone. Must be fast + allocation-free
  — called once per tag on the hot path. Unknown refs return empty so dispatch
  falls back to single-host without throwing.

Server/OpcUa/DriverNodeManager:
- Captures `driver as IPerCallHostResolver` at construction alongside the
  existing capability casts.
- New `ResolveHostFor(fullReference)` helper returns either the resolver's
  answer or the driver's DriverInstanceId (single-host fallback). Empty /
  whitespace resolver output also falls back to DriverInstanceId.
- Every dispatch site now passes `ResolveHostFor(fullRef)` to the invoker
  instead of `_driver.DriverInstanceId` — OnReadValue, OnWriteValue, all four
  HistoryRead paths. The HistoryRead Events path tolerates fullRef=null and
  falls back to DriverInstanceId for those cluster-wide event queries.
- Drivers without IPerCallHostResolver observe zero behavioural change:
  every call still keys on DriverInstanceId, same as before.

Tests (4 new PerCallHostResolverDispatchTests, all pass):
- DeadPlc_DoesNotOpenBreaker_For_HealthyPlc_With_Resolver — 2 PLCs behind
  one driver; hammer the dead PLC past its breaker threshold; assert the
  healthy PLC's first call succeeds on its first attempt (decision #144).
- EmptyString / unknown-ref fallback behaviour documented via test.
- WithoutResolver_SameHost_Shares_One_Pipeline — regression guard for the
  single-host pre-existing behaviour.
- WithResolver_TwoHosts_Get_Two_Pipelines — builds the CachedPipelineCount
  assertion to confirm the shared-builder cache keys correctly.

Full solution dotnet test: 1219 passing (was 1215, +4). Pre-existing
Client.CLI Subscribe flake unchanged.

Adoption: Modbus driver (#120 follow-up), AB CIP / AB Legacy / TwinCAT
drivers (also #120) implement the interface and return the per-tag PLC host
string. Single-host drivers stay silent and pay zero cost.

Remaining sub-items of #160 still deferred:
- IAlarmSource.SubscribeAlarmsAsync + AcknowledgeAsync invoker wrapping.
  Non-trivial because alarm subscription is push-based from driver through
  IAlarmConditionSink — the wrap has to happen at the driver-to-server glue
  rather than a synchronous dispatch site.
- Roslyn analyzer asserting every capability-interface call routes through
  CapabilityInvoker. Substantial (separate analyzer project + test harness);
  noise-value ratio favors shipping this post-v2-GA once the coverage is
  known-stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:31:24 -04:00
fdd0bf52c3 Merge pull request (#103) - Phase 6.1 Stream A ResilienceConfig 2026-04-19 12:23:47 -04:00
Joseph Doherty
7b50118b68 Phase 6.1 Stream A follow-up — DriverInstance.ResilienceConfig JSON column + parser + OtOpcUaServer wire-in
Closes the Phase 6.1 Stream A.2 "per-instance overrides bound from
DriverInstance.ResilienceConfig JSON column" work flagged as a follow-up
when Stream A.1 shipped in PR #78. Every driver can now override its Polly
pipeline policy per instance instead of inheriting pure tier defaults.

Configuration:
- DriverInstance entity gains a nullable `ResilienceConfig` string column
  (nvarchar(max)) + SQL check constraint `CK_DriverInstance_ResilienceConfig_IsJson`
  that enforces ISJSON when not null. Null = use tier defaults (decision
  #143 / unchanged from pre-Phase-6.1).
- EF migration `20260419161008_AddDriverInstanceResilienceConfig`.
- SchemaComplianceTests expected-constraint list gains the new CK name.

Core.Resilience.DriverResilienceOptionsParser:
- Pure-function parser. ParseOrDefaults(tier, json, out diag) returns the
  effective DriverResilienceOptions — tier defaults with per-capability /
  bulkhead overrides layered on top when the JSON payload supplies them.
  Partial policies (e.g. Read { retryCount: 10 }) fill missing fields from
  the tier default for that capability.
- Malformed JSON falls back to pure tier defaults + surfaces a human-readable
  diagnostic via the out parameter. Callers log the diag but don't fail
  startup — a misconfigured ResilienceConfig must not brick a working
  driver.
- Property names + capability keys are case-insensitive; unrecognised
  capability names are logged-and-skipped; unrecognised shape-level keys
  are ignored so future shapes land without a migration.

Server wire-in:
- OtOpcUaServer gains two optional ctor params: `tierLookup` (driverType →
  DriverTier) + `resilienceConfigLookup` (driverInstanceId → JSON string).
  CreateMasterNodeManager now resolves tier + JSON for each driver, parses
  via DriverResilienceOptionsParser, logs the diagnostic if any, and
  constructs CapabilityInvoker with the merged options instead of pure
  Tier A defaults.
- OpcUaApplicationHost threads both lookups through. Default null keeps
  existing tests constructing without either Func unchanged (falls back
  to Tier A + tier defaults exactly as before).

Tests (13 new DriverResilienceOptionsParserTests):
- null / whitespace / empty-object JSON returns pure tier defaults.
- Malformed JSON falls back + surfaces diagnostic.
- Read override merged into tier defaults; other capabilities untouched.
- Partial policy fills missing fields from tier default.
- Bulkhead overrides honored.
- Unknown capability skipped + surfaced in diagnostic.
- Property names + capability keys are case-insensitive.
- Every tier × every capability × empty-JSON round-trips tier defaults
  exactly (theory).

Full solution dotnet test: 1215 passing (was 1202, +13). Pre-existing
Client.CLI Subscribe flake unchanged.

Production wiring (Program.cs) example:
  Func<string, DriverTier> tierLookup = type => type switch
  {
      "Galaxy" => DriverTier.C,
      "Modbus" or "S7" => DriverTier.B,
      "OpcUaClient" => DriverTier.A,
      _ => DriverTier.A,
  };
  Func<string, string?> cfgLookup = id =>
      db.DriverInstances.AsNoTracking().FirstOrDefault(x => x.DriverInstanceId == id)?.ResilienceConfig;
  var host = new OpcUaApplicationHost(..., tierLookup: tierLookup, resilienceConfigLookup: cfgLookup);

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:21:42 -04:00
eac457fa7c Merge pull request (#102) - Phase 6.4 Stream D server-side 2026-04-19 11:59:36 -04:00
Joseph Doherty
c1cab33e38 Phase 6.4 Stream D server-side — IdentificationFolderBuilder materializes OPC 40010 Machinery Identification sub-folder
Closes the server-side / non-UI piece of Phase 6.4 Stream D. The Razor
`IdentificationFields.razor` component for Admin-UI editing ships separately
when the Admin UI pass lands (still tracked under #157 UI follow-up).

Core.OpcUa additions:
- IdentificationFolderBuilder — pure-function builder that materializes the
  OPC 40010 Machinery companion-spec Identification sub-folder per decision
  #139. Reads the nine nullable columns off an Equipment row:
  Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision,
  YearOfConstruction (short → OPC UA Int32), AssetLocation, ManufacturerUri,
  DeviceManualUri. Emits one AddProperty call per non-null field; skips the
  sub-folder entirely when all nine are null so browse trees don't carry
  pointless empty folders.
- HasAnyFields(equipment) — cheap short-circuit so callers can decide
  whether to invoke Folder() at all.
- FolderName constant ("Identification") + FieldNames list exposed so
  downstream tools / tests can cross-reference without duplicating the
  decision-#139 field set.

ACL binding: the sub-folder + variables live under the Equipment node so
Phase 6.2's PermissionTrie treats them as part of the Equipment ScopeId —
no new scope level. A user with Equipment-level grant reads the
Identification fields; a user without gets BadUserAccessDenied on both the
Equipment node + its Identification variables. Documented in the class
remarks; cross-reference update to acl-design.md is a follow-up.

Tests (9 new IdentificationFolderBuilderTests):
- HasAnyFields all-null false / any-non-null true.
- Build all-null returns null + doesn't emit Folder.
- Build fully-populated emits all 9 fields in decision #139 order.
- Only non-null fields are emitted (3-of-9 case).
- YearOfConstruction short widens to DriverDataType.Int32 with int value.
- String values round-trip through AddProperty.
- FieldNames constant matches decision #139 exactly.
- FolderName is "Identification".

Full solution dotnet test: 1202 passing (was 1193, +9). Pre-existing
Client.CLI Subscribe flake unchanged.

Production integration: the component that consumes this is the
address-space-build flow that walks the live Equipment table + calls
IdentificationFolderBuilder.Build(equipmentFolder, equipment) under each
Equipment node. That integration is the remaining Stream D follow-up
alongside the Razor UI component.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:57:39 -04:00
0c903ff4e0 Merge pull request (#101) - Phase 6.1 Stream B.4 hosted service 2026-04-19 11:44:16 -04:00
Joseph Doherty
c4a92f424a Phase 6.1 Stream B.4 follow-up — ScheduledRecycleHostedService drives registered schedulers on a fixed tick
Turns the Phase 6.1 Stream B.4 pure-logic ScheduledRecycleScheduler (shipped
in PR #79) into a running background feature. A Tier C driver registers its
scheduler at startup; the hosted service ticks every TickInterval (default
1 min) and invokes TickAsync on each registered scheduler.

Server.Hosting:
- ScheduledRecycleHostedService : BackgroundService. AddScheduler(s) must be
  called before StartAsync — registering post-start throws
  InvalidOperationException to avoid "some ticks saw my scheduler, some
  didn't" races. ExecuteAsync loops on Task.Delay(TickInterval, _timeProvider,
  stoppingToken) + delegates to a public TickOnceAsync method for one tick.
- TickOnceAsync extracted as the unit-of-work so tests drive it directly
  without needing to synchronize with FakeTimeProvider + BackgroundService
  timing semantics.
- Exception isolation: if one scheduler throws, the loop logs + continues
  to the next scheduler. A flaky supervisor can't take down the tick for
  every other Tier C driver.
- Diagnostics: TickCount + SchedulerCount properties for tests + logs.

Tests (7 new ScheduledRecycleHostedServiceTests, all pass):
- TickOnce before interval doesn't fire; TickCount still advances.
- TickOnce at/after interval fires the underlying scheduler exactly once.
- Multiple ticks accumulate count.
- AddScheduler after StartAsync throws.
- Throwing scheduler doesn't poison its neighbours (logs + continues).
- SchedulerCount matches registrations.
- Empty scheduler list ticks cleanly (no-op + counter advances).

Full solution dotnet test: 1193 passing (was 1186, +7). Pre-existing
Client.CLI Subscribe flake unchanged.

Production wiring (Program.cs):
  builder.Services.AddSingleton<ScheduledRecycleHostedService>();
  builder.Services.AddHostedService(sp => sp.GetRequiredService<ScheduledRecycleHostedService>());
  // During DI configuration, once Tier C drivers + their ScheduledRecycleSchedulers
  // are resolved, call host.AddScheduler(scheduler) for each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:42:08 -04:00
510e488ea4 Merge pull request (#100) - Readiness doc all blockers closed 2026-04-19 11:35:34 -04:00
8994e73a0b Merge pull request (#99) - Phase 6.3 Stream C core 2026-04-19 11:33:49 -04:00
Joseph Doherty
e71f44603c v2 release-readiness — blocker #3 closed; all three code-path blockers shut
Phase 6.3 Streams A + C core shipped (PRs #98-99):
- RedundancyCoordinator + ClusterTopologyLoader read the shared config DB +
  enforce the Phase 6.3 invariants (1-2 nodes, unique ApplicationUri, ≤1
  Primary in Warm/Hot). Startup fails fast on violation.
- RedundancyStatePublisher orchestrates topology + apply lease + recovery
  state + peer reachability through ServiceLevelCalculator. Edge-triggered
  OnStateChanged + OnServerUriArrayChanged events the OPC UA variable-node
  layer subscribes to.

Doc updates:
- Top status flips from NOT YET RELEASE-READY → RELEASE-READY (code-path).
  Remaining work is manual (client interop matrix, deployment signoff,
  OPC UA CTT pass) + hardening follow-ups that don't block v2 GA ship.
- Release-blocker #3 section struck through + CLOSED with PR links.
  Remaining Phase 6.3 surfaces (peer-probe HostedServices, OPC UA
  variable-node binding, sp_PublishGeneration lease wrap, client interop)
  explicitly listed as hardening follow-ups.
- Change log: new dated entry.

All three release blockers identified at the capstone are closed:
- #1 Phase 6.2 dispatch wiring  → PR #94 (2026-04-19)
- #2 Phase 6.1 Stream D wiring  → PR #96 (2026-04-19)
- #3 Phase 6.3 Streams A/C core → PRs #98-99 (2026-04-19)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:33:37 -04:00
Joseph Doherty
c4824bea12 Phase 6.3 Stream C core — RedundancyStatePublisher + PeerReachability; orchestrates calculator inputs end-to-end
Wires the Phase 6.3 Stream B pure-logic pieces (ServiceLevelCalculator,
RecoveryStateManager, ApplyLeaseRegistry) + Stream A topology loader
(RedundancyCoordinator) into one orchestrator the runtime + OPC UA node
surface consume. The actual OPC UA variable-node plumbing (mapping
ServiceLevel Byte + ServerUriArray String[] onto the Opc.Ua.Server stack)
is narrower follow-up on top of this — the publisher emits change events
the OPC UA layer subscribes to.

Server.Redundancy additions:
- PeerReachability record + PeerReachabilityTracker — thread-safe
  per-peer-NodeId holder of the latest (HttpHealthy, UaHealthy) tuple. Probe
  loops (Stream B.1/B.2 runtime follow-up) write via Update; the publisher
  reads via Get. PeerReachability.FullyHealthy / Unknown sentinels for the
  two most-common states.
- RedundancyStatePublisher — pure orchestrator, no background timer, no OPC
  UA stack dep. ComputeAndPublish reads the 6 inputs + calls the calculator:
    * role (from coordinator.Current.SelfRole)
    * selfHealthy (caller-supplied Func<bool>)
    * peerHttpHealthy + peerUaHealthy (aggregate across all peers in
      coordinator.Current.Peers)
    * applyInProgress (ApplyLeaseRegistry.IsApplyInProgress)
    * recoveryDwellMet (RecoveryStateManager.IsDwellMet)
    * topologyValid (coordinator.IsTopologyValid)
    * operatorMaintenance (caller-supplied Func<bool>)
  Before-coordinator-init returns NoData=1 so clients never see an
  authoritative value from an un-bootstrapped server.
  OnStateChanged event fires edge-triggered when the byte changes;
  OnServerUriArrayChanged fires edge-triggered when the topology's self-first
  peer-sorted URI array content changes.
- ServiceLevelSnapshot record — per-tick output with Value + Band +
  Topology. The OPC UA layer's ServiceLevel Byte node subscribes to
  OnStateChanged; the ServerUriArray node subscribes to OnServerUriArrayChanged.

Tests (8 new RedundancyStatePublisherTests, all pass):
- Before-init returns NoData (Value=1, Band=NoData).
- Authoritative-Primary when healthy + peer fully reachable.
- Isolated-Primary (230) retains authority when peer unreachable — matches
  decision #154 non-promotion semantics.
- Mid-apply band dominates: open lease → Value=200 even with peer healthy.
- Self-unhealthy → NoData regardless of other inputs.
- OnStateChanged fires only on value transitions (edge-triggered).
- OnServerUriArrayChanged fires once per topology content change; repeat
  ticks with same topology don't re-emit.
- Standalone cluster treats healthy as AuthoritativePrimary=255.

Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Server.Tests for the
coordinator-backed publisher tests.

Full solution dotnet test: 1186 passing (was 1178, +8). Pre-existing
Client.CLI Subscribe flake unchanged.

Closes the core of release blocker #3 — the pure-logic + orchestration
layer now exists + is unit-tested. Remaining Stream C surfaces: OPC UA
ServiceLevel Byte variable wiring (binds to OnStateChanged), ServerUriArray
String[] wiring (binds to OnServerUriArrayChanged), RedundancySupport
static from RedundancyMode. Those touch the OPC UA stack directly + land
as Stream C.2 follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:31:50 -04:00
e588c4f980 Merge pull request (#98) - Phase 6.3 Stream A topology loader 2026-04-19 11:26:11 -04:00
Joseph Doherty
84fe88fadb Phase 6.3 Stream A — RedundancyTopology + ClusterTopologyLoader + RedundancyCoordinator
Lands the data path that feeds the Phase 6.3 ServiceLevelCalculator shipped in
PR #89. OPC UA node wiring (ServiceLevel variable + ServerUriArray +
RedundancySupport) still deferred to task #147; peer-probe loops (Stream B.1/B.2
runtime layer beyond the calculator logic) deferred.

Server.Redundancy additions:
- RedundancyTopology record — immutable snapshot (ClusterId, SelfNodeId,
  SelfRole, Mode, Peers[], SelfApplicationUri). ServerUriArray() emits the
  OPC UA Part 4 §6.6.2.2 shape (self first, peers lexicographically by
  NodeId). RedundancyPeer record with per-peer Host/OpcUaPort/DashboardPort/
  ApplicationUri so the follow-up peer-probe loops know where to probe.
- ClusterTopologyLoader — pure fn from ServerCluster + ClusterNode[] to
  RedundancyTopology. Enforces Phase 6.3 Stream A.1 invariants:
    * At least one node per cluster.
    * At most 2 nodes (decision #83, v2.0 cap).
    * Every node belongs to the target cluster.
    * Unique ApplicationUri across the cluster (OPC UA Part 4 trust pin,
      decision #86).
    * At most 1 Primary per cluster in Warm/Hot modes (decision #84).
    * Self NodeId must be a member of the cluster.
  Violations throw InvalidTopologyException with a decision-ID-tagged message
  so operators know which invariant + what to fix.
- RedundancyCoordinator singleton — holds the current topology + IsTopologyValid
  flag. InitializeAsync throws on invariant violation (startup fails fast).
  RefreshAsync logs + flips IsTopologyValid=false (runtime won't tear down a
  running server; ServiceLevelCalculator falls to InvalidTopology band = 2
  which surfaces the problem to clients without crashing). CAS-style swap
  via Volatile.Write so readers always see a coherent snapshot.

Tests (10 new ClusterTopologyLoaderTests):
- Single-node standalone loads + empty peer list.
- Two-node cluster loads self + peer.
- ServerUriArray puts self first + peers sort lexicographically.
- Empty-nodes throws.
- Self-not-in-cluster throws.
- Three-node cluster rejected with decision #83 message.
- Duplicate ApplicationUri rejected with decision #86 shape reference.
- Two Primaries in Warm mode rejected (decision #84 + runtime-band reference).
- Cross-cluster node rejected.
- None-mode allows any role mix (standalone clusters don't enforce Primary count).

Full solution dotnet test: 1178 passing (was 1168, +10). Pre-existing
Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:24:14 -04:00
59f793f87c Merge pull request (#97) - Readiness doc blocker2 closed 2026-04-19 11:18:26 -04:00
37ba9e8d14 Merge pull request (#96) - Phase 6.1 Stream D wiring follow-up 2026-04-19 11:16:57 -04:00
Joseph Doherty
a8401ab8fd v2 release-readiness — blocker #2 closed; doc reflects state
PR #96 closed the Phase 6.1 Stream D config-cache wiring blocker.

- Status line: "one of three release blockers remains".
- Blocker #2 struck through + CLOSED with PR link. Periodic-poller + richer-
  snapshot-payload follow-ups downgraded to hardening.
- Change log: dated entry.

One blocker remains: Phase 6.3 Streams A/C/F redundancy runtime (tasks
#145, #147, #150).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:16:31 -04:00
Joseph Doherty
19a0bfcc43 Phase 6.1 Stream D follow-up — SealedBootstrap consumes ResilientConfigReader + GenerationSealedCache + StaleConfigFlag; /healthz surfaces the flag
Closes release blocker #2 from docs/v2/v2-release-readiness.md — the
generation-sealed cache + resilient reader + stale-config flag shipped as
unit-tested primitives in PR #81, but no production path consumed them until
now. This PR wires them end-to-end.

Server additions:
- SealedBootstrap — Phase 6.1 Stream D consumption hook. Resolves the node's
  current generation through ResilientConfigReader's timeout → retry →
  fallback-to-sealed pipeline. On every successful central-DB fetch it seals
  a fresh snapshot to <cache-root>/<cluster>/<generationId>.db so a future
  cache-miss has a known-good fallback. Alongside the original NodeBootstrap
  (which still uses the single-file ILocalConfigCache); Program.cs can
  switch between them once operators are ready for the generation-sealed
  semantics.
- OpcUaApplicationHost: new optional staleConfigFlag ctor parameter. When
  wired, HealthEndpointsHost consumes `flag.IsStale` via the existing
  usingStaleConfig Func<bool> hook. Means `/healthz` actually reports
  `usingStaleConfig: true` whenever a read fell back to the sealed cache —
  closes the loop between Stream D's flag + Stream C's /healthz body shape.

Tests (4 new SealedBootstrapIntegrationTests, all pass):
- Central-DB success path seals snapshot + flag stays fresh.
- Central-DB failure falls back to sealed snapshot + flag flips stale (the
  SQL-kill scenario from Phase 6.1 Stream D.4.a).
- No-snapshot + central-down throws GenerationCacheUnavailableException
  with a clear error (the first-boot scenario from D.4.c).
- Next successful bootstrap after a fallback clears the stale flag.

Full solution dotnet test: 1168 passing (was 1164, +4). Pre-existing
Client.CLI Subscribe flake unchanged.

Production activation: Program.cs wires SealedBootstrap (instead of
NodeBootstrap), constructs OpcUaApplicationHost with the staleConfigFlag,
and a HostedService polls sp_GetCurrentGenerationForCluster periodically so
peer-published generations land in this node's sealed cache. The poller
itself is Stream D.1.b follow-up.

The sp_PublishGeneration SQL-side hook (where the publish commit itself
could also write to a shared sealed cache) stays deferred — the per-node
seal pattern shipped here is the correct v2 GA model: each Server node
owns its own on-disk cache and refreshes from its own DB reads, matching
the Phase 6.1 scope-table description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:14:59 -04:00
fc7e18c7f5 Merge pull request (#95) - Readiness doc blocker1 closed 2026-04-19 11:06:28 -04:00
Joseph Doherty
ba42967943 v2 release-readiness — blocker #1 closed; doc reflects state
PR #94 closed the Phase 6.2 dispatch wiring blocker. Update the dashboard:
- Status line: "two of three release blockers remain".
- Release-blocker #1 section struck through + marked CLOSED with PR link.
  Remaining Stream C surfaces (Browse / Subscribe / Alarm / Call + finer-
  grained scope resolution) downgraded to hardening follow-ups — not
  release-blocking.
- Change log: new dated entry.

Two remaining blockers: Phase 6.1 Stream D config-cache wiring (task #136)
+ Phase 6.3 Streams A/C/F redundancy runtime (tasks #145, #147, #150).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:04:30 -04:00
b912969805 Merge pull request (#94) - Phase 6.2 Stream C follow-up dispatch wiring 2026-04-19 11:04:20 -04:00
Joseph Doherty
f8d5b0fdbb Phase 6.2 Stream C follow-up — wire AuthorizationGate into DriverNodeManager Read / Write / HistoryRead dispatch
Closes the Phase 6.2 security gap the v2 release-readiness dashboard flagged:
the evaluator + trie + gate shipped as code in PRs #84-88 but no dispatch
path called them. This PR threads the gate end-to-end from
OpcUaApplicationHost → OtOpcUaServer → DriverNodeManager and calls it on
every Read / Write / 4 HistoryRead paths.

Server.Security additions:
- NodeScopeResolver — maps driver fullRef → Core.Authorization NodeScope.
  Phase 1 shape: populates ClusterId + TagId; leaves NamespaceId / UnsArea /
  UnsLine / Equipment null. The cluster-level ACL cascade covers this
  configuration (decision #129 additive grants). Finer-grained scope
  resolution (joining against the live Configuration DB for UnsArea / UnsLine
  path) lands as Stream C.12 follow-up.
- WriteAuthzPolicy.ToOpcUaOperation — maps SecurityClassification → the
  OpcUaOperation the gate evaluator consults (Operate/SecuredWrite →
  WriteOperate; Tune → WriteTune; Configure/VerifiedWrite → WriteConfigure).

DriverNodeManager wiring:
- Ctor gains optional AuthorizationGate + NodeScopeResolver; both null means
  the pre-Phase-6.2 dispatch runs unchanged (backwards-compat for every
  integration test that constructs DriverNodeManager directly).
- OnReadValue: ahead of the invoker call, builds NodeScope + calls
  gate.IsAllowed(identity, Read, scope). Denied reads return
  BadUserAccessDenied without hitting the driver.
- OnWriteValue: preserves the existing WriteAuthzPolicy check (classification
  vs session roles) + adds an additive gate check using
  WriteAuthzPolicy.ToOpcUaOperation(classification) to pick the right
  WriteOperate/Tune/Configure surface. Lax mode falls through for identities
  without LDAP groups.
- Four HistoryRead paths (Raw / Processed / AtTime / Events): gate check
  runs per-node before the invoker. Events path tolerates fullRef=null
  (event-history queries can target a notifier / driver-root; those are
  cluster-wide reads that need a different scope shape — deferred).
- New WriteAccessDenied helper surfaces BadUserAccessDenied in the
  OpcHistoryReadResult slot + errors list, matching the shape of the
  existing WriteUnsupported / WriteInternalError helpers.

OtOpcUaServer + OpcUaApplicationHost: gate + resolver thread through as
optional constructor parameters (same pattern as DriverResiliencePipelineBuilder
in Phase 6.1). Null defaults keep the existing 3 OpcUaApplicationHost
integration tests constructing without them unchanged.

Tests (5 new in NodeScopeResolverTests):
- Resolve populates ClusterId + TagId + Equipment Kind.
- Resolve leaves finer path null per Phase 1 shape (doc'd as follow-up).
- Empty fullReference throws.
- Empty clusterId throws at ctor.
- Resolver is stateless across calls.

The existing 9 AuthorizationGate tests (shipped in PR #86) continue to
cover the gate's allow/deny semantics under strict + lax mode.

Full solution dotnet test: 1164 passing (was 1159, +5). Pre-existing
Client.CLI Subscribe flake unchanged. Existing OpcUaApplicationHost +
HealthEndpointsHost + driver integration tests continue to pass because the
gate defaults to null → no enforcement, and the lax-mode fallback returns
true for identities without LDAP groups (the anonymous test path).

Production deployments flip the gate on by constructing it via
OpcUaApplicationHost's new authzGate parameter + setting
`Authorization:StrictMode = true` once ACL data is populated. Flipping the
switch post-seed turns the evaluator + trie from scaffolded code into
actual enforcement.

This closes release blocker #1 listed in docs/v2/v2-release-readiness.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 11:02:17 -04:00
cc069509cd Merge pull request (#93) - v2 release-readiness capstone 2026-04-19 10:34:17 -04:00
Joseph Doherty
3b2d0474a7 v2 release-readiness capstone — aggregate compliance runner + release-readiness dashboard
Closes out Phase 6 with the two pieces a release engineer needs before
tagging v2 GA:

1. scripts/compliance/phase-6-all.ps1 — meta-runner that invokes every
   per-phase Phase 6.N compliance script in sequence + aggregates results.
   Each sub-script runs in its own powershell.exe child process so per-script
   $ErrorActionPreference + exit semantics can't interfere with the parent.
   Exit 0 = every phase passes; exit 1 = one or more phases failed. Prints a
   PASS/FAIL summary matrix at the end.

2. docs/v2/v2-release-readiness.md — single-view dashboard of everything
   shipped + everything still deferred + release exit criteria. Called out
   explicitly:
   - Three release BLOCKERS (must close before v2 GA):
     * Phase 6.2 Stream C dispatch wiring — AuthorizationGate exists but no
       DriverNodeManager Read/Write/etc. path calls it (task #143).
     * Phase 6.1 Stream D follow-up — ResilientConfigReader + sealed-cache
       hook not yet consumed by any read path (task #136).
     * Phase 6.3 Streams A/C/F — coordinator + UA-node wiring + client
       interop still deferred (tasks #145, #147, #150).
   - Three nice-to-haves (not release-blocking) — Admin UI polish, background
     services, multi-host dispatch.
   - Release exit criteria: all 4 compliance scripts exit 0, dotnet test ≤ 1
     known flake, blockers closed or v2.1-deferred with written decision,
     Fleet Admin signoff on deployment checklist, live-Galaxy smoke test,
     OPC UA CTT pass, redundancy cutover validated with at least one
     production client.
   - Change log at the bottom so future ships of deferred follow-ups just
     append dates + close out dashboard rows.

Meta-runner verified locally:
  Phase 6.1 — PASS
  Phase 6.2 — PASS
  Phase 6.3 — PASS
  Phase 6.4 — PASS
  Aggregate: PASS (elapsed 340 s — most of that is the full solution
  `dotnet test` each phase runs).

Net counts at capstone time: 906 baseline → 1159 passing across Phase 6
(+253). 15 deferred follow-up tasks tracked with IDs (#134-137, #143-144,
#145, #147, #149-150, #153, #155-157). v2 is NOT YET release-ready —
capstone makes that explicit rather than letting the "shipped" label on
each phase imply full readiness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:32:21 -04:00
e1d38ecc66 Merge pull request (#92) - Phase 6.4 exit gate 2026-04-19 10:15:46 -04:00
Joseph Doherty
99cf1197c5 Phase 6.4 exit gate — compliance real-checks + phase doc = SHIPPED (data layer)
scripts/compliance/phase-6-4-compliance.ps1 turns stub TODOs into 11 real
checks covering:
- Stream A data layer: UnsImpactAnalyzer + DraftRevisionToken + cross-cluster
  rejection (decision #82) + all three move kinds (LineMove / AreaRename /
  LineMerge).
- Stream B data layer: EquipmentCsvImporter + version marker
  '# OtOpcUaCsv v1' + decision-#117 required columns + decision-#139
  optional columns including DeviceManualUri + duplicate-ZTag rejection +
  unknown-column rejection.

Four [DEFERRED] surfaces tracked explicitly with task IDs:
  - Stream A UI drag/drop (task #153)
  - Stream B staging + finalize + UI (task #155)
  - Stream C DiffViewer refactor (task #156)
  - Stream D OPC 40010 Identification sub-folder + Razor component (task #157)

Cross-cutting: full solution dotnet test passes 1159 >= 1137 pre-Phase-6.4
baseline; pre-existing Client.CLI Subscribe flake tolerated.

docs/v2/implementation/phase-6-4-admin-ui-completion.md status updated from
DRAFT to SHIPPED (data layer). Four Blazor / SignalR / EF / address-space
follow-ups tracked as tasks — the visual-compliance review pattern from
Phase 6.1 Stream E applies to each.

`Phase 6.4 compliance: PASS` — exit 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:13:46 -04:00
ad39f866e5 Merge pull request (#91) - Phase 6.4 Stream A + B data layer 2026-04-19 10:11:44 -04:00
Joseph Doherty
560a961cca Phase 6.4 Stream A + B data layer — UnsImpactAnalyzer + EquipmentCsvImporter (parser)
Ships the pure-logic data layer of Phase 6.4. Blazor UI pieces
(UnsTab drag/drop, CSV import modal, preview table, FinaliseImportBatch txn,
staging tables) are deferred to visual-compliance follow-ups (tasks #153,
#155, #157).

Admin.Services additions:

- UnsImpactAnalyzer.Analyze(snapshot, move) — pure-function, no I/O. Three
  move variants: LineMove, AreaRename, LineMerge. Returns UnsImpactPreview
  with AffectedEquipmentCount + AffectedTagCount + CascadeWarnings +
  RevisionToken + HumanReadableSummary the Admin UI shows in the confirm
  modal. Cross-cluster moves rejected with CrossClusterMoveRejectedException
  per decision #82. Missing source/target throws UnsMoveValidationException.
  Surfaces sibling-line same-name ambiguity as a cascade warning.
- DraftRevisionToken — opaque revision fingerprint. Preview captures the
  token; Confirm compares it. The 409-concurrent-edit UX plumbs through on
  the Razor-page follow-up (task #153). Matches(other) is null-safe.
- UnsTreeSnapshot + UnsAreaSummary + UnsLineSummary — snapshot shape the
  caller hands to the analyzer. Tests build them in-memory without a DB.

- EquipmentCsvImporter.Parse(csvText) — RFC 4180 CSV parser per decision #95.
  Version-marker contract: line 1 must be "# OtOpcUaCsv v1" (future shapes
  bump the version). Required columns from decision #117 + optional columns
  from decision #139. Rejects unknown columns, duplicate column names,
  blank required fields, duplicate ZTags within the file. Quoted-field
  handling supports embedded commas + escaped "" quotes. Returns
  EquipmentCsvParseResult { AcceptedRows, RejectedRows } so the preview
  modal renders accept/reject counts without re-parsing.

Tests (22 new, all pass):

- UnsImpactAnalyzerTests (9): line move counts equipment + tags; cross-
  cluster throws; unknown source/target throws validation; ambiguous same-
  name target raises warning; area rename sums across lines; line merge
  cross-area warns; same-area merge no warning; DraftRevisionToken matches
  semantics.
- EquipmentCsvImporterTests (13): empty file throws; missing version marker;
  missing required column; unknown column; duplicate column; valid single
  row round-trips; optional columns populate when present; blank required
  field rejects row; duplicate ZTag rejects second; RFC 4180 quoted fields
  with commas + escaped quotes; mismatched column count rejects; blank
  lines between rows ignored; required + optional column constants match
  decisions #117 + #139 exactly.

Full solution dotnet test: 1159 passing (Phase 6.3 = 1137, Phase 6.4 A+B
data = +22). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:09:47 -04:00
4901b78e9a Merge pull request (#90) - Phase 6.3 exit gate 2026-04-19 10:02:25 -04:00
Joseph Doherty
2fe4bac508 Phase 6.3 exit gate — compliance real-checks + phase doc = SHIPPED (core)
scripts/compliance/phase-6-3-compliance.ps1 turns stub TODOs into 21 real
checks covering:
- Stream B 8-state matrix: ServiceLevelCalculator + ServiceLevelBand present;
  Maintenance=0, NoData=1, InvalidTopology=2, AuthoritativePrimary=255,
  IsolatedPrimary=230, PrimaryMidApply=200, RecoveringPrimary=180,
  AuthoritativeBackup=100, IsolatedBackup=80, BackupMidApply=50,
  RecoveringBackup=30 — every numeric band pattern-matched in source (any
  drift turns a check red).
- Stream B RecoveryStateManager with dwell + publish-witness gate + 60s
  default dwell.
- Stream D ApplyLeaseRegistry: BeginApplyLease returns IAsyncDisposable;
  key includes PublishRequestId (decision #162); PruneStale watchdog present;
  10 min default ApplyMaxDuration.

Five [DEFERRED] follow-up surfaces explicitly listed with task IDs:
  - Stream A topology loader (task #145)
  - Stream C OPC UA node wiring (task #147)
  - Stream E Admin UI (task #149)
  - Stream F interop + Galaxy failover (task #150)
  - sp_PublishGeneration Transparent-mode rejection (task #148 part 2)

Cross-cutting: full solution dotnet test passes 1137 >= 1097 pre-Phase-6.3
baseline; pre-existing Client.CLI Subscribe flake tolerated.

docs/v2/implementation/phase-6-3-redundancy-runtime.md status updated from
DRAFT to SHIPPED (core). Non-transparent redundancy per decision #84 keeps
role election out of scope — operator-driven failover is the v2.0 model.

`Phase 6.3 compliance: PASS` — exit 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 10:00:30 -04:00
eb3625b327 Merge pull request (#89) - Phase 6.3 Stream B + D core 2026-04-19 09:58:33 -04:00
Joseph Doherty
483f55557c Phase 6.3 Stream B + Stream D (core) — ServiceLevelCalculator + RecoveryStateManager + ApplyLeaseRegistry
Lands the pure-logic heart of Phase 6.3. OPC UA node wiring (Stream C),
RedundancyCoordinator topology loader (Stream A), Admin UI + metrics (Stream E),
and client interop tests (Stream F) are follow-up work — tracked as
tasks #145-150.

New Server.Redundancy sub-namespace:

- ServiceLevelCalculator — pure 8-state matrix per decision #154. Inputs:
  role, selfHealthy, peerUa/HttpHealthy, applyInProgress, recoveryDwellMet,
  topologyValid, operatorMaintenance. Output: OPC UA Part 5 §6.3.34 Byte.
  Reserved bands (0=Maintenance, 1=NoData, 2=InvalidTopology) override
  everything; operational bands occupy 30..255.
  Key invariants:
    * Authoritative-Primary = 255, Authoritative-Backup = 100.
    * Isolated-Primary = 230 (retains authority with peer down).
    * Isolated-Backup = 80 (does NOT auto-promote — non-transparent model).
    * Primary-Mid-Apply = 200, Backup-Mid-Apply = 50; apply dominates
      peer-unreachable per Stream C.4 integration expectation.
    * Recovering-Primary = 180, Recovering-Backup = 30.
    * Standalone treats healthy as Authoritative-Primary (no peer concept).
- ServiceLevelBand enum — labels every numeric band for logs + Admin UI.
  Values match the calculator table exactly; compliance script asserts
  drift detection.
- RecoveryStateManager — holds Recovering band until (dwell ≥ 60s default)
  AND (one publish witness observed). Re-fault resets both gates so a
  flapping node doesn't shortcut through recovery twice.
- ApplyLeaseRegistry — keyed on (ConfigGenerationId, PublishRequestId) per
  decision #162. BeginApplyLease returns an IAsyncDisposable so every exit
  path (success, exception, cancellation, dispose-twice) closes the lease.
  ApplyMaxDuration watchdog (10 min default) via PruneStale tick forces
  close after a crashed publisher so ServiceLevel can't stick at mid-apply.

Tests (40 new, all pass):
- ServiceLevelCalculatorTests (27): reserved bands override; self-unhealthy
  → NoData; invalid topology demotes both nodes to 2; authoritative primary
  255; backup 100; isolated primary 230 retains authority; isolated backup
  80 does not promote; http-only unreachable triggers isolated; mid-apply
  primary 200; mid-apply backup 50; apply dominates peer-unreachable; recovering
  primary 180; recovering backup 30; standalone treats healthy as 255;
  classify round-trips every band including Unknown sentinel.
- RecoveryStateManagerTests (6): never-faulted auto-meets dwell; faulted-only
  returns true (semantics-doc test — coordinator short-circuits on
  selfHealthy=false); recovered without witness never meets; witness without
  dwell never meets; witness + dwell-elapsed meets; re-fault resets.
- ApplyLeaseRegistryTests (7): empty registry not-in-progress; begin+dispose
  closes; dispose on exception still closes; dispose twice safe; concurrent
  leases isolated; watchdog closes stale; watchdog leaves recent alone.

Full solution dotnet test: 1137 passing (Phase 6.2 shipped at 1097, Phase 6.3
B + D core = +40 = 1137). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:56:34 -04:00
d269dcaa1b Merge pull request (#88) - Phase 6.2 exit gate 2026-04-19 09:47:58 -04:00
Joseph Doherty
bd53ebd192 Phase 6.2 exit gate — compliance script real-checks + phase doc = SHIPPED (core)
scripts/compliance/phase-6-2-compliance.ps1 replaces the stub TODOs with 23
real checks spanning:
- Stream A: LdapGroupRoleMapping entity + AdminRole enum + ILdapGroupRoleMappingService
  + impl + write-time invariant + EF migration all present.
- Stream B: OpcUaOperation enum + NodeScope + AuthorizationDecision tri-state
  + IPermissionEvaluator + PermissionTrie + Builder + Cache keyed on
  GenerationId + UserAuthorizationState with MembershipFreshnessInterval=15m
  and AuthCacheMaxStaleness=5m + TriePermissionEvaluator + HistoryRead uses
  its own flag.
- Control/data-plane separation: the evaluator + trie + cache + builder +
  interface all have zero references to LdapGroupRoleMapping (decision #150).
- Stream C foundation: ILdapGroupsBearer + AuthorizationGate with StrictMode
  knob. DriverNodeManager dispatch-path wiring (11 surfaces) is Deferred,
  tracked as task #143.
- Stream D data layer: ValidatedNodeAclAuthoringService + exception type +
  rejects None permissions. Blazor UI pieces (RoleGrantsTab, AclsTab,
  SignalR invalidation, draft diff) are Deferred, tracked as task #144.
- Cross-cutting: full solution dotnet test runs; 1097 >= 1042 baseline;
  tolerates the one pre-existing Client.CLI Subscribe flake.

IPermissionEvaluator doc-comment reworded to avoid mentioning the literal
type name "LdapGroupRoleMapping" — the compliance check does a text-absence
sweep for that identifier across the data-plane files.

docs/v2/implementation/phase-6-2-authorization-runtime.md status updated from
DRAFT to SHIPPED (core). Two deferred follow-ups explicitly called out so
operators see what's still pending for the "Phase 6.2 fully wired end-to-end"
milestone.

`Phase 6.2 compliance: PASS` — exit 0. Any regression that deletes a class
or re-introduces an LdapGroupRoleMapping reference into the data-plane
evaluator turns a green check red + exit non-zero.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:45:58 -04:00
565032cf71 Merge pull request (#87) - Phase 6.2 Stream D data layer 2026-04-19 09:41:02 -04:00
Joseph Doherty
3b8280f08a Phase 6.2 Stream D (data layer) — ValidatedNodeAclAuthoringService with write-time invariants
Ships the non-UI piece of Stream D: a draft-aware write surface over NodeAcl
that enforces the Phase 6.2 plan's scope-uniqueness + grant-shape invariants.
Blazor UI pieces (RoleGrantsTab + AclsTab refresh + SignalR invalidation +
visual-compliance reviewer signoff) are deferred to the Phase 6.1-style
follow-up task.

Admin.Services:
- ValidatedNodeAclAuthoringService — alongside existing NodeAclService (raw
  CRUD, kept for read + revoke paths). GrantAsync enforces:
    * Permissions != None (decision #129 — additive only, no empty grants).
    * Cluster scope has null ScopeId.
    * Sub-cluster scope requires a populated ScopeId.
    * No duplicate (GenerationId, ClusterId, LdapGroup, ScopeKind, ScopeId)
      tuple — operator updates the row instead of inserting a duplicate.
  UpdatePermissionsAsync also rejects None (operator revokes via NodeAclService).
  Violations throw InvalidNodeAclGrantException.

Tests (10 new in Admin.Tests/ValidatedNodeAclAuthoringServiceTests):
- Grant rejects None permissions.
- Grant rejects Cluster-scope with ScopeId / sub-cluster without ScopeId.
- Grant succeeds on well-formed row.
- Grant rejects duplicate (group, scope) in same draft.
- Grant allows same group at different scope.
- Grant allows same (group, scope) in different draft.
- UpdatePermissions rejects None.
- UpdatePermissions round-trips new flags + notes.
- UpdatePermissions on unknown rowid throws.

Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Admin.Tests csproj.

Full solution dotnet test: 1097 passing (was 1087, +10). Phase 6.2 total is
now 1087+10 = 1097; baseline 906 → +191 net across Phase 6.1 (all streams) +
Phase 6.2 (Streams A, B, C foundation, D data layer).

Stream D follow-up task tracks: RoleGrantsTab CRUD over LdapGroupRoleMapping,
AclsTab write-through + Probe-this-permission diagnostic, draft-diff ACL
section, SignalR PermissionTrieCache invalidation push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:39:06 -04:00
70f3ec0092 Merge pull request (#86) - Phase 6.2 Stream C foundation 2026-04-19 09:35:48 -04:00
Joseph Doherty
8efb99b6be Phase 6.2 Stream C (foundation) — AuthorizationGate + ILdapGroupsBearer
Lands the integration seam between the Server project's OPC UA stack and the
Core.Authorization evaluator. Actual DriverNodeManager dispatch-path wiring
(Read/Write/HistoryRead/Browse/Call/Subscribe/Alarm surfaces) lands in the
follow-up PR on this branch — covered by Task #143 below.

Server.Security additions:
- ILdapGroupsBearer — marker interface a custom IUserIdentity implements to
  expose its resolved LDAP group DNs. Parallel to the existing IRoleBearer
  (admin roles) — control/data-plane separation per decision #150.
- AuthorizationGate — stateless bridge between Opc.Ua.IUserIdentity and
  IPermissionEvaluator. IsAllowed(identity, operation, scope) materializes a
  UserAuthorizationState from the identity's LDAP groups, delegates to the
  evaluator, and returns a single bool the dispatch paths use to decide
  whether to surface BadUserAccessDenied.
- StrictMode knob controls fail-open-during-transition vs fail-closed:
  - strict=false (default during rollout) — null identity, identity without
    ILdapGroupsBearer, or NotGranted outcome all return true so older
    deployments without ACL data keep working.
  - strict=true (production target) — any of the above returns false.
  The appsetting `Authorization:StrictMode = true` flips deployments over
  once their ACL data is populated.

Tests (9 new in Server.Tests/AuthorizationGateTests):
- Null identity — strict denies, lax allows.
- Identity without LDAP groups — strict denies, lax allows.
- LDAP group with matching grant allows.
- LDAP group without grant — strict denies.
- Wrong operation denied (Read-only grant, WriteOperate requested).
- BuildSessionState returns materialized state with LDAP groups + null when
  identity doesn't carry them.

Full solution dotnet test: 1087 passing (Phase 6.1 = 1042, Phase 6.2 A = +9,
B = +27, C foundation = +9 = 1087). Pre-existing Client.CLI Subscribe flake
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:33:51 -04:00
f74e141e64 Merge pull request (#85) - Phase 6.2 Stream B 2026-04-19 09:29:51 -04:00
Joseph Doherty
40fb459040 Phase 6.2 Stream B — permission-trie evaluator in Core.Authorization
Ships Stream B.1-B.6 — the data-plane authorization engine Phase 6.2 runs on.
Integration into OPC UA dispatch (Stream C — Read / Write / HistoryRead /
Subscribe / Browse / Call etc.) is the next PR on this branch.

New Core.Abstractions:
- OpcUaOperation enum enumerates every OPC UA surface the evaluator gates:
  Browse, Read, WriteOperate/Tune/Configure (split by SecurityClassification),
  HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions,
  Call, AlarmAcknowledge/Confirm/Shelve. Stream C maps each one back to its
  dispatch call site.

New Core.Authorization namespace:
- NodeScope record + NodeHierarchyKind — 6-level scope addressing for
  Equipment-kind (UNS) namespaces, folder-segment walk for SystemPlatform-kind
  (Galaxy). NodeScope carries a Kind selector so the evaluator knows which
  hierarchy to descend.
- AuthorizationDecision { Verdict, Provenance } + AuthorizationVerdict
  {Allow, NotGranted, Denied} + MatchedGrant. Tri-state per decision #149;
  Phase 6.2 only produces Allow + NotGranted, Denied stays reserved for v2.1
  Explicit Deny without API break.
- IPermissionEvaluator.Authorize(session, operation, scope).
- PermissionTrie + PermissionTrieNode + TrieGrant. In-memory trie keyed on
  the ACL scope hierarchy. CollectMatches walks Cluster → Namespace →
  UnsArea → UnsLine → Equipment → Tag (or → FolderSegment(s) → Tag on
  Galaxy). Pure additive union — matches that share an LDAP group with the
  session contribute flags; OR across levels.
- PermissionTrieBuilder static factory. Build(clusterId, generationId, rows,
  scopePaths?) returns a trie for one generation. Cross-cluster rows are
  filtered out so the trie is cluster-coherent. Stream C follow-up wires a
  real scopePaths lookup from the live DB; tests supply hand-built paths.
- PermissionTrieCache — process-singleton, keyed on (ClusterId, GenerationId).
  Install(trie) adds a generation + promotes to "current" when the id is
  highest-known (handles out-of-order installs gracefully). Prior generations
  retained so an in-flight request against a prior trie still succeeds; GC
  via Prune(cluster, keepLatest).
- UserAuthorizationState — per-session cache of resolved LDAP groups +
  AuthGenerationId + MembershipVersion + MembershipResolvedUtc. Bounded by
  MembershipFreshnessInterval (default 15 min per decision #151) +
  AuthCacheMaxStaleness (default 5 min per decision #152).
- TriePermissionEvaluator — default IPermissionEvaluator. Fails closed on
  stale sessions (IsStale check short-circuits to NotGranted), on cross-
  cluster requests, on empty trie cache. Maps OpcUaOperation → NodePermissions
  via MapOperationToPermission (total — every enum value has a mapping; tested).

Tests (27 new, all pass):
- PermissionTrieTests (7): cluster-level grant cascades to every tag;
  equipment-level grant doesn't leak to sibling equipment; multi-group union
  ORs flags; no-matching-group returns empty; Galaxy folder-segment grant
  doesn't leak to sibling folder; cross-cluster rows don't land in this
  cluster's trie; build is idempotent (B.6 invariants).
- TriePermissionEvaluatorTests (8): allow when flag matches; NotGranted when
  no matching group; NotGranted when flags insufficient; HistoryRead requires
  its own bit (decision-level requirement); cross-cluster session denied;
  stale session fails closed; no cached trie denied; MapOperationToPermission
  is total across every OpcUaOperation.
- PermissionTrieCacheTests (8): empty cache returns null; install-then-get
  round-trips; new generation becomes current; out-of-order install doesn't
  downgrade current; invalidate drops one cluster; prune retains most recent;
  prune no-op when fewer than keep; cluster isolation.
- UserAuthorizationStateTests (4): fresh is not stale; IsStale after 5 min
  default; NeedsRefresh true between freshness + staleness windows.

Full solution dotnet test: 1078 passing (baseline 906, Phase 6.1 = 1042,
Phase 6.2 Stream A = +9, Stream B = +27 = 1078). Pre-existing Client.CLI
Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:27:44 -04:00
13a231b7ad Merge pull request (#84) - Phase 6.2 Stream A 2026-04-19 09:20:05 -04:00
Joseph Doherty
0fcdfc7546 Phase 6.2 Stream A — LdapGroupRoleMapping entity + EF migration + CRUD service
Stream A.1-A.2 per docs/v2/implementation/phase-6-2-authorization-runtime.md.
Seed-data migration (A.3) is a separate follow-up once production LDAP group
DNs are finalised; until then CRUD via the Admin UI handles the fleet set up.

Configuration:
- New AdminRole enum {ConfigViewer, ConfigEditor, FleetAdmin} — string-stored.
- New LdapGroupRoleMapping entity with Id (surrogate PK), LdapGroup (512 chars),
  Role (AdminRole enum), ClusterId (nullable, FK to ServerCluster), IsSystemWide,
  CreatedAtUtc, Notes.
- EF config: UX_LdapGroupRoleMapping_Group_Cluster unique index on
  (LdapGroup, ClusterId) + IX_LdapGroupRoleMapping_Group hot-path index on
  LdapGroup for sign-in lookups. Cluster FK cascades on cluster delete.
- Migration 20260419_..._AddLdapGroupRoleMapping generated via `dotnet ef`.

Configuration.Services:
- ILdapGroupRoleMappingService — CRUD surface. Declared as control-plane only
  per decision #150; the OPC UA data-path evaluator must NOT depend on this
  interface (Phase 6.2 compliance check on control/data-plane separation).
  GetByGroupsAsync is the hot-path sign-in lookup.
- LdapGroupRoleMappingService (EF Core impl) enforces the write-time invariant
  "exactly one of (ClusterId populated, IsSystemWide=true)" and surfaces
  InvalidLdapGroupRoleMappingException on violation. Create auto-populates Id
  + CreatedAtUtc when omitted.

Tests (9 new, all pass) in Configuration.Tests:
- Create sets Id + CreatedAtUtc.
- Create rejects empty LdapGroup.
- Create rejects IsSystemWide=true with populated ClusterId.
- Create rejects IsSystemWide=false with null ClusterId.
- GetByGroupsAsync returns matching rows only.
- GetByGroupsAsync with empty input returns empty (no full-table scan).
- ListAllAsync orders by group then cluster.
- Delete removes the target row.
- Delete of unknown id is a no-op.

Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Configuration.Tests for
the service-level tests (schema-compliance tests still use the live SQL
fixture).

SchemaComplianceTests updated to expect the new LdapGroupRoleMapping table.

Full solution dotnet test: 1051 passing (baseline 906, Phase 6.1 shipped at
1042, Phase 6.2 Stream A adds 9 = 1051). Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:18:06 -04:00
1650c6c550 Merge pull request (#83) - Phase 6.1 exit gate 2026-04-19 08:55:47 -04:00
Joseph Doherty
f29043c66a Phase 6.1 exit gate — compliance script real-checks + phase doc status = SHIPPED
scripts/compliance/phase-6-1-compliance.ps1 replaces the stub TODOs with 34
real checks covering:
- Stream A: pipeline builder + CapabilityInvoker + WriteIdempotentAttribute
  present; pipeline key includes HostName (per-device isolation per decision
  #144); OnReadValue / OnWriteValue / HistoryRead route through invoker in
  DriverNodeManager; Galaxy supervisor CircuitBreaker + Backoff preserved.
- Stream B: DriverTier enum; DriverTypeMetadata requires Tier; MemoryTracking
  + MemoryRecycle (Tier C-gated) + ScheduledRecycleScheduler (rejects Tier
  A/B) + demand-aware WedgeDetector all present.
- Stream C: DriverHealthReport + HealthEndpointsHost; state matrix Healthy=200
  / Faulted=503 asserted in code; LogContextEnricher; JSON sink opt-in via
  Serilog:WriteJson.
- Stream D: GenerationSealedCache + ReadOnly marking + GenerationCacheUnavailable
  exception path; ResilientConfigReader + StaleConfigFlag.
- Stream E data layer: DriverInstanceResilienceStatus entity +
  DriverResilienceStatusTracker. SignalR/Blazor surface is Deferred per the
  visual-compliance follow-up pattern borrowed from Phase 6.4.
- Cross-cutting: full solution `dotnet test` runs; asserts 1042 >= 906
  baseline; tolerates the one pre-existing Client.CLI Subscribe flake and
  flags any new failure.

Running the script locally returns "Phase 6.1 compliance: PASS" — exit 0. Any
future regression that deletes a class or un-wires a dispatch path turns a
green check red + exit non-zero.

docs/v2/implementation/phase-6-1-resilience-and-observability.md status
updated from DRAFT to SHIPPED with the merged-PRs summary + test count delta +
the single deferred follow-up (visual review of the Admin /hosts columns).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:53:47 -04:00
a7f34a4301 Merge pull request (#82) - Phase 6.1 Stream E data layer 2026-04-19 08:49:43 -04:00
Joseph Doherty
cbcaf6593a Phase 6.1 Stream E (data layer) — DriverInstanceResilienceStatus entity + DriverResilienceStatusTracker + EF migration
Ships the data + runtime layer of Stream E. The SignalR hub and Blazor /hosts
page refresh (E.2-E.3) are follow-up work paired with the visual-compliance
review per Phase 6.4 patterns — documented as a deferred follow-up below.

Configuration:
- New entity DriverInstanceResilienceStatus with:
  DriverInstanceId, HostName (composite PK),
  LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth,
  LastRecycleUtc, BaselineFootprintBytes, CurrentFootprintBytes,
  LastSampledUtc.
- Separate from DriverHostStatus (per-host connectivity view) so a Running
  host that has tripped its breaker or is nearing its memory ceiling shows up
  distinctly on Admin /hosts. Admin page left-joins both for display.
- OtOpcUaConfigDbContext + Fluent-API config + IX_DriverResilience_LastSampled
  index for the stale-sample filter query.
- EF migration: 20260419124034_AddDriverInstanceResilienceStatus.

Core.Resilience:
- DriverResilienceStatusTracker — process-singleton in-memory tracker keyed on
  (DriverInstanceId, HostName). CapabilityInvoker + MemoryTracking +
  MemoryRecycle callers record failure/success/breaker-open/recycle/footprint
  events; a HostedService (Stream E.2 follow-up) samples this tracker every
  5 s and persists to the DB. Pure in-memory keeps tests fast + the core
  free of EF/SQL dependencies.

Tests:
- DriverResilienceStatusTrackerTests (9 new, all pass): tryget-before-write
  returns null; failures accumulate; success resets; breaker/recycle/footprint
  fields populate; per-host isolation; snapshot returns all pairs; concurrent
  writes don't lose counts.
- SchemaComplianceTests: expected-tables list updated to include the new
  DriverInstanceResilienceStatus table.

Full solution dotnet test: 1042 passing (baseline 906, +136 for Phase 6.1 so
far across Streams A/B/C/D/E.1). Pre-existing Client.CLI Subscribe flake
unchanged.

Deferred to follow-up PR (E.2/E.3):
- ResilienceStatusPublisher HostedService that samples DriverResilienceStatusTracker
  every 5 s + upserts DriverInstanceResilienceStatus rows.
- Admin FleetStatusHub SignalR hub pushing LastCircuitBreakerOpenUtc /
  CurrentBulkheadDepth / LastRecycleUtc on change.
- Admin /hosts Blazor column additions (red badge when
  ConsecutiveFailures > breakerThreshold / 2). Visual-compliance reviewer
  signoff alongside Phase 6.4 admin-ui patterns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:47:43 -04:00
8d81715079 Merge pull request (#81) - Phase 6.1 Stream D 2026-04-19 08:35:33 -04:00
Joseph Doherty
854c3bcfec Phase 6.1 Stream D — LiteDB generation-sealed config cache + ResilientConfigReader + UsingStaleConfig flag
Closes Stream D per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

New Configuration.LocalCache types (alongside the existing single-file
LiteDbConfigCache):

- GenerationSealedCache — file-per-generation sealed snapshots per decision
  #148. Each SealAsync writes <cache-root>/<clusterId>/<generationId>.db as a
  read-only LiteDB file, then atomically publishes the CURRENT pointer via
  temp-file + File.Replace. Prior-generation files stay on disk for audit.
  Mixed-generation reads are structurally impossible: ReadCurrentAsync opens
  the single file named by CURRENT. Corruption of the pointer or the sealed
  file raises GenerationCacheUnavailableException — fails closed, never falls
  back silently to an older generation. TryGetCurrentGenerationId returns the
  pointer value or null for diagnostics.

- StaleConfigFlag — thread-safe (Volatile.Read/Write) bool. MarkStale when a
  read fell back to the cache; MarkFresh when a central-DB read succeeded.
  Surfaced on /healthz body and Admin /hosts (Stream C wiring already in
  place).

- ResilientConfigReader — wraps a central-DB fetch function with the Stream
  D.2 pipeline: timeout 2 s → retry N× jittered (skipped when retryCount=0) →
  fallback to the sealed cache. Toggles StaleConfigFlag per outcome. Read path
  only — the write path is expected to bypass this wrapper and fail hard on
  DB outage so inconsistent writes never land. Cancellation passes through
  and is NOT retried.

Configuration.csproj:
- Polly.Core 8.6.6 + Microsoft.Extensions.Logging.Abstractions added.

Tests (17 new, all pass):
- GenerationSealedCacheTests (10): first-boot-no-snapshot throws
  GenerationCacheUnavailableException (D.4 scenario C), seal-then-read round
  trip, sealed file is ReadOnly on disk, pointer advances to latest, prior
  generation file preserved, corrupt sealed file fails closed, missing sealed
  file fails closed, corrupt pointer fails closed (D.4 scenario B), same
  generation sealed twice is idempotent, independent clusters don't
  interfere.
- ResilientConfigReaderTests (4): central-DB success returns value + marks
  fresh; central-DB failure exhausts retries + falls back to cache + marks
  stale (D.4 scenario A); central-DB + cache both unavailable throws;
  cancellation not retried.
- StaleConfigFlagTests (3): default is fresh; toggles; concurrent writes
  converge.

Full solution dotnet test: 1033 passing (baseline 906, +127 net across Phase
6.1 Streams A/B/C/D). Pre-existing Client.CLI Subscribe flake unchanged.

Integration into Configuration read paths (DriverInstance enumeration,
LdapGroupRoleMapping fetches, etc.) + the sp_PublishGeneration hook that
writes sealed files lands in the Phase 6.1 Stream E / Admin-refresh PR where
the DB integration surfaces are already touched. Existing LiteDbConfigCache
continues serving its single-file role for the NodeBootstrap path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:33:32 -04:00
ff4a74a81f Merge pull request (#80) - Phase 6.1 Stream C 2026-04-19 08:17:49 -04:00
Joseph Doherty
9dd5e4e745 Phase 6.1 Stream C — health endpoints on :4841 + LogContextEnricher + Serilog JSON sink + CapabilityInvoker enrichment
Closes Stream C per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

Core.Observability (new namespace):
- DriverHealthReport — pure-function aggregation over DriverHealthSnapshot list.
  Empty fleet = Healthy. Any Faulted = Faulted. Any Unknown/Initializing (no
  Faulted) = NotReady. Any Degraded or Reconnecting (no Faulted, no NotReady)
  = Degraded. Else Healthy. HttpStatus(verdict) maps to the Stream C.1 state
  matrix: Healthy/Degraded → 200, NotReady/Faulted → 503.
- LogContextEnricher — Serilog LogContext wrapper. Push(id, type, capability,
  correlationId) returns an IDisposable scope; inner log calls carry
  DriverInstanceId / DriverType / CapabilityName / CorrelationId structured
  properties automatically. NewCorrelationId = 12-hex-char GUID slice for
  cases where no OPC UA RequestHeader.RequestHandle is in flight.

CapabilityInvoker — now threads LogContextEnricher around every ExecuteAsync /
ExecuteWriteAsync call site. OtOpcUaServer passes driver.DriverType through
so logs correlate to the driver type too. Every capability call emits
structured fields per the Stream C.4 compliance check.

Server.Observability:
- HealthEndpointsHost — standalone HttpListener on http://localhost:4841/
  (loopback avoids Windows URL-ACL elevation; remote probing via reverse
  proxy or explicit netsh urlacl grant). Routes:
    /healthz → 200 when (configDbReachable OR usingStaleConfig); 503 otherwise.
      Body: status, uptimeSeconds, configDbReachable, usingStaleConfig.
    /readyz  → DriverHealthReport.Aggregate + HttpStatus mapping.
      Body: verdict, drivers[], degradedDrivers[], uptimeSeconds.
    anything else → 404.
  Disposal cooperative with the HttpListener shutdown.
- OpcUaApplicationHost starts the health host after the OPC UA server comes up
  and disposes it on shutdown. New OpcUaServerOptions knobs:
  HealthEndpointsEnabled (default true), HealthEndpointsPrefix (default
  http://localhost:4841/).

Program.cs:
- Serilog pipeline adds Enrich.FromLogContext + opt-in JSON file sink via
  `Serilog:WriteJson = true` appsetting. Uses Serilog.Formatting.Compact's
  CompactJsonFormatter (one JSON object per line — SIEMs like Splunk,
  Datadog, Graylog ingest without a regex parser).

Server.Tests:
- Existing 3 OpcUaApplicationHost integration tests now set
  HealthEndpointsEnabled=false to avoid port :4841 collisions under parallel
  execution.
- New HealthEndpointsHostTests (9): /healthz healthy empty fleet; stale-config
  returns 200 with flag; unreachable+no-cache returns 503; /readyz empty/
  Healthy/Faulted/Degraded/Initializing drivers return correct status and
  bodies; unknown path → 404. Uses ephemeral ports via Interlocked counter.

Core.Tests:
- DriverHealthReportTests (8): empty fleet, all-healthy, any-Faulted trumps,
  any-NotReady without Faulted, Degraded without Faulted/NotReady, HttpStatus
  per-verdict theory.
- LogContextEnricherTests (8): all 4 properties attach; scope disposes cleanly;
  NewCorrelationId shape; null/whitespace driverInstanceId throws.
- CapabilityInvokerEnrichmentTests (2): inner logs carry structured
  properties; no context leak outside the call site.

Full solution dotnet test: 1016 passing (baseline 906, +110 for Phase 6.1 so
far across Streams A+B+C). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:15:44 -04:00
6b3a67fd9e Merge pull request (#79) - Phase 6.1 Stream B - Tier A/B/C stability (registry + MemoryTracking + MemoryRecycle + Scheduler + WedgeDetector) 2026-04-19 08:05:03 -04:00
Joseph Doherty
1d9008e354 Phase 6.1 Stream B.3/B.4/B.5 — MemoryRecycle + ScheduledRecycleScheduler + demand-aware WedgeDetector
Closes out Stream B per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

Core.Abstractions:
- IDriverSupervisor — process-level supervisor contract a Tier C driver's
  out-of-process topology provides (Galaxy Proxy/Supervisor implements this in
  a follow-up Driver.Galaxy wiring PR). Concerns: DriverInstanceId + RecycleAsync.
  Tier A/B drivers don't implement this; Stream B code asserts tier == C before
  ever calling it.

Core.Stability:
- MemoryRecycle — companion to MemoryTracking. On HardBreach, invokes the
  supervisor IFF tier == C AND a supervisor is wired. Tier A/B HardBreach logs
  a promotion-to-Tier-C recommendation and returns false. Soft/None/Warming
  never triggers a recycle at any tier.
- ScheduledRecycleScheduler — Tier C opt-in periodic recycler per decision #67.
  Ctor throws for Tier A/B (structural guard — scheduled recycle on an
  in-process driver would kill every OPC UA session and every co-hosted
  driver). TickAsync(now) advances the schedule by one interval per fire;
  RequestRecycleNowAsync drives an ad-hoc recycle without shifting the cron.
- WedgeDetector — demand-aware per decision #147. Classify(state, demand, now)
  returns:
    * NotApplicable when driver state != Healthy
    * Idle when Healthy + no pending work (bulkhead=0 && monitored=0 && historic=0)
    * Healthy when Healthy + pending work + progress within threshold
    * Faulted when Healthy + pending work + no progress within threshold
  Threshold clamps to min 60 s. DemandSignal.HasPendingWork ORs the three counters.
  The three false-wedge cases the plan calls out all stay Healthy: idle
  subscription-only, slow historian backfill making progress, write-only burst
  with drained bulkhead.

Tests (22 new, all pass):
- MemoryRecycleTests (7): Tier C hard-breach requests recycle; Tier A/B
  hard-breach never requests; Tier C without supervisor no-ops; soft-breach
  at every tier never requests; None/Warming never request.
- ScheduledRecycleSchedulerTests (6): ctor throws for A/B; zero/negative
  interval throws; tick before due no-ops; tick at/after due fires once and
  advances; RequestRecycleNow fires immediately without shifting schedule;
  multiple fires across ticks advance one interval each.
- WedgeDetectorTests (9): threshold clamp to 60 s; unhealthy driver always
  NotApplicable; idle subscription stays Idle; pending+fresh progress stays
  Healthy; pending+stale progress is Faulted; MonitoredItems active but no
  publish is Faulted; MonitoredItems active with fresh publish stays Healthy;
  historian backfill with fresh progress stays Healthy; write-only burst with
  empty bulkhead is Idle; HasPendingWork theory for any non-zero counter.

Full solution dotnet test: 989 passing (baseline 906, +83 for Phase 6.1 so far).
Pre-existing Client.CLI Subscribe flake unchanged.

Stream B complete. Next up: Stream C (health endpoints + structured logging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:03:18 -04:00
Joseph Doherty
ef6b0bb8fc Phase 6.1 Stream B.1/B.2 — DriverTier on DriverTypeMetadata + Core.Stability.MemoryTracking with hybrid-formula soft/hard thresholds
Stream B.1 — registry invariant:
- DriverTypeMetadata gains a required `DriverTier Tier` field. Every registered
  driver type must declare its stability tier so the downstream MemoryTracking,
  MemoryRecycle, and resilience-policy layers can resolve the right defaults.
  Stamped-at-registration-time enforcement makes the "every driver type has a
  non-null Tier" compliance check structurally impossible to fail.
- DriverTypeRegistry API unchanged; one new property on the record.

Stream B.2 — MemoryTracking (Core.Stability):
- Tier-agnostic tracker per decision #146: captures baseline as the median of
  samples collected during a post-init warmup window (default 5 min), then
  classifies each subsequent sample with the hybrid formula
  `soft = max(multiplier × baseline, baseline + floor)`, `hard = 2 × soft`.
- Per-tier constants wired: Tier A mult=3 floor=50 MB, Tier B mult=3 floor=100 MB,
  Tier C mult=2 floor=500 MB.
- Never kills. Hard-breach action returns HardBreach; the supervisor that acts
  on that signal (MemoryRecycle) is Tier C only per decisions #74, #145 and
  lands in the next B.3 commit on this branch.
- Two phases: WarmingUp (samples collected, Warming returned) and Steady
  (baseline captured, soft/hard checks active). Transition is automatic when
  the warmup window elapses.

Tests (15 new, all pass):
- Warming phase returns Warming until the window elapses.
- Window-elapsed captures median baseline + transitions to Steady.
- Per-tier constants match decision #146 table exactly.
- Soft threshold uses max() — small baseline → floor wins; large baseline →
  multiplier wins.
- Hard = 2 × soft.
- Sample below soft = None; at soft = SoftBreach; at/above hard = HardBreach.
- DriverTypeRegistry: theory asserts Tier round-trips for A/B/C.

Full solution dotnet test: 963 passing (baseline 906, +57 net for Phase 6.1
Stream A + Stream B.1/B.2). Pre-existing Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:37:43 -04:00
a06fcb16a2 Merge pull request (#78) - Phase 6.1 Stream A - Polly resilience + CapabilityInvoker + Read/Write/HistoryRead dispatch wrapping 2026-04-19 07:33:53 -04:00
Joseph Doherty
d2f3a243cd Phase 6.1 Stream A.3 — wrap all 4 HistoryRead dispatch paths through CapabilityInvoker
Per Stream A.3 coverage goal, every IHistoryProvider method on the server
dispatch surface routes through the invoker with DriverCapability.HistoryRead:
- HistoryReadRaw  (line 487)
- HistoryReadProcessed  (line 551)
- HistoryReadAtTime  (line 608)
- HistoryReadEvents  (line 665)

Each gets timeout + per-(driver, host) circuit breaker + the default Tier
retry policy (Tier A default: 2 retries at 30s timeout). Inner driver
GetAwaiter().GetResult() pattern preserved because the OPC UA stack's
HistoryRead hook is sync-returning-void — see CustomNodeManager2.

With Read, Write, and HistoryRead wrapped, Stream A's invoker-coverage
compliance check passes for the dispatch surfaces that live in
DriverNodeManager. Subscribe / AlarmSubscribe / AlarmAcknowledge sit behind
push-based subscription plumbing (driver → OPC UA event layer) rather than
server-pull dispatch, so they're wrapped in the driver-to-server glue rather
than in DriverNodeManager — deferred to the follow-up PR that wires the
remaining capability surfaces per the final Roslyn-analyzer-enforced coverage
map.

Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:32:10 -04:00
Joseph Doherty
29bcaf277b Phase 6.1 Stream A.3 complete — wire CapabilityInvoker into DriverNodeManager dispatch end-to-end
Every OnReadValue / OnWriteValue now routes through the process-singleton
DriverResiliencePipelineBuilder's CapabilityInvoker. Read / Write dispatch
paths gain timeout + per-capability retry + per-(driver, host) circuit breaker
+ bulkhead without touching the individual driver implementations.

Wiring:
- OpcUaApplicationHost: new optional DriverResiliencePipelineBuilder ctor
  parameter (default null → instance-owned builder). Keeps the 3 test call
  sites that construct OpcUaApplicationHost directly unchanged.
- OtOpcUaServer: requires the builder in its ctor; constructs one
  CapabilityInvoker per driver at CreateMasterNodeManager time with default
  Tier A DriverResilienceOptions. TODO: Stream B.1 will wire real per-driver-
  type tiers via DriverTypeRegistry; Phase 6.1 follow-up will read the
  DriverInstance.ResilienceConfig JSON column for per-instance overrides.
- DriverNodeManager: takes a CapabilityInvoker in its ctor. OnReadValue wraps
  the driver's ReadAsync through ExecuteAsync(DriverCapability.Read, hostName,
  ...); OnWriteValue wraps WriteAsync through ExecuteWriteAsync(hostName,
  isIdempotent, ...) where isIdempotent comes from the new
  _writeIdempotentByFullRef map populated at Variable() registration from
  DriverAttributeInfo.WriteIdempotent.

HostName defaults to driver.DriverInstanceId for now — a single-host pipeline
per driver. Multi-host drivers (Modbus with N PLCs) will expose their own per-
call host resolution in a follow-up so failing PLCs can trip per-PLC breakers
without poisoning siblings (decision #144).

Test fixup:
- FlakeyDriverIntegrationTests.Read_SurfacesSuccess_AfterTransientFailures:
  bumped TimeoutSeconds=2 → 30. 10 retries at exponential backoff with jitter
  can exceed 2s under parallel-test-run CPU pressure; the test asserts retry
  behavior, not timeout budget, so the longer slack keeps it deterministic.

Full solution dotnet test: 948 passing. Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:28:28 -04:00
Joseph Doherty
b6d2803ff6 Phase 6.1 Stream A — switch pipeline keys from Guid to string to match IDriver.DriverInstanceId
IDriver.DriverInstanceId is declared as string in Core.Abstractions; keeping
the pipeline key as Guid meant every call site would need .ToString() / Guid.Parse
at the boundary. Switching the Resilience types to string removes that friction
and lets OtOpcUaServer pass driver.DriverInstanceId directly to the builder in
the upcoming server-dispatch wiring PR.

- DriverResiliencePipelineBuilder.GetOrCreate + Invalidate + PipelineKey
- CapabilityInvoker.ctor + _driverInstanceId field

Tests: all 48 Core.Tests still pass. The Invalidate test's keepId / dropId now
use distinct "drv-keep" / "drv-drop" literals (previously both were distinct
Guid.NewGuid() values, which the sed-driven refactor had collapsed to the same
literal — caught pre-commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:18:55 -04:00
Joseph Doherty
f3850f8914 Phase 6.1 Stream A.5/A.6 — WriteIdempotent flag on DriverAttributeInfo + Modbus/S7 tag records + FlakeyDriver integration tests
Per-tag opt-in for write-retry per docs/v2/plan.md decisions #44, #45, #143.
Default is false — writes never auto-retry unless the driver author has marked
the tag as safe to replay.

Core.Abstractions:
- DriverAttributeInfo gains `bool WriteIdempotent = false` at the end of the
  positional record (back-compatible; every existing call site uses the default).

Driver.Modbus:
- ModbusTagDefinition gains `bool WriteIdempotent = false`. Safe candidates
  documented in the param XML: holding-register set-points, configuration
  registers. Unsafe: edge-triggered coils, counter-increment addresses.
- ModbusDriver.DiscoverAsync propagates t.WriteIdempotent into
  DriverAttributeInfo.WriteIdempotent.

Driver.S7:
- S7TagDefinition gains `bool WriteIdempotent = false`. Safe candidates:
  DB word/dword set-points, configuration DBs. Unsafe: M/Q bits that drive
  edge-triggered program routines.
- S7Driver.DiscoverAsync propagates the flag.

Stream A.5 integration tests (FlakeyDriverIntegrationTests, 4 new) exercise
the invoker + flaky-driver contract the plan enumerates:
- Read with 5 transient failures succeeds on the 6th attempt (RetryCount=10).
- Non-idempotent write with RetryCount=5 configured still fails on the first
  failure — no replay (decision #44 guard at the ExecuteWriteAsync surface).
- Idempotent write with 2 transient failures succeeds on the 3rd attempt.
- Two hosts on the same driver have independent breakers — dead-host trips
  its breaker but live-host's first call still succeeds.

Propagation tests:
- ModbusDriverTests: SetPoint WriteIdempotent=true flows into
  DriverAttributeInfo; PulseCoil default=false.
- S7DiscoveryAndSubscribeTests: same pattern for DBx SetPoint vs M-bit.

Full solution dotnet test: 947 passing (baseline 906, +41 net across Stream A
so far). Pre-existing Client.CLI Subscribe flake unchanged.

Stream A's remaining work (wiring CapabilityInvoker into DriverNodeManager's
OnReadValue / OnWriteValue / History / Subscribe dispatch paths) is the
server-side integration piece + needs DI wiring for the pipeline builder —
lands in the next PR on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:16:21 -04:00
Joseph Doherty
90f7792c92 Phase 6.1 Stream A.3 — CapabilityInvoker wraps driver-capability calls through the shared pipeline
One invoker per (DriverInstance, IDriver) pair; calls ExecuteAsync(capability,
host, callSite) and the invoker resolves the correct pipeline from the shared
DriverResiliencePipelineBuilder. The options accessor is a Func so Admin-edit
+ pipeline-invalidate takes effect without restarting the invoker or the
driver host.

ExecuteWriteAsync(isIdempotent) is the explicit write-safety surface:
- isIdempotent=false routes through a side pipeline with RetryCount=0 regardless
  of what the caller configured. The cache key carries a "::non-idempotent"
  suffix so it never collides with the retry-enabled write pipeline.
- isIdempotent=true routes through the normal Write pipeline. If the user has
  configured Write retries (opt-in), the idempotent tag gets them; otherwise
  default-0 still wins.

The server dispatch layer (next PR) reads WriteIdempotentAttribute on each tag
definition once at driver-init time and feeds the boolean into ExecuteWriteAsync.

Tests (6 new):
- Read retries on transient failure; returns value from call site.
- Write non-idempotent does NOT retry even when policy has 3 retries configured
  (the explicit decision-#44 guard at the dispatch surface).
- Write idempotent retries when policy allows.
- Write with default tier-A policy (RetryCount=0) never retries regardless of
  idempotency flag.
- Different hosts get independent pipelines.

Core.Tests now 44 passing (was 38). Invoker doc-refs completed (the XML comment
on WriteIdempotentAttribute no longer references a non-existent type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 04:09:26 -04:00
Joseph Doherty
c04b13f436 Phase 6.1 Stream A.1/A.2/A.6 — Polly resilience foundation: pipeline builder + per-tier policy defaults + WriteIdempotent attribute
Lands the first chunk of the Phase 6.1 Stream A resilience layer per
docs/v2/implementation/phase-6-1-resilience-and-observability.md §Stream A.
Downstream CapabilityInvoker (A.3) + driver-dispatch wiring land in follow-up
PRs on the same branch.

Core.Abstractions additions:
- WriteIdempotentAttribute — marker for tag-definition records that opt into
  auto-retry on IWritable.WriteAsync. Absence = no retry per decisions #44, #45,
  #143. Read once via reflection at driver-init time; no per-write cost.
- DriverCapability enum — enumerates the 8 capability surface points
  (Read / Write / Discover / Subscribe / Probe / AlarmSubscribe / AlarmAcknowledge
  / HistoryRead). AlarmAcknowledge is write-shaped (no retry by default).
- DriverTier enum — A/B/C per driver-stability.md §2-4. Stream B.1 wires this
  into DriverTypeMetadata; surfaced here because the resilience policy defaults
  key on it.

Core.Resilience new namespace:
- DriverResilienceOptions — per-tier × per-capability policy defaults.
  GetTierDefaults(tier) is the source of truth:
    * Tier A: Read 2s/3 retries, Write 2s/0 retries, breaker threshold 5
    * Tier B: Read 4s/3, Write 4s/0, breaker threshold 5
    * Tier C: Read 10s/1, Write 10s/0, breaker threshold 0 (supervisor handles
      process-level breaker per decision #68)
  Resolve(capability) overlays CapabilityPolicies on top of the defaults.
- DriverResiliencePipelineBuilder — composes Timeout → Retry (capability-
  permitting, never on cancellation) → CircuitBreaker (tier-permitting) →
  Bulkhead. Pipelines cached in a lock-free ConcurrentDictionary keyed on
  (DriverInstanceId, HostName, DriverCapability) per decision #144 — one dead
  PLC behind a multi-device driver does not open the breaker for healthy
  siblings. Invalidate(driverInstanceId) supports Admin-triggered reload.

Tests (30 new, all pass):
- DriverResilienceOptionsTests: tier-default coverage for every capability,
  Write + AlarmAcknowledge never retry at any tier, Tier C disables breaker,
  resolve-with-override layering.
- DriverResiliencePipelineBuilderTests: Read retries transients, Write does NOT
  retry on failure (decision #44 guard), dead-host isolation from sibling hosts,
  pipeline reuse for same triple, per-capability isolation, breaker opens after
  threshold on Tier A, timeout fires, cancellation is not retried,
  invalidation scoped to matching instance.

Polly.Core 8.6.6 added to Core.csproj. Full solution dotnet test: 936 passing
(baseline 906 + 30 new). One pre-existing Client.CLI Subscribe flake unchanged
by this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 04:07:27 -04:00
6a30f3dde7 Merge pull request (#77) - Phase 6 reconcile 2026-04-19 03:52:25 -04:00
Joseph Doherty
ba31f200f6 Phase 6 reconcile — merge adjustments into plan bodies, add decisions #143-162, scaffold compliance stubs
After shipping the four Phase 6 plan drafts (PRs 77-80), the adversarial-review
adjustments lived only as trailing "Review" sections. An implementer reading
Stream A would find the original unadjusted guidance, then have to cross-reference
the review to reconcile. This PR makes the plans genuinely executable:

1. Merges every ACCEPTed review finding into the actual Scope / Stream / Compliance
   sections of each phase plan:
   - phase-6-1: Scope table rewrite (per-capability retry, (instance,host) pipeline key,
     MemoryTracking vs MemoryRecycle split, hybrid watchdog formula, demand-aware
     wedge detector, generation-sealed LiteDB). Streams A/B/D + Compliance rewritten.
   - phase-6-2: AuthorizationDecision tri-state, control/data-plane separation,
     MembershipFreshnessInterval (15 min), AuthCacheMaxStaleness (5 min),
     subscription stamp-and-reevaluate. Stream C widened to 11 OPC UA operations.
   - phase-6-3: 8-state ServiceLevel matrix (OPC UA Part 5 §6.3.34-compliant),
     two-layer peer probe (/healthz + UaHealthProbe), apply-lease via await using,
     publish-generation fencing, InvalidTopology runtime state, ServerUriArray
     self-first + peers. New Stream F (interop matrix + Galaxy failover).
   - phase-6-4: DraftRevisionToken concurrency control, staged-import via
     EquipmentImportBatch with user-scoped visibility, CSV header version marker,
     decision-#117-aligned identifier columns, 1000-row diff cap,
     decision-#139 OPC 40010 fields, Identification inherits Equipment ACL.

2. Appends decisions #143 through #162 to docs/v2/plan.md capturing the
   architectural commitments the adjustments created. Each decision carries its
   dated rationale so future readers know why the choice was made.

3. Scaffolds scripts/compliance/phase-6-{1,2,3,4}-compliance.ps1 — PowerShell
   stubs with Assert-Todo / Assert-Pass / Assert-Fail helpers. Every check
   maps to a Stream task ID from the corresponding phase plan. Currently all
   checks are TODO and scripts exit 0; each implementation task is responsible
   for replacing its TODO with a real check before closing that task. Saved
   as UTF-8 with BOM so Windows PowerShell 5.1 parses em-dash characters
   without breaking.

Net result: the Phase 6.1 plan is genuinely ready to execute. Stream A.3 can
start tomorrow without reconciling Streams vs. Review on every task; the
compliance script is wired to the Stream IDs; plan.md has the architectural
commitments that justify the Stream choices.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:49:41 -04:00
81a1f7f0f6 Merge pull request 'Phase 6 — Four implementation plans for unplanned v2 features, each with codex adversarial review' (#76) from phase-6-plans-drafts into v2 2026-04-19 03:17:16 -04:00
Joseph Doherty
4695a5c88e Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments. After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist). docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36. phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry. phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope). phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139. Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted). Review surfaced real substantive issues that the initial drafts glossed over: Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule; Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation; Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint; Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants; Phase 6.2 missing Browse enforcement entirely; Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish; Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics; Phase 6.3 ServerUriArray excluding self (spec-bug); Phase 6.3 apply-window counter race on cancellation; Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay; Phase 6.4 stale UNS impact preview overwriting concurrent draft edits; Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2); Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts); Phase 6.4 OPC 40010 field list not matching decision #139. Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility. Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes. 2026-04-19 03:15:00 -04:00
0109fab4bf Merge pull request 'Phase 3 PR 76 -- OPC UA Client IHistoryProvider' (#75) from phase-3-pr76-opcua-client-history into v2 2026-04-19 02:15:31 -04:00
Joseph Doherty
c9e856178a Phase 3 PR 76 -- OPC UA Client IHistoryProvider (HistoryRead passthrough). Driver now implements IHistoryProvider (Raw + Processed + AtTime); ReadEventsAsync deliberately inherits the interface default that throws NotSupportedException. ExecuteHistoryReadAsync is the shared wire path: parses the fullReference to NodeId, builds a HistoryReadValueIdCollection with one entry, calls Session.HistoryReadAsync(RequestHeader, ExtensionObject<details>, TimestampsToReturn.Both, releaseContinuationPoints:false, nodesToRead, ct), unwraps r.HistoryData ExtensionObject into the samples list, passes ContinuationPoint through. Each DataValue's upstream StatusCode + SourceTimestamp + ServerTimestamp preserved verbatim per driver-specs.md \u00A78 cascading-quality rule -- this matters especially for historical data where an interpolated / uncertain-quality sample must surface its true severity downstream, not a sanitized Good. SourceTimestamp=DateTime.MinValue guards map to null so downstream clients see 'source unknown' rather than an epoch-zero misread. ReadRawAsync builds ReadRawModifiedDetails with IsReadModified=false (raw, not modified-history), StartTime/EndTime, NumValuesPerNode=maxValuesPerNode, ReturnBounds=false (clients that want bounds request them via continuation handling). ReadProcessedAsync builds ReadProcessedDetails with ProcessingInterval in ms + AggregateType wrapping a single NodeId from MapAggregateToNodeId. MapAggregateToNodeId switches on HistoryAggregateType {Average, Minimum, Maximum, Total, Count} to the standard Part 13 ObjectIds.AggregateFunction_* NodeId -- future aggregate-type additions fail the switch with ArgumentOutOfRangeException so they can't silently slip through with a null NodeId and an opaque server-side BadAggregateNotSupported. ReadAtTimeAsync builds ReadAtTimeDetails with ReqTimes + UseSimpleBounds=true (returns boundary samples when an exact timestamp has no value -- the OPC UA Part 11 default). Malformed NodeId short-circuits to empty result without touching the wire, matching the ReadAsync / WriteAsync pattern. ReadEventsAsync stays at the interface-default NotSupportedException: the OPC UA call path (HistoryReadAsync with ReadEventDetails + EventFilter) needs an EventFilter SelectClauses spec which the current IHistoryProvider.ReadEventsAsync signature doesn't carry. Adding that would be an IHistoryProvider interface widening; out of scope for PR 76. Callers see BadHistoryOperationUnsupported on the OPC UA client which is the documented fallback. Name disambiguation: Core.Abstractions.HistoryReadResult and Opc.Ua.HistoryReadResult both exist; used fully-qualified Core.Abstractions.HistoryReadResult in return types + factory expressions. Shutdown unchanged -- history reads don't create persistent server-side resources, so no cleanup needed beyond the existing Session.CloseAsync. Unit tests (OpcUaClientHistoryTests, 7 facts): MapAggregateToNodeId theory covers all 5 aggregates; MapAggregateToNodeId_rejects_invalid_enum (defense against future enum addition silently passing through); Read{Raw,Processed,AtTime}Async_without_initialize_throws (RequireSession path); ReadEventsAsync_throws_NotSupportedException (locks in the intentional inheritance of the default). 78/78 OpcUaClient.Tests pass (67 prior + 11 new, -4 on the alarm suite moved into the events count). dotnet build clean. Final OPC UA Client capability surface: IDriver + ITagDiscovery + IReadable + IWritable + ISubscribable + IHostConnectivityProbe + IAlarmSource + IHistoryProvider -- 8 of 8 possible capabilities. Driver is feature-complete per driver-specs.md \u00A78. 2026-04-19 02:13:22 -04:00
63eb569fd6 Merge pull request 'Phase 3 PR 75 -- OPC UA Client IAlarmSource' (#74) from phase-3-pr75-opcua-client-alarms into v2 2026-04-19 02:11:10 -04:00
Joseph Doherty
fad04bbdf7 Phase 3 PR 75 -- OPC UA Client IAlarmSource (A&C event forwarding + Acknowledge). Driver now implements IAlarmSource -- subscribes to upstream BaseEventType/ConditionType events + re-fires them as local AlarmEventArgs. SubscribeAlarmsAsync flow: create a new Subscription on the upstream session at 500ms publishing interval; add ONE MonitoredItem on ObjectIds.Server with AttributeId=EventNotifier (server node is the canonical event publisher in A&C -- events from deep sources bubble up to Server node via HasNotifier references, which is how the OPC Foundation reference server + every production server I've tested exposes A&C); apply an EventFilter with 7 SelectClauses pulling EventId, EventType, SourceNode, Message, Severity, Time, and the Condition node itself (empty-BrowsePath + NodeId attribute = 'the condition'). Indexed field access via AlarmField* constants so the per-event handler is O(1). Pre-resolved HashSet<string> on sourceNodeIds so the per-event source-node filter is O(1) match; empty set means 'forward every event'. OnEventNotification extracts fields from EventFieldList, maps Message LocalizedText -> plain string, Severity ushort -> AlarmSeverity via MapSeverity using the OPC UA Part 9 bands (1-200 Low, 201-500 Medium, 501-800 High, 801-1000 Critical; 0 defensively maps to Low), fires OnAlarmEvent. Queue size 1000 + DiscardOldest=false so bursts (e.g. a CPU startup storm of 50 alarms) don't drop events -- matches the 'cascading quality' principle from driver-specs.md \u00A78 where the driver must not silently lose upstream state. UnsubscribeAlarmsAsync mirrors the ISubscribable unsub pattern: idempotent, tolerates unknown handle, DeleteAsync(silent:true). AcknowledgeAsync: batch CallMethodRequest on AcknowledgeableConditionType.Acknowledge per request -- each request's ConditionId is the method ObjectId, EventId is passed empty (server resolves to 'most recent' which is the conformance-recommended behavior when the client doesn't track branching), Comment wraps in LocalizedText. Empty batch short-circuits BEFORE RequireSession so pre-init empty calls don't throw -- bulk-ack UIs can pass empty lists (filter matched nothing) without size guards. Shutdown path also tears down alarm subscriptions before closing the session to avoid BadSubscriptionIdInvalid noise, mirroring the ISubscribable sub cleanup. Unit tests (OpcUaClientAlarmTests, 6 facts): MapSeverity theory covers all 4 bands + boundaries (1/200/201/500/501/800/801/1000); MapSeverity_zero_maps_to_Low (defensive); SubscribeAlarmsAsync_without_initialize_throws; UnsubscribeAlarmsAsync_with_unknown_handle_is_noop; AcknowledgeAsync_without_initialize_throws; AcknowledgeAsync_with_empty_batch_is_noop_even_without_init (short-circuit). Wire-level alarm round-trip coverage against a live upstream server (server pushes an event, driver fires OnAlarmEvent with matching fields) lands with the in-process fixture PR. 67/67 OpcUaClient.Tests pass (54 prior + 13 new -- 6 alarm + 7 attribute mapping carry-over). dotnet build clean. 2026-04-19 02:09:04 -04:00
17f901bb65 Merge pull request 'Phase 3 PR 74 -- OPC UA Client transparent reconnect via SessionReconnectHandler' (#73) from phase-3-pr74-opcua-client-session-reconnect into v2 2026-04-19 02:06:48 -04:00
Joseph Doherty
ba3a5598e1 Phase 3 PR 74 -- OPC UA Client transparent reconnect via SessionReconnectHandler. Before this PR a session keep-alive failure flipped HostState to Stopped and stayed there until operator intervention. PR 74 wires the SDK's SessionReconnectHandler so the driver automatically retries + swaps in a new session when the upstream server comes back. New _reconnectHandler field lazily instantiated inside OnKeepAlive on a bad status; subsequent bad keep-alives during the same outage no-op (null-check prevents stacked handlers). Constructor uses (telemetry:null, reconnectAbort:false, maxReconnectPeriod:2min) -- reconnectAbort=false so the handler keeps trying across many retry cycles; 2min cap prevents pathological back-off from starving operator visibility. BeginReconnect takes the current ISession + ReconnectPeriod (from OpcUaClientDriverOptions, default 5s per driver-specs.md \u00A78) + our OnReconnectComplete callback. OnReconnectComplete reads handler.Session for the new session, unwires keepalive from the dead session, rewires to the new session (without this the NEXT drop wouldn't trigger another reconnect -- subtle and critical), swaps Session, disposes the handler. The SDK's Session.TransferSubscriptionsOnReconnect default=true handles subscription migration internally so local MonitoredItem handles stay live across the reconnect; no driver-side manual transfer needed. Shutdown path now aborts any in-flight reconnect via _reconnectHandler.CancelReconnect() + Dispose BEFORE touching Session.CloseAsync -- without this the handler's retry loop holds a reference to the about-to-close session and fights the close, producing BadSessionIdInvalid noise in the upstream log and potential disposal-race exceptions. Cancel-first is the documented SDK pattern. Kept the driver's own HostState/OnHostStatusChanged flow: bad keep-alive -> Stopped transition + reconnect kicks off; OnReconnectComplete -> Running transition + Healthy status. Downstream consumers see the bounce as Stopped->Running without needing to know about the reconnect handler internals. Unit tests (OpcUaClientReconnectTests, 3 facts): Default_ReconnectPeriod_matches_driver_specs_5_seconds (sanity check on the options default), Options_ReconnectPeriod_is_configurable_for_aggressive_or_relaxed_retry (500ms override works), Driver_starts_with_no_reconnect_handler_active_pre_init (lazy instantiation -- indirectly via lifecycle). Wire-level disconnect-reconnect-resume coverage against a live upstream server is deferred to the in-process-fixture PR -- testing the reconnect path needs a server we can kill + revive mid-test, non-trivial to scaffold in xUnit. 54/54 OpcUaClient.Tests pass (51 prior + 3 reconnect). dotnet build clean. 2026-04-19 02:04:42 -04:00
8cd932e7c9 Merge pull request 'Phase 3 PR 73 -- OPC UA Client browse enrichment' (#72) from phase-3-pr73-opcua-client-browse-enrichment into v2 2026-04-19 02:02:39 -04:00
Joseph Doherty
28328def5d Phase 3 PR 73 -- OPC UA Client browse enrichment (DataType + AccessLevel + ValueRank + Historizing). Before this PR discovered variables always registered with DriverDataType.Int32 + SecurityClassification.ViewOnly + IsArray=false as conservative placeholders -- correct wire-format NodeId but useless downstream metadata. PR 73 adds a two-pass browse. Pass 1 unchanged shape but now collects (ParentFolder, BrowseName, DisplayName, NodeId) tuples into a pendingVariables list instead of registering each variable inline; folders still register inline. Pass 2 calls Session.ReadAsync once with (variableCount * 4) ReadValueId entries reading DataType + ValueRank + UserAccessLevel + Historizing for every variable. Server-side chunking via the SDK keeps the request shape within the server's per-request limits automatically. Attribute mapping: MapUpstreamDataType maps every standard DataTypeIds.* to a DriverDataType -- Boolean, SByte+Byte widened to Int16 (DriverDataType has no 8-bit, flagged in comment for future Core.Abstractions widening), Int16/32/64, UInt16/32/64, Float->Float32, Double->Float64, String, DateTime+UtcTime->DateTime. Unknown/vendor-custom NodeIds fall back to String -- safest passthrough for Variant-wrapped structs/enums/extension objects since the cascading-quality path preserves upstream StatusCode+timestamps regardless. MapAccessLevelToSecurityClass reads AccessLevels.CurrentWrite bit (0x02) -- when set, the variable is writable-for-this-user so it surfaces as Operate; otherwise ViewOnly. Uses UserAccessLevel not AccessLevel because UserAccessLevel is post-ACL-filter -- reflects what THIS session can actually do, not the server's default. IsArray derived from ValueRank (-1 = scalar, 0 = 1-D array, 1+ = multi-dim). IsHistorized reflects the server's Historizing flag directly so PR 76's IHistoryProvider routing can gate on it. Graceful degradation: (a) individual attribute failures (Bad StatusCode on DataType read) fall through to the type defaults, variable still registers; (b) wholesale enrichment-read failure (e.g. session dropped mid-browse) catches the exception, registers every pending variable with fallback defaults via RegisterFallback, browse completes. Either way the downstream address space is never empty when browse succeeded the first pass -- partial metadata is strictly better than missing variables. Unit tests (OpcUaClientAttributeMappingTests, 20 facts): MapUpstreamDataType theory covers 11 standard types including Boolean/Int16/UInt16/Int32/UInt32/Int64/UInt64/Float/Double/String/DateTime; separate facts for SByte+Byte (widened to Int16), UtcTime (DateTime), custom NodeId (String fallback); MapAccessLevelToSecurityClass theory covers 6 access-level bitmasks including CurrentRead-only (ViewOnly), CurrentWrite-only (Operate), read+write (Operate), HistoryRead-only (ViewOnly -- no Write bit). 51/51 OpcUaClient.Tests pass (31 prior + 20 new). dotnet build clean. Pending variables structured as a private readonly record struct so the ref-type allocation is stack-local for typical browse sizes. Paves the way for PR 74 SessionReconnectHandler (same enrichment path is re-runnable on reconnect) + PR 76 IHistoryProvider (gates on IsHistorized). 2026-04-19 02:00:31 -04:00
d3bf544abc Merge pull request 'Phase 3 PR 72 -- Multi-endpoint failover for OPC UA Client' (#71) from phase-3-pr72-opcua-client-failover into v2 2026-04-19 01:54:36 -04:00
Joseph Doherty
24435712c4 Phase 3 PR 72 -- Multi-endpoint failover for OPC UA Client driver. Adds OpcUaClientDriverOptions.EndpointUrls ordered list + PerEndpointConnectTimeout knob. On InitializeAsync the driver walks the candidate list in order via ResolveEndpointCandidates and returns the session from the first endpoint that successfully connects. Captures per-URL failure reasons in a List<string> and, if every candidate fails, throws AggregateException whose message names every URL + its failure class (e.g. 'opc.tcp://primary:4840 -> TimeoutException: ...'). That's critical diag for field debugging -- without it 'failover picked the wrong one' surfaces as a mystery. Single-URL backwards compat: EndpointUrl field retained as a one-URL shortcut. When EndpointUrls is null or empty the driver falls through to a single-candidate list of [EndpointUrl], so every existing single-endpoint config keeps working without migration. When both are provided, EndpointUrls wins + EndpointUrl is ignored -- documented on the field xml-doc. Per-endpoint connect budget: PerEndpointConnectTimeout (default 3s) caps each attempt so a sweep over several dead servers can't blow the overall init budget. Applied via CancellationTokenSource.CreateLinkedTokenSource + CancelAfter inside OpenSessionOnEndpointAsync (the extracted single-endpoint connect helper) so the cap is independent of the outer Options.Timeout which governs steady-state ops. BuildUserIdentity extracted out of InitializeAsync so the failover loop builds the UserIdentity ONCE and reuses it across every endpoint attempt -- generating it N times would re-unlock the user cert's private key N times, wasteful + keeps the password in memory longer. HostName now reflects the endpoint that actually connected via _connectedEndpointUrl instead of always returning opts.EndpointUrl -- so the Admin /hosts dashboard shows which of the configured endpoints is currently serving traffic (primary vs backup). Falls back to the first candidate pre-connect so the dashboard has a sensible identity before the first connect, and resets to null on ShutdownAsync. Use case: an OPC UA hot-standby server pair (primary 4840 + backup 4841) where either can serve the same address space. Operator configures EndpointUrls=[primary, backup]; driver tries primary first, falls over to backup on primary failure with a clean AggregateException describing both attempts if both are down. Unit tests (OpcUaClientFailoverTests, 5 facts): ResolveEndpointCandidates_prefers_EndpointUrls_when_provided (list trumps single), ResolveEndpointCandidates_falls_back_to_single_EndpointUrl_when_list_empty (legacy config compat), ResolveEndpointCandidates_empty_list_treated_as_fallback (explicit empty list also falls back -- otherwise we'd produce a zero-candidate sweep that throws with nothing tried), HostName_uses_first_candidate_before_connect (dashboard rendering pre-connect), Initialize_against_all_unreachable_endpoints_throws_AggregateException_listing_each (three loopback dead ports, asserts each URL appears in the aggregate message + driver flips to Faulted). 31/31 OpcUaClient.Tests pass. dotnet build clean. OPC UA Client driver security/auth/availability feature set now complete per driver-specs.md \u00A78: policy-filtered endpoint selection (PR 70), Anonymous+Username+Certificate auth (PR 71), multi-endpoint failover (this PR). 2026-04-19 01:52:31 -04:00
3f7b4d05e6 Merge pull request 'Phase 3 PR 71 -- OpcUaAuthType.Certificate user authentication' (#70) from phase-3-pr71-opcua-client-cert-auth into v2 2026-04-19 01:49:29 -04:00
Joseph Doherty
a79c5f3008 Phase 3 PR 71 -- OpcUaAuthType.Certificate user authentication. Implements the third user-token type in the OPC UA spec (Anonymous + UserName + Certificate). Before this PR the Certificate branch threw NotSupportedException. Adds OpcUaClientDriverOptions.UserCertificatePath + UserCertificatePassword knobs for the PFX on disk. The InitializeAsync user-identity switch now calls BuildCertificateIdentity for AuthType=Certificate. Load path uses X509CertificateLoader.LoadPkcs12FromFile -- the non-obsolete .NET 9+ API; the legacy X509Certificate2 PFX ctors are deprecated on net10. Validation up-front: empty UserCertificatePath throws InvalidOperationException naming the missing field; non-existent file throws FileNotFoundException with path; private-key-missing throws InvalidOperationException explaining the private key is required to sign the OPC UA user-token challenge at session activation. Each failure mode is an operator-actionable config problem rather than a mysterious ServiceResultException during session open. UserIdentity(X509Certificate2) ctor carries the cert directly; the SDK sets TokenType=Certificate + wires the cert's public key into the activate-session payload. Private key stays in-memory on the OpenSSL / .NET crypto boundary. Unit tests (OpcUaClientCertAuthTests, 3 facts): BuildCertificateIdentity_rejects_missing_path (error message mentions UserCertificatePath so the fix is obvious); BuildCertificateIdentity_rejects_nonexistent_file (FileNotFoundException); BuildCertificateIdentity_loads_a_valid_PFX_with_private_key -- generates a self-signed RSA-2048 cert on the fly with CertificateRequest.CreateSelfSigned, exports to temp PFX with a password, loads it through the helper, asserts TokenType=Certificate. Test cleans up the temp file in a finally block (best-effort; Windows file locking can leave orphans which is acceptable for %TEMP%). Self-signed cert-on-the-fly avoids shipping a static test PFX that could be flagged by secret-scanners and keeps the test hermetic across dev boxes. 26/26 OpcUaClient.Tests pass (23 prior + 3 cert auth). dotnet build clean. Feature: Anonymous + Username + Certificate all work -- driver-specs.md \u00A78 auth story complete. 2026-04-19 01:47:18 -04:00
a5299a2fee Merge pull request 'Phase 3 PR 70 -- Apply SecurityPolicy + expand to standard OPC UA policies' (#69) from phase-3-pr70-opcua-client-security-policy into v2 2026-04-19 01:46:13 -04:00
Joseph Doherty
a65215684c Phase 3 PR 70 -- Apply SecurityPolicy explicitly + expand to standard OPC UA policy list. Before this PR SecurityPolicy was a string field that got ignored -- the driver only passed useSecurity=SecurityMode!=None to SelectEndpointAsync, so an operator asking for Basic256Sha256 on a server that also advertised Basic128Rsa15 could silently end up on the weaker cipher (the SDK's SelectEndpoint returns whichever matching endpoint the server listed first). PR 70 makes policy matching explicit. SecurityPolicy is now an OpcUaSecurityPolicy enum covering the six standard policies documented in OPC UA 1.04: None, Basic128Rsa15 (deprecated, brownfield interop only), Basic256 (deprecated), Basic256Sha256 (recommended baseline), Aes128_Sha256_RsaOaep, Aes256_Sha256_RsaPss. Each maps through MapSecurityPolicy to the SecurityPolicies URI constant the SDK uses for endpoint matching. New SelectMatchingEndpointAsync replaces CoreClientUtils.SelectEndpointAsync. Flow: opens a DiscoveryClient via the non-obsolete DiscoveryClient.CreateAsync(ApplicationConfiguration, Uri, DiagnosticsMasks, ct) path, calls GetEndpointsAsync to enumerate every endpoint the server advertises, filters client-side by policy URI AND mode. When no endpoint matches, throws InvalidOperationException with the full list of what the server DID advertise formatted as 'Policy/Mode' pairs so the operator sees exactly what to fix in their config without a Wireshark trace. Fail-loud behaviour intentional -- a silent fall-through to weaker crypto is worse than a clear config error. MapSecurityPolicy is internal-visible to tests via InternalsVisibleTo from PR 66. Unit tests (OpcUaClientSecurityPolicyTests, 5 facts): MapSecurityPolicy_returns_known_non_empty_uri_for_every_enum_value theory covers all 6 policies; URI contains the enum name for non-None so operators can grep logs back to the config value; MapSecurityPolicy_None_matches_SDK_None_URI, MapSecurityPolicy_Basic256Sha256_matches_SDK_URI, MapSecurityPolicy_Aes256_Sha256_RsaPss_matches_SDK_URI all cross-check against the SDK's SecurityPolicies.* constants to catch a future enum-vs-URI drift; Every_enum_value_has_a_mapping walks Enum.GetValues to ensure adding a new case doesn't silently fall through the switch. Scaffold test updated to assert SecurityPolicy default = None (was previously unchecked). 23/23 OpcUaClient.Tests pass (13 prior + 5 scaffold + 5 new policy). dotnet build clean. Note on DiscoveryClient: the synchronous DiscoveryClient.Create(...) overloads are all [Obsolete] in SDK 1.5.378; must use DiscoveryClient.CreateAsync. GetEndpointsAsync(null, ct) returns EndpointDescriptionCollection directly (not a wrapper). 2026-04-19 01:44:07 -04:00
82f2dfcfa3 Merge pull request 'Phase 3 PR 69 -- OPC UA Client ISubscribable + IHostConnectivityProbe' (#68) from phase-3-pr69-opcua-client-subscribe-probe into v2 2026-04-19 01:24:21 -04:00
Joseph Doherty
0433d3a35e Phase 3 PR 69 -- OPC UA Client ISubscribable + IHostConnectivityProbe. Completes the OpcUaClientDriver capability surface — now matches the Galaxy + Modbus + S7 driver coverage. ISubscribable: SubscribeAsync creates a new upstream Subscription via the non-obsolete Subscription(ITelemetryContext, SubscriptionOptions) ctor + AddItem/CreateItemsAsync flow, wires each MonitoredItem's Notification event into OnDataChange. Tag strings round-trip through MonitoredItem.Handle so the notification handler can identify which tag changed without a second lookup. Publishing interval floored at 50ms (servers negotiate up anyway; sub-50ms wastes round-trip). SubscriptionOptions uses KeepAliveCount=10, LifetimeCount=1000, TimestampsToReturn=Both so SourceTimestamp passthrough for the cascading-quality rule works through subscription paths too. UnsubscribeAsync calls Subscription.DeleteAsync(silent:true) and tolerates unknown handles (returns cleanly) because the caller's race with server-side cleanup after a session drop shouldn't crash either side. Session shutdown explicitly deletes every remote subscription before closing — avoids BadSubscriptionIdInvalid noise in the upstream server's log on Close. IHostConnectivityProbe: HostName surfaced as the EndpointUrl (not host:port like the Modbus/S7 drivers) so the Admin /hosts dashboard can render the full opc.tcp:// URL as a clickable target back at the remote server. HostState tracked via session.KeepAlive event — OPC UA's built-in keep-alive is authoritative for session liveness (the SDK pings on KeepAliveInterval, sets KeepAliveStopped after N missed pings), strictly better than a driver-side polling probe: no extra wire round-trip, no duplicate semantic with the native protocol. Handler transitions Running on healthy keep-alives and Stopped on any Bad service-result. Initial Running raised at end of InitializeAsync once the session is up; Shutdown transitions back to Unknown + unwires the handler. Unit tests (OpcUaClientSubscribeAndProbeTests, 3 facts): SubscribeAsync_without_initialize_throws_InvalidOperationException, UnsubscribeAsync_with_unknown_handle_is_noop (session-drop-race safety), GetHostStatuses_returns_endpoint_url_row_pre_init (asserts EndpointUrl as the host identity -- the full opc.tcp://plc.example:4840 URL). Live-session subscribe/unsubscribe round-trip + keep-alive state transition coverage lands in a follow-up PR once we scaffold the in-process OPC UA server fixture. 13/13 OpcUaClient.Tests pass. dotnet build clean. All six capability interfaces (IDriver / ITagDiscovery / IReadable / IWritable / ISubscribable / IHostConnectivityProbe) implemented — OPC UA Client driver surface complete. 2026-04-19 01:22:14 -04:00
141673fc80 Merge pull request 'Phase 3 PR 68 -- OPC UA Client ITagDiscovery (Full browse)' (#67) from phase-3-pr68-opcua-client-discovery into v2 2026-04-19 01:19:27 -04:00
Joseph Doherty
db56a95819 Phase 3 PR 68 -- OPC UA Client ITagDiscovery via recursive browse (Full strategy). Adds ITagDiscovery to OpcUaClientDriver. DiscoverAsync opens a single Remote folder on the IAddressSpaceBuilder and recursively browses from the configured root (default: ObjectsFolder i=85; override via OpcUaClientDriverOptions.BrowseRoot for scoped discovery). Browse uses non-obsolete Session.BrowseAsync(RequestHeader, ViewDescription, uint maxReferences, BrowseDescriptionCollection, ct) with HierarchicalReferences forward, subtypes included, NodeClassMask Object+Variable, ResultMask pulling BrowseName + DisplayName + NodeClass + TypeDefinition. Objects become sub-folders via builder.Folder; Variables become builder.Variable entries with FullName set to the NodeId.ToString() serialization so IReadable/IWritable can round-trip without re-resolving. Three safety caps added to OpcUaClientDriverOptions to bound runaway discovery: (1) MaxBrowseDepth default 10 -- deep enough for realistic OPC UA information models, shallow enough that cyclic graphs can't spin the browse forever. (2) MaxDiscoveredNodes default 10_000 -- caps memory on pathological remote servers. Once the cap is hit, recursion short-circuits and the partially-discovered tree is still projected into the local address space (graceful degradation rather than all-or-nothing). (3) BrowseRoot as an opt-in scope restriction string per driver-specs.md \u00A78 -- defaults to ObjectsFolder but operators with 100k-node servers can point it at a single subtree. Visited-set tracks NodeIds already visited to prevent infinite cycles on graphs with non-strict hierarchy (OPC UA models can have back-references). Transient browse failures on a subtree are swallowed -- the sub-branch stops but the rest of discovery continues, matching the Modbus driver's 'transient poll errors don't kill the loop' pattern. The driver's health surface reflects the network-level cascade via the probe loop (PR 69). Deferred to a follow-up PR: DataType resolution via a batch Session.ReadAsync(Attributes.DataType) after the browse so DriverAttributeInfo.DriverDataType is accurate instead of the current conservative DriverDataType.Int32 default; AccessLevel-derived SecurityClass instead of the current ViewOnly default; array-type detection via Attributes.ValueRank + ArrayDimensions. These need an extra wire round-trip per batch of variables + a NodeId -> DriverDataType mapping table; out of scope for PR 68 to keep browse path landable. Unit tests (OpcUaClientDiscoveryTests, 3 facts): DiscoverAsync_without_initialize_throws_InvalidOperationException (pre-init hits RequireSession); DiscoverAsync_rejects_null_builder (ArgumentNullException); Discovery_caps_are_sensible_defaults (asserts 10000 / 10 / null defaults documented above). NullAddressSpaceBuilder stub implements the full IAddressSpaceBuilder shape including IVariableHandle.MarkAsAlarmCondition (throws NotSupportedException since this PR doesn't wire alarms). Live-browse coverage against a real remote server is deferred to the in-process-server-fixture PR. 10/10 OpcUaClient.Tests pass. dotnet build clean. 2026-04-19 01:17:21 -04:00
89bd726fa8 Merge pull request 'Phase 3 PR 67 -- OPC UA Client IReadable + IWritable' (#66) from phase-3-pr67-opcua-client-read-write into v2 2026-04-19 01:15:42 -04:00
Joseph Doherty
238748bc98 Phase 3 PR 67 -- OPC UA Client IReadable + IWritable via Session.ReadAsync/WriteAsync. Adds IReadable + IWritable capabilities to OpcUaClientDriver, routing reads/writes through the session's non-obsolete ReadAsync(RequestHeader, maxAge, TimestampsToReturn, ReadValueIdCollection, ct) and WriteAsync(RequestHeader, WriteValueCollection, ct) overloads (the sync and BeginXxx/EndXxx patterns are all [Obsolete] in SDK 1.5.378). Serializes on the shared Gate from PR 66 so reads + writes + future subscribe + probe don't race on the single session. NodeId parsing: fullReferences use OPC UA's standard serialized NodeId form -- ns=2;s=Demo.Counter, i=2253, ns=4;g=... for GUID, ns=3;b=... for opaque. TryParseNodeId calls NodeId.Parse with the session's MessageContext which honours the server-negotiated namespace URI table. Malformed input surfaces as BadNodeIdInvalid (0x80330000) WITHOUT a wire round-trip -- saves a request for a fault the driver can detect locally. Cascading-quality implementation per driver-specs.md \u00A78: upstream StatusCode, SourceTimestamp, and ServerTimestamp pass through VERBATIM. Bad codes from the remote server stay as the same Bad code (not translated to generic BadInternalError) so downstream clients can distinguish 'upstream value unavailable' from 'local driver bug'. SourceTimestamp is preserved verbatim (null on MinValue guard) so staleness is visible; ServerTimestamp falls back to DateTime.UtcNow if the upstream omitted it, never overwriting a non-zero value. Wire-level exceptions in the Read batch -- transport / timeout / session-dropped -- fan out BadCommunicationError (0x80050000) across every tag in the batch, not BadInternalError, so operators distinguish network reachability from driver faults. Write-side same pattern: successful WriteAsync maps each upstream StatusCode.Code verbatim into the local WriteResult.StatusCode; transport-layer failure fans out BadCommunicationError across the whole batch. WriteValue carries AttributeId=Value + DataValue wrapping Variant(writeValue) -- the SDK handles the type-to-Variant mapping for common CLR types (bool, int, float, string, etc.) so the driver doesn't need a per-type switch. Name disambiguation: the SDK has its own Opc.Ua.WriteRequest type which collides with ZB.MOM.WW.OtOpcUa.Core.Abstractions.WriteRequest; method signature uses the fully-qualified Core.Abstractions.WriteRequest. Unit tests (OpcUaClientReadWriteTests, 2 facts): ReadAsync_without_initialize_throws_InvalidOperationException + WriteAsync_without_initialize_throws_InvalidOperationException -- pre-init calls hit RequireSession and fail uniformly. Wire-level round-trip coverage against a live remote server lands in a follow-up PR once we scaffold an in-process OPC UA server fixture (the existing Server project in the solution is a candidate host). 7/7 OpcUaClient.Tests pass (5 scaffold + 2 read/write). dotnet build clean. Scope: ITagDiscovery (browse) + ISubscribable + IHostConnectivityProbe remain deferred to PRs 68-69 which also need namespace-index remapping and reference-counted MonitoredItem forwarding per driver-specs.md \u00A78. 2026-04-19 01:13:34 -04:00
b21d550836 Merge pull request 'Phase 3 PR 66 -- OPC UA Client (gateway) driver scaffold' (#65) from phase-3-pr66-opcua-client-scaffold into v2 2026-04-19 01:10:07 -04:00
Joseph Doherty
91eaf534c8 Phase 3 PR 66 -- OPC UA Client (gateway) driver project scaffold + IDriver session lifecycle. First driver that CONSUMES OPC UA rather than PUBLISHES it -- connects to a remote server and re-exposes its address space through the local OtOpcUa server per driver-specs.md \u00A78. Uses the same OPCFoundation.NetStandard.Opc.Ua.Client package the existing Client.Shared ships (bumped to 1.5.378.106 to match). Builds its own ApplicationConfiguration (cert stores under %LocalAppData%/OtOpcUa/pki so multiple driver instances in one OtOpcUa server process share a trust anchor) rather than reusing Client.Shared -- Client.Shared is oriented at the interactive CLI with different session-lifetime needs (this driver is always-on, needs keep-alive + session transfer on reconnect + multi-year uptime). Navigated the post-refactor 1.5.378 SDK surface: every Session.Create* static is now [Obsolete] in favour of DefaultSessionFactory; CoreClientUtils.SelectEndpoint got the sync overloads deprecated in favour of SelectEndpointAsync with a required ITelemetryContext parameter. Driver passes telemetry: null! to both SelectEndpointAsync + new DefaultSessionFactory(telemetry: null!) -- the SDK's internal default sink handles null gracefully and plumbing a telemetry context through the driver options surface is out of scope (the driver emits its own logs via the DriverHealth surface anyway). ApplicationInstance default ctor is also obsolete; wrapped in #pragma warning disable CS0618 rather than migrate to the ITelemetryContext overload for the same reason. OpcUaClientDriverOptions models driver-specs.md \u00A78 settings: EndpointUrl (default opc.tcp://localhost:4840 IANA-assigned port), SecurityPolicy/SecurityMode/AuthType enums, Username/Password, SessionTimeout=120s + KeepAliveInterval=5s + ReconnectPeriod=5s (defaults from spec), AutoAcceptCertificates=false (production default; dev turns on for self-signed servers), ApplicationUri + SessionName knobs for certificate SAN matching and remote-server session-list identification. OpcUaClientDriver : IDriver: InitializeAsync builds the ApplicationConfiguration, resolves + creates cert if missing via app.CheckApplicationInstanceCertificatesAsync, selects endpoint via CoreClientUtils.SelectEndpointAsync, builds UserIdentity (Anonymous or Username with UTF-8-encoded password bytes -- the legacy string-password ctor went away; Certificate auth deferred), creates session via DefaultSessionFactory.CreateAsync. Health transitions Unknown -> Initializing -> Healthy on success or -> Faulted on failure with best-effort Session.CloseAsync cleanup. ShutdownAsync (async now, not Task.CompletedTask) closes the session + disposes. Internal Session + Gate expose to the test project via InternalsVisibleTo so PRs 67-69 can stack read/write/discovery/subscribe on the same serialization. Scaffold tests (OpcUaClientDriverScaffoldTests, 5 facts): Default_options_target_standard_opcua_port_and_anonymous_auth (4840 + None mode + Anonymous + AutoAccept=false production default), Default_timeouts_match_driver_specs_section_8 (120s/5s/5s), Driver_reports_type_and_id_before_connect (DriverType=OpcUaClient, DriverInstanceId round-trip, pre-init Unknown health), Initialize_against_unreachable_endpoint_transitions_to_Faulted_and_throws, Reinitialize_against_unreachable_endpoint_re_throws. Uses opc.tcp://127.0.0.1:1 as the 'guaranteed-unreachable' target -- RFC 5737 reserved IPs get black-holed and time out only after the SDK's internal retry/backoff fully elapses (~60s), while port 1 on loopback refuses immediately with TCP RST which keeps the test suite snappy (5 tests / 8s). 5/5 pass. dotnet build clean. Scope boundary: ITagDiscovery / IReadable / IWritable / ISubscribable / IHostConnectivityProbe deliberately NOT in this PR -- they need browse + namespace remapping + reference-counted MonitoredItem forwarding + keep-alive probing and land in PRs 67-69. 2026-04-19 01:07:57 -04:00
d33e38e059 Merge pull request 'Phase 3 PR 65 -- S7 ITagDiscovery + ISubscribable + IHostConnectivityProbe' (#64) from phase-3-pr65-s7-discovery-subscribe-probe into v2 2026-04-19 00:18:17 -04:00
Joseph Doherty
d8ef35d5bd Phase 3 PR 65 -- S7 ITagDiscovery + ISubscribable polling overlay + IHostConnectivityProbe. Three more capability interfaces on S7Driver, matching the Modbus driver's capability coverage. ITagDiscovery: DiscoverAsync streams every configured tag into IAddressSpaceBuilder under a single 'S7' folder; builder.Variable gets a DriverAttributeInfo carrying DriverDataType (MapDataType: Bool->Boolean, Byte/Int/UInt sizes->Int32 (until Core.Abstractions adds widths), Float32/Float64 direct, String + DateTime direct), SecurityClass (Operate if tag.Writable else ViewOnly -- matches the Modbus pattern so DriverNodeManager's ACL layer can gate writes per role without S7-specific logic), IsHistorized=false (S7 has no native historian surface), IsAlarm=false (S7 alarms land through TIA Portal's alarm-in-DB pattern which is per-site and out of scope for PR 65). ISubscribable polling overlay: same pattern Modbus established in PR 22. SubscribeAsync spawns a Task.Run loop that polls every tag, diffs against LastValues, raises OnDataChange on changes plus a force-raise on initial-data push per OPC UA Part 4 convention. Interval floored at 100ms -- S7 CPUs scan 2-10ms but process the comms mailbox at most once per scan, so sub-scan polling just queues wire-side with worse latency per S7netplus documented pattern. Poll errors tolerated: first-read fault doesn't kill the loop (caller can't receive initial values but subsequent polls try again); transient poll errors also swallowed so the loop survives a power-cycle + reconnect through the health surface. UnsubscribeAsync cancels the CTS + removes the subscription -- unknown handle is a no-op, not a throw, because the caller's race with server-side cleanup shouldn't crash either side. Shutdown tears down every subscription before disposing the Plc. IHostConnectivityProbe: HostName surfaced as host:port to match Modbus driver convention (Admin /hosts dashboard renders both families uniformly). GetHostStatuses returns one row (single-endpoint driver). ProbeLoopAsync serializes on the shared Gate + calls Plc.ReadStatusAsync (cheap Get-CPU-Status PDU that doubles as an 'is PLC up' check) every Probe.Interval with a Probe.Timeout cap, transitions HostState Unknown/Stopped -> Running on success and -> Stopped on any failure, raises OnHostStatusChanged only on actual transitions (no noise for steady-state probes). Probe loop starts at end of InitializeAsync when Probe.Enabled=true (default); Shutdown cancels the probe CTS. Initial state stays Unknown until first successful probe -- avoids broadcasting a premature Running before any PDU round-trip has happened. Unit tests (S7DiscoveryAndSubscribeTests, 4 facts): DiscoverAsync_projects_every_tag_into_the_address_space (3 tags + mixed writable/read-only -> Operate vs ViewOnly asserted), GetHostStatuses_returns_one_row_with_host_port_identity_pre_init, SubscribeAsync_returns_unique_handles_and_UnsubscribeAsync_accepts_them (diagnosticId uniqueness + idempotent double-unsubscribe), Subscribe_publishing_interval_is_floored_at_100ms (accepts 50ms request without throwing -- floor is applied internally). Uses a RecordingAddressSpaceBuilder stub that implements IVariableHandle.FullReference + MarkAsAlarmCondition (throws NotImplementedException since the S7 driver never calls it -- alarms out of scope). 57/57 S7 unit tests pass. dotnet build clean. All 5 capability interfaces (IDriver/ITagDiscovery/IReadable/IWritable/ISubscribable/IHostConnectivityProbe) now implemented -- the S7 driver surface is on par with the Modbus driver, minus the extended data types (Int64/UInt64/Float64/String/DateTime deferred per PR 64). 2026-04-19 00:16:10 -04:00
5e318a1ab6 Merge pull request 'Phase 3 PR 64 -- S7 IReadable + IWritable via S7.Net' (#63) from phase-3-pr64-s7-read-write into v2 2026-04-19 00:12:59 -04:00
Joseph Doherty
394d126b2e Phase 3 PR 64 -- S7 IReadable + IWritable via S7.Net string-based Plc.ReadAsync/WriteAsync. Adds IReadable + IWritable capability interfaces to S7Driver, routing reads/writes through S7netplus's string-address API (Plc.ReadAsync(string, ct) / Plc.WriteAsync(string, object, ct)). All operations serialize on the class's SemaphoreSlim Gate because S7netplus mandates one Plc connection per PLC with client-side serialization -- parallel reads against a single S7 CPU queue wire-side anyway and just eat connection-resource budget. Supported data types in this PR: Bool, Byte, Int16, UInt16, Int32, UInt32, Float32. S7.Net's string-based read returns UNSIGNED boxed values (DBX=bool, DBB=byte, DBW=ushort, DBD=uint); the driver reinterprets them into the requested S7DataType via the (DataType, Size, raw) switch: unchecked short-cast for Int16, unchecked int-cast for Int32, BitConverter.UInt32BitsToSingle for Float32. Writes inverse the conversion -- Int16 -> unchecked ushort cast, Int32 -> unchecked uint cast, Float32 -> BitConverter.SingleToUInt32Bits -- before handing to S7.Net's WriteAsync. This avoids a second PLC round-trip that a typed ReadAsync(DataType, db, offset, VarType, ...) overload would need. Int64, UInt64, Float64, String, DateTime throw NotSupportedException (-> BadNotSupported StatusCode); S7 STRING has non-trivial header semantics + LReal/DateTime need typed S7.Net API paths, both land in a follow-up PR when scope demands. InitializeAsync now parses every tag's Address string via S7AddressParser at init time. Bad addresses throw FormatException and flip health to Faulted -- callers can't register a broken driver. The parsed form goes into _parsedByName so Read/Write can consult Size/BitOffset without re-parsing per operation. StatusCode mapping in catch chain: unknown tag name -> BadNodeIdUnknown (0x80340000), unsupported data type -> BadNotSupported (0x803D0000), read-only tag write attempt -> BadNotWritable (0x803B0000), S7.Net PlcException (carries PUT/GET-disabled signal on S7-1200/1500) -> BadDeviceFailure (0x80550000) so operators see a TIA-Portal config problem rather than a transient-fault false flag per driver-specs.md \u00A75, any other runtime exception on read -> BadCommunicationError (0x80050000) to distinguish socket/timeout from tag-level faults. Write generic-exception path stays BadInternalError because write failures can legitimately be driver-side value-range problems. Unit tests (S7DriverReadWriteTests, 3 facts): Initialize_rejects_invalid_tag_address_and_fails_fast -- Tags with a malformed address must throw at InitializeAsync rather than producing a half-healthy driver; ReadAsync_without_initialize_throws_InvalidOperationException + WriteAsync_without_initialize_throws_InvalidOperationException -- pre-init calls hit RequirePlc and throw the uniform 'not initialized' message. Wire-level round-trip coverage (integration test against a live S7-1500 or a mock S7 server) is deferred -- S7.Net doesn't ship an in-process fake and a conformant mock is non-trivial. 53/53 Modbus.Driver.S7.Tests pass (50 parser + 3 read/write). dotnet build clean. 2026-04-19 00:10:41 -04:00
0eab1271be Merge pull request 'Phase 3 PR 63 -- S7AddressParser (DB/M/I/Q/T/C grammar)' (#62) from phase-3-pr63-s7-address-parser into v2 2026-04-19 00:08:27 -04:00
Joseph Doherty
d5034c40f7 Phase 3 PR 63 -- S7AddressParser for DB/M/I/Q/T/C address strings. Adds S7AddressParser + S7ParsedAddress + S7Area + S7Size to the Driver.S7 project. Grammar follows driver-specs.md \u00A75 + Siemens TIA Portal / STEP 7 Classic convention: (1) Data blocks: DB{n}.DB{X|B|W|D}{offset}[.bit] where X=bit (requires .bit suffix 0-7), B=byte, W=word (16-bit), D=dword (32-bit). (2) Merkers: MB{n}, MW{n}, MD{n}, or M{n}.{bit} for bit access. (3) Inputs + Outputs: same {B|W|D} prefix or {n}.{bit} pattern as M. (4) Timers: T{n}. (5) Counters: C{n}. Output is an immutable S7ParsedAddress record struct with Area (DataBlock / Memory / Input / Output / Timer / Counter), DbNumber (only meaningful for DataBlock), Size (Bit / Byte / Word / DWord), ByteOffset (also timer/counter number when Area is Timer/Counter), BitOffset (0-7 for Size=Bit; 0 otherwise). Case-insensitive via ToUpperInvariant, whitespace trimmed on entry. Parse throws FormatException with the offending input echoed in the message; TryParse returns bool for config-validation callers that can't afford exceptions (e.g. Admin UI tag-editor live validation). Strict rejection policy -- 16 garbage cases covered in the theory test: empty/whitespace input, unknown area letter (Z0), DB without number/tail, DB bit size without .bit suffix, bit offset 8+, word/dword with .bit suffix, DB number 0 (must be >=1), non-numeric DB number, unknown size letter (Q), M without offset, M bit access without .bit, bit 8, negative offset, non-digit offset, non-numeric timer. Strict rejection surfaces config errors at driver-init time rather than as BadInternalError on every Read against the bad tag. No driver code wires through yet -- PR 64 is where IReadable/IWritable consume S7ParsedAddress and translate into S7netplus Plc.ReadAsync calls (the S7.Net address grammar is a strict subset of what we accept, and the parser's S7ParsedAddress is the bridge). Unit tests (S7AddressParserTests, 50 facts): parse-valid theories for DB/M/I/Q/T/C covering all size variants + edge bit offsets 0 and 7; case-insensitive + whitespace-trim theory; reject-invalid theory with 16 garbage cases; TryParse round-trip for valid and invalid inputs. 50/50 pass, dotnet build clean. 2026-04-19 00:06:24 -04:00
5e67c49f7c Merge pull request 'Phase 3 PR 62 -- Siemens S7 native driver project scaffold' (#61) from phase-3-pr62-s7-driver-scaffold into v2 2026-04-19 00:05:17 -04:00
Joseph Doherty
0575280a3b Phase 3 PR 62 -- Siemens S7 native driver project scaffold (S7comm via S7netplus). First non-Modbus in-process driver. Creates src/ZB.MOM.WW.OtOpcUa.Driver.S7 (.NET 10, x64 -- S7netplus is managed, no bitness constraint like MXAccess) + tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests + slnx entries. Depends on S7netplus 0.20.0 which is the latest version on NuGet resolvable in this cache (0.21.0 per driver-specs.md is not yet published; 0.20.0 covers the same Plc+CpuType+ReadAsync surface). S7DriverOptions captures the connection settings documented in driver-specs.md \u00A75: Host, Port (default 102 ISO-on-TCP), CpuType (default S71500 per most-common deployment), Rack=0, Slot=0 (S7-1200/1500 onboard PN convention; S7-300/400 operators must override to slot 2 or 3), Timeout=5s, Tags list + Probe settings with default MW0 probe address. S7TagDefinition uses S7.Net-style address strings (DB1.DBW0, M0.0, I0.0, QD4) with an S7DataType enum (Bool, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64, Float32, Float64, String, DateTime -- the full type matrix from the spec); StringLength defaults to 254 (S7 STRING max). S7Driver implements the IDriver-only subset per the PR plan: InitializeAsync opens a managed Plc with the configured CpuType + Host + Rack + Slot, pins WriteTimeout / ReadTimeout on the underlying TcpClient, awaits Plc.OpenAsync with a linked CTS bounded by Options.Timeout so the ISO handshake itself respects the configured bound; health transitions Unknown -> Initializing -> Healthy on success or Unknown -> Initializing -> Faulted on handshake failure, with a best-effort Plc.Close() on the faulted path so retries don't leak the TcpClient. ShutdownAsync closes the Plc and flips health back to Unknown. DisposeAsync routes through ShutdownAsync + disposes the SemaphoreSlim. Internal Gate + Plc accessors are exposed to the test project (InternalsVisibleTo) so PRs 63-65 can stack read/write/subscribe on the same serialization semaphore per the S7netplus documented 'one Plc per PLC, SemaphoreSlim-serialized' pattern. ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe are all deliberately omitted from this PR -- they depend on the S7AddressParser (PR 63) and land sequenced in PRs 64-65. Unit tests (S7DriverScaffoldTests, 5 facts): default options target S7-1500 / port 102 / slot 0, default probe interval 5s, tag defaults to writable with StringLength 254, driver reports DriverType=S7 + Unknown health pre-init, Initialize against RFC-5737 reserved IP 192.0.2.1 with 250ms timeout transitions to Faulted and throws (tests the connect-failure path doesn't leave the driver in an ambiguous state). 5/5 pass. dotnet build ZB.MOM.WW.OtOpcUa.slnx: 0 errors. No regression in Modbus / Galaxy suites. PR 63 ships S7AddressParser next, PR 64 wires IReadable/IWritable over S7netplus, PR 65 adds discovery + polling-overlay subscribe + probe. 2026-04-19 00:03:09 -04:00
8150177296 Merge pull request 'Phase 2 PR 61 -- Close V1_ARCHIVE_STATUS.md: Streams D + E done' (#60) from phase-2-pr61-scrub-v1-archive-residue into v2 2026-04-18 23:22:58 -04:00
Joseph Doherty
56d8af8bdb Phase 2 PR 61 -- Close V1_ARCHIVE_STATUS.md; Phase 2 Streams D + E done. Purely a documentation-closure PR. The v1 archive deletion itself happened across earlier PRs: PR 2 on phase-2-stream-d archive-marked the four v1 projects (IsTestProject=false so dotnet test slnx bypassed them); Phase 3 PR 18 deleted the archived project source trees. What remained on disk was stale bin/obj residue from pre-deletion builds -- git never tracked those, so removing them from the working tree is cosmetic only (no source-file diff in this PR). What this PR actually changes: V1_ARCHIVE_STATUS.md is rewritten from 'Deletion plan (Phase 2 PR 3)' pre-work prose to a CLOSED retrospective that (a) lists all five v1 directories as deleted with check-marks (src/OtOpcUa.Host, src/Historian.Aveva, tests/Historian.Aveva.Tests, tests/Tests.v1Archive, tests/IntegrationTests), (b) names the parity-bar tests that now fill the role the 494 v1 tests originally held (Driver.Galaxy.E2E cross-FX subprocess parity + stability-findings regression, per-component *.Tests projects, Driver.Modbus.IntegrationTests, LiveStack/ smoke tests), and (c) gives the closure timeline connecting PR 2 -> Phase 3 PR 18 -> this PR 61. Also added the Modbus TCP driver family as parity coverage that didn't exist in v1 (DL205 + S7-1500 + Mitsubishi MELSEC via pymodbus sim). Stream D (retire legacy Host) has been effectively done since Phase 3 PR 18; Stream E (parity validation) is done since PR 2 landed the Driver.Galaxy.E2E project with HostSubprocessParityTests + HierarchyParityTests + StabilityFindingsRegressionTests. This PR exists to definitively close the two pending Phase 2 tasks on the task list and give future-me (or anyone picking up Phase 2 retrospectives) a single 'what actually happened' doc instead of a 'what we plan to do' prose that didn't match reality. dotnet build ZB.MOM.WW.OtOpcUa.slnx: 0 errors, 200 warnings (all xunit1051 cancellation-token analyzer advisories, unchanged from v2 tip). No test regressions -- no source code changed. 2026-04-18 23:20:54 -04:00
be8261a4ac Merge pull request 'Phase 3 PR 60 -- Mitsubishi MELSEC quirk integration tests' (#59) from phase-3-pr60-mitsubishi-quirk-tests into v2 2026-04-18 23:10:36 -04:00
65de2b4a09 Merge pull request 'Phase 3 PR 59 -- MelsecAddress helper with family selector (hex vs octal X/Y)' (#58) from phase-3-pr59-melsec-address-helper into v2 2026-04-18 23:10:29 -04:00
fccb566a30 Merge pull request 'Phase 3 PR 58 -- Mitsubishi MELSEC pymodbus profile + smoke' (#57) from phase-3-pr58-mitsubishi-sim-profile into v2 2026-04-18 23:10:21 -04:00
9ccc7338b8 Merge pull request 'Phase 3 PR 57 -- S7 byte-order + fingerprint integration tests' (#56) from phase-3-pr57-s7-quirk-tests into v2 2026-04-18 23:10:14 -04:00
e33783e042 Merge pull request 'Phase 3 PR 56 -- Siemens S7-1500 pymodbus profile + smoke' (#55) from phase-3-pr56-s7-sim-profile into v2 2026-04-18 23:10:07 -04:00
Joseph Doherty
a44fc7a610 Phase 3 PR 60 -- Mitsubishi MELSEC quirk integration tests against mitsubishi pymodbus profile. Seven facts in MitsubishiQuirkTests covering the quirks documented in docs/v2/mitsubishi.md that are testable end-to-end via pymodbus: (1) Mitsubishi_D0_fingerprint_reads_0x1234 -- MELSEC operators reserve D0 as a fingerprint word so Modbus clients can verify they're hitting the right Device Assignment block; test reads HR[0]=0x1234 via DRegisterToHolding('D0') helper. (2) Mitsubishi_Float32_CDAB_decodes_1_5f_from_D100 -- reads HR[100..101] with WordSwap AND BigEndian; asserts WordSwap==1.5f AND BigEndian!=1.5f, proving (a) MELSEC uses CDAB default same as DL260, (b) opposite of S7 ABCD, (c) driver flag is not a no-op. (3) Mitsubishi_D10_is_binary_not_BCD -- reads HR[10]=0x04D2 as Int16 and asserts value 1234 (binary decode), contrasting with DL205's BCD-by-default convention. (4) Mitsubishi_D10_as_BCD_throws_because_nibble_is_non_decimal -- reads same HR[10] as Bcd16 and asserts StatusCode != 0 because nibble 0xD fails BCD validation; proves the BCD decoder fails loud when the tag config is wrong rather than silently returning garbage. (5) Mitsubishi_QLiQR_X210_hex_maps_to_DI_528_reads_ON -- reads FC02 at the MelsecAddress.XInputToDiscrete('X210', Q_L_iQR)-resolved address (=528 decimal) and asserts ON; proves the hex-parsing path end-to-end. (6) Mitsubishi_family_trap_X20_differs_on_Q_vs_FX -- unit-level proof in the integration file so the headline family trap is visible to anyone filtering by Device=Mitsubishi. (7) Mitsubishi_M512_maps_to_coil_512_reads_ON -- reads FC01 at MRelayToCoil('M512')=512 (decimal) and asserts ON; proves the decimal M-relay path. Test fixture pattern: single MitsubishiQuirkTests class with a shared ShouldRun + NewDriverAsync helper rather than per-quirk classes (contrast with DL205's per-quirk splits). MELSEC per-model differentiation is handled by MelsecFamily enum on the helper rather than per-PR -- so one quirk file + one family enum covers Q/L/iQ-R/FX/iQ-F, and a new PLC family just adds an enum case instead of a new test class. 8/8 Mitsubishi integration tests pass (1 smoke + 7 quirk). 176/176 Modbus.Tests unit suite still green. S7 + DL205 integration tests can be run against their respective profiles by swapping MODBUS_SIM_PROFILE and restarting the pymodbus sim -- each family gates on its profile env var so no cross-family test pollution. 2026-04-18 23:07:00 -04:00
Joseph Doherty
d4c1873998 Phase 3 PR 59 -- MelsecAddress helper for MELSEC X/Y hex-vs-octal family trap + D/M bank bases. Adds MelsecAddress static class with XInputToDiscrete, YOutputToCoil, MRelayToCoil, DRegisterToHolding helpers and a MelsecFamily enum {Q_L_iQR, F_iQF} that drives whether X/Y addresses are parsed as hex (Q-series convention) or octal (FX-series convention). This is the #1 MELSEC driver bug source per docs/v2/mitsubishi.md: the string 'X20' on a MELSEC-Q means DI 32 (hex 0x20) while the same string on an FX3U means DI 16 (octal 0o20). The helper forces the caller to name the family explicitly; no 'sensible default' because wrong defaults just move the bug. Key design decisions: (1) Family is an enum argument, not a helper-level static-selector, because real deployments have BOTH Q-series and FX-series PLCs on the same gateway -- one driver instance per device means family must be per-tag, not per-driver. (2) Bank base is a ushort argument defaulting to 0. Real QJ71MT91/LJ71MT91 assignment blocks commonly place X at DI 8192+, Y at coil 8192+, etc. to leave the low-address range for D-registers; the helper takes the site's configured base as runtime config rather than a compile-time constant. Matches the 'driver opt-in per tag' pattern DirectLogicAddress established for DL260. (3) M-relay and D-register are DECIMAL on every MELSEC family -- docs explicitly; the MELSEC confusion is only about X/Y, not about data registers or internal relays. Helpers reject non-numeric M/D addresses and honor bank bases the same way. (4) Parser walks digits manually for both hex and octal (instead of int.Parse with NumberStyles) so non-hex / non-octal characters give a clear ArgumentException with the offending char + family name. Prevents a subtle class of bugs where int.Parse('X20', Hex) silently returns 32 even for F_iQF callers. Unit tests (MelsecAddressTests, 34 facts): XInputToDiscrete_QLiQR_parses_hex theory (X0, X9, XA, XF, X10, X20, X1FF + lowercase); XInputToDiscrete_FiQF_parses_octal theory (X0, X7, X10, X20, X777); YOutputToCoil equivalents; Same_address_string_decodes_differently_between_families (the headline trap, X20 => 32 on Q vs 16 on FX); reject-non-octal / reject-non-hex / reject-empty / overflow facts; honors-bank-base for X and M and D. 176/176 Modbus.Tests pass (143 prior + 34 new Melsec). No driver core changes -- this is purely a new helper class in the Driver.Modbus project. PR 60 wires it into integration tests against the mitsubishi pymodbus profile. 2026-04-18 23:04:52 -04:00
Joseph Doherty
f52b7d8979 Phase 3 PR 58 -- Mitsubishi MELSEC pymodbus profile + smoke integration test. Adds tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/Pymodbus/mitsubishi.json modelling a representative MELSEC Modbus Device Assignment block: D0..D1023 -> HR[0..1023], M-relay marker at coil 512 (cell 32) and X-input marker at DI 528 (cell 33). Covers the canonical MELSEC quirks from docs/v2/mitsubishi.md: D0 fingerprint at HR[0]=0x1234 so clients can verify the assignment parameter block is in effect, scratch HR 200..209 mirroring dl205/s7_1500/standard scratch range for uniform smoke tests, Float32 1.5f at HR[100..101] in CDAB word order (HR[100]=0, HR[101]=0x3FC0) -- same as DL260, OPPOSITE of S7 ABCD, confirms MELSEC-family driver profile default must be ByteOrder.WordSwap. Int32 0x12345678 CDAB at HR[300..301]. D10 = binary 1234 (0x04D2) proves MELSEC is BINARY-by-default (opposite of DL205 BCD-by-default quirk) -- reading D10 with Bcd16 data type would throw InvalidDataException on nibble 0xD. M-relay marker cell moved to address 32 (coil 512) to avoid shared-block collision with D0 uint16 marker at cell 0; pymodbus shared-blocks=true semantics allow only one type per cell index, so Modbus-coil-0 can't coexist with Modbus-HR-0 on the same sim. Same pattern we applied to dl205 profile (X-input bank at cell 1, not cell 0, to coexist with V0 marker). Adds Mitsubishi/ test directory with MitsubishiProfile.cs (SmokeHoldingRegister=200, SmokeHoldingValue=7890, BuildOptions with probe-disabled + 2s timeout) and MitsubishiSmokeTests.cs (Mitsubishi_roundtrip_write_then_read_of_holding_register single fact that writes 7890 at HR[200] then reads back, gated on MODBUS_SIM_PROFILE=mitsubishi). csproj copies Mitsubishi/** as PreserveNewest. Per-model differences (FX5U firmware gate, QJ71MT91 FC22/23 absence, FX/iQ-F octal vs Q/L/iQ-R hex X-addressing) are handled in the MelsecAddress helper (PR 59) + per-model test classes (PR 60). Verified: smoke 1/1 passes against live mitsubishi sim. Prior S7 tests 4/4 still green when swapped back. Modbus.Tests unit suite 143/143. 2026-04-18 23:02:29 -04:00
Joseph Doherty
b54724a812 Phase 3 PR 57 -- S7 byte-order + fingerprint integration tests against s7_1500 pymodbus profile. Three facts in new S7_ByteOrderTests class: (1) S7_Float32_ABCD_decodes_1_5f_from_HR100 reads HR[100..101] with ModbusByteOrder.BigEndian AND with WordSwap on the same wire bytes; asserts BigEndian==1.5f AND WordSwap!=1.5f -- proving both that Siemens S7 stores Float32 in ABCD word order (opposite of DL260 CDAB) and that the ByteOrder flag is not a no-op on the same wire buffer. (2) S7_Int32_ABCD_decodes_0x12345678_from_HR300 reads HR[300]=0x1234 + HR[301]=0x5678 with BigEndian and asserts the reassembled Int32 = 0x12345678; documents the contrast with DL260 CDAB Int32 encoding. (3) S7_DB1_fingerprint_marker_at_HR0_reads_0xABCD reads HR[0]=0xABCD -- real MB_SERVER deployments reserve DB1.DBW0 as a fingerprint so clients can verify they're pointing at the right DB, protecting against typos in the MB_SERVER.MB_HOLD_REG.DB_number parameter. No driver code changes -- the ByteOrder.BigEndian path has existed since PR 24; this PR exists to lock in the S7-specific semantics at the integration level so future refactors of NormalizeWordOrder can't silently break S7. All 3 tests gate on MODBUS_SIM_PROFILE=s7_1500 so they skip cleanly against dl205 or standard profiles. Verified end-to-end: 4/4 S7 integration tests pass (1 smoke from PR 56 + 3 new). No regression in driver unit tests. Per the per-quirk-PR plan: the S7 quirks NOT testable via pymodbus sim (MB_SERVER STATUS 0x8383 optimized-DB behavior, port-per-connection semantics, CP 343-1 Lean license rejection, STOP-mode non-determinism) remain in docs/v2/s7.md as design guidance for driver users rather than automated tests -- they're TIA-Portal-side or CP-hardware-side behaviors that pymodbus cannot reproduce without custom Python actions. 2026-04-18 22:58:44 -04:00
Joseph Doherty
10c724b5b6 Phase 3 PR 56 -- Siemens S7-1500 pymodbus profile + smoke integration test. Adds tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/Pymodbus/s7_1500.json modelling the SIMATIC S7-1500 + MB_SERVER default deployment documented in docs/v2/s7.md: DB1.DBW0 = 0xABCD fingerprint marker (operators reserve this so clients can verify they're talking to the right DB), scratch HR range 200..209 for write-roundtrip tests mirroring dl205.json + standard.json, Float32 1.5f at HR[100..101] in ABCD word order (high word first -- OPPOSITE of DL260 CDAB), Int32 0x12345678 at HR[300..301] in ABCD. Also seeds a coil at bit-addr 400 (= cell 25 bit 0) and a discrete input at bit-addr 500 (= cell 31 bit 0) so future S7-specific tests for FC01/FC02 have stable markers. shared blocks=true to match the proven dl205.json pattern (pymodbus's bits/uint16 cells coexist cleanly when addresses don't collide). Write list references cells (0, 25, 100-101, 200-209, 300-301), not bit addresses -- pymodbus's write-range entries are cell-indexed, not bit-indexed. Adds tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/S7/ directory with S7_1500Profile.cs (mirrors DL205Profile pattern: SmokeHoldingRegister=200, SmokeHoldingValue=4321, BuildOptions tags + probe-disabled + 2s timeout) and S7_1500SmokeTests.cs (single fact S7_1500_roundtrip_write_then_read_of_holding_register that writes SmokeHoldingValue then reads it back, asserting both write status 0 and read status 0 + value equality). Gates on MODBUS_SIM_PROFILE=s7_1500 so the test skips cleanly against other profiles. csproj updated to copy S7/** to test output as PreserveNewest (pattern matching DL205/**). Pymodbus/serve.ps1 ValidateSet extended from {standard,dl205} to {standard,dl205,s7_1500,mitsubishi} -- mitsubishi.json lands in PR 58 but the validator slot is claimed now so the serve.ps1 diff is one line in this PR and zero lines in future PRs. Verified end-to-end: smoke test 1/1 passes against the running pymodbus s7_1500 profile (localhost:5020 FC06 write of 4321 at HR[200] + FC03 read back). 143/143 Modbus.Tests pass, no regression in driver code because this PR is purely test-asset. Per-quirk S7 integration tests (ABCD word order default, FC23 IllegalFunction, MB_SERVER STATUS 0x8383 behaviour, port-per-connection semantics) land in PR 57+. 2026-04-18 22:57:03 -04:00
8c89d603e8 Merge pull request 'Phase 3 PR 55 -- Mitsubishi MELSEC Modbus TCP quirks research doc' (#54) from phase-3-pr55-mitsubishi-research-doc into v2 2026-04-18 22:54:09 -04:00
299bd4a932 Merge pull request 'Phase 3 PR 54 -- Siemens S7 Modbus TCP quirks research doc' (#53) from phase-3-pr54-s7-research-doc into v2 2026-04-18 22:54:02 -04:00
Joseph Doherty
c506ea298a Phase 3 PR 55 -- Mitsubishi MELSEC Modbus TCP quirks research document. 451-line doc at docs/v2/mitsubishi.md mirroring the docs/v2/dl205.md template for the MELSEC family (Q-series + QJ71MT91, L-series + LJ71MT91, iQ-R + RJ71EN71, iQ-R built-in Ethernet, iQ-F FX5U built-in, FX3U + FX3U-ENET / FX3U-ENET-P502, FX3GE built-in). Like Siemens S7, MELSEC Modbus is a patchwork of per-site-configured add-on modules rather than a fixed firmware stack, but the MELSEC-specific traps are different enough to warrant their own document. Key findings worth flagging for the PR 58+ implementation track: (1) MODULE NAMING TRAP -- QJ71MB91 is SERIAL RTU, not TCP. The Q-series TCP module is QJ71MT91. Driver docs + config UI should surface this clearly because the confusion costs operators hours when they try to connect to an RS-232 module via Ethernet. (2) NO CANONICAL MAPPING -- every MELSEC Modbus site has a unique 'Modbus Device Assignment Parameter' block of up to 16 assignments (each binding a MELSEC device range like D0..D1023 to a Modbus-address range); the driver must treat the mapping as runtime config, not device-family profile. (3) X/Y BASE DEPENDS ON FAMILY -- Q/L/iQ-R use HEX notation for X/Y (X20 = decimal 32), FX/iQ-F use OCTAL (X20 = decimal 16, same as DL260); iQ-F has a GX Works3 project toggle that can flip this. Single biggest off-by-N source in MELSEC driver code -- driver address helper must take a family selector. (4) Word order CDAB across Q/L/iQ-R/iQ-F by default (CPU-level, not module-level) -- no user-configurable swap on the server side. FX5U's SWAP instruction is for CLIENT mode only. Driver Mitsubishi profile default must be ByteOrder.WordSwap, matching DL260 but OPPOSITE of Siemens S7. (5) D-registers are BINARY by default (opposite of DL205's BCD-by-default). FNC 18 BCD / FNC 19 BIN instructions confirm binary-by-default in the ladder. Caller must explicitly opt-in to Bcd16/Bcd32 tags when the ladder stores BCD, same pattern as DL205 but the default is inverted. (6) FX5U FIRMWARE GATE -- needs firmware >= 1.060 for native Modbus TCP server; older firmware is client-only. Surface a clear capability error on connect. (7) FX3U PORT 502 SPLIT -- the standard FX3U-ENET cannot bind port 502 (lower port range restricted on the firmware); only FX3U-ENET-P502 can. FX3U-ENET-ADP has no Modbus at all and is a common operator mis-purchase -- driver should surface 'module does not support Modbus' as a distinct error, not 'connection refused'. (8) QJ71MT91 does NOT support FC22 (Mask Write) or FC23 (Read-Write Multiple). iQ-R and iQ-F do. Driver bulk-read optimization must gate on module capability. (9) MAX CONNECTIONS -- 16 simultaneous on Q/L/iQ-R, 8 on FX5U and FX3U-ENET. (10) STOP-mode writes -- configurable on Q/L/iQ-R/iQ-F (default = accept writes even in STOP), always rejected with exception 04 on FX3U-ENET. Per-model test differentiation section names the tests Mitsubishi_QJ71MT91_*, Mitsubishi_FX5U_*, Mitsubishi_FX3U_ENET_*, with a shared Mitsubishi_Common_* fixture for CDAB-word-order + binary-not-BCD + standard-exception-codes tests. 17 cited references including primary Mitsubishi manuals (SH-080446 for QJ71MT91, JY997D56101 for FX5, SH-081259 for iQ-R Ethernet, JY997D18101 for FX3U-ENET) plus Ignition / Kepware / Fernhill / HMS third-party driver release notes. Three unconfirmed rumours flagged explicitly: iQ-R RJ71EN71 early firmware rumoured ABCD word order (no primary source), QJ71MT91 firmware < 2010-05 FC15 odd-byte-count truncation (forum report only), FX3U-ENET firmware < 1.14 out-of-order TxId echoes under load (unreproducible on bench). Pure documentation PR -- no code, no tests. Per-quirk implementation lands in PRs 58+. Research conducted 2026-04-18. 2026-04-18 22:51:28 -04:00
Joseph Doherty
9e2b5b330f Phase 3 PR 54 -- Siemens S7 Modbus TCP quirks research document. 485-line doc at docs/v2/s7.md mirroring the docs/v2/dl205.md template for the Siemens SIMATIC S7 family (S7-1200 / S7-1500 / S7-300 / S7-400 / ET 200SP / CP 343-1 / CP 443-1 / CP 343-1 Lean / MODBUSPN). Siemens S7 is fundamentally different from DL260: there is no fixed Modbus memory map baked into firmware -- every deployment runs MB_SERVER (S7-1200/1500/ET 200SP), MODBUSCP (S7-300/400 + CP), or MODBUSPN (S7-300/400 PN) library blocks wired up to user DBs via the MB_HOLD_REG / ADDR parameters. The driver's job is therefore to handle per-site CONFIG rather than per-family QUIRKS, and the doc makes that explicit. Key findings worth flagging for the PR 56+ implementation track: (1) S7 has no fixed memory map -- must accept per-site DriverConfig, cannot assume vendor-standard layout. (2) MB_SERVER requires NON-optimized DBs in TIA Portal; optimized DBs cause the library to return STATUS 0x8383 on every access -- the single most common S7 Modbus deployment bug in the field. (3) Word order is ABCD by default (big-endian bytes + big-endian words) across all Siemens S7 Modbus paths, which is the OPPOSITE of DL260 CDAB -- the Modbus driver's S7 profile default must be ByteOrder.BigEndian, not WordSwap. (4) MB_SERVER listens on ONE port per FB instance; multi-client support requires running MB_SERVER on 502 / 503 / 504 / ... simultaneously -- most clients assume port 502 multiplexes, which is wrong on S7. (5) CP 343-1 Lean is SERVER-ONLY and requires the separate 2XV9450-1MB00 MODBUS TCP CP library license; client mode calls return immediate error on Lean. (6) MB_SERVER does NOT filter Unit ID, accepts any value. Means the driver can't use Unit ID to detect 'direct vs gateway' topology. (7) FC23 Read-Write Multiple, FC22 Mask Write, FC20/21 File Records, FC43 Device Identification all return exception 01 Illegal Function on every S7 variant -- the driver MUST NOT attempt bulk-read optimisation via FC23 when talking to S7. (8) STOP-mode read/write behaviour is non-deterministic across firmware bands: reads may return cached data (library internal buffer), writes may succeed-silently or return exception 04 depending on CPU firmware version -- flagged as 'driver treats both as unavailable, do not distinguish'. Unconfirmed rumours flagged separately: 'V2.0+ reverses float byte order' claim (cited but not reproduced), STOP-mode caching location (folklore, no primary source). Per-model test differentiation section names the tests as S7_<model>_<behavior> matching the DL205 template convention (e.g. S7_1200_MB_SERVER_requires_non_optimized_DB, S7_343_1_Lean_rejects_client_mode, S7_FC23_returns_IllegalFunction). 31 cited references across the Siemens Industry Online Support entry-ID system (68011496 for MB_SERVER FAQ, etc.), TIA Portal library manuals, and three third-party driver vendor release notes (Kepware, Ignition, FactoryTalk). This is a pure documentation PR -- no code, no tests, no csproj changes. Per-quirk implementation lands in PRs 56+. Research conducted 2026-04-18 against latest publicly-available Siemens documentation; STOP-mode behaviour and MB_SERVER versioning specifically cross-checked against Siemens forum answers from 2024-2025. 2026-04-18 22:50:51 -04:00
d5c6280333 Merge pull request 'Phase 3 PR 53 -- Transport reconnect-on-drop + SO_KEEPALIVE (DL260 no-keepalive quirk)' (#52) from phase-3-pr53-dl205-reconnect into v2 2026-04-18 22:35:40 -04:00
476ce9b7c5 Merge pull request 'Phase 3 PR 52 -- Modbus exception-code -> OPC UA StatusCode translation' (#51) from phase-3-pr52-dl205-exception-codes into v2 2026-04-18 22:35:33 -04:00
954bf55d28 Merge pull request 'Phase 3 PR 51 -- DL260 X-input FC02 discrete-input mapping end-to-end test' (#50) from phase-3-pr51-dl205-xinput into v2 2026-04-18 22:35:25 -04:00
9fb3cf7512 Merge pull request 'Phase 3 PR 50 -- DL260 bit-memory helpers (Y/C/X/SP) + coil integration tests' (#49) from phase-3-pr50-dl205-coil-mapping into v2 2026-04-18 22:35:18 -04:00
Joseph Doherty
793c787315 Phase 3 PR 53 -- Transport reconnect-on-drop + SO_KEEPALIVE for DL205 no-keepalive quirk. AutomationDirect H2-ECOM100 does NOT send TCP keepalives per docs/v2/dl205.md behavioral-oddities section -- any NAT/firewall device between the gateway and the PLC can silently close an idle socket after 2-5 minutes of inactivity. The PLC itself never notices and the first SendAsync after the drop would previously surface as IOException / EndOfStreamException / SocketException to the caller even though the PLC is perfectly healthy. PR 53 makes ModbusTcpTransport survive mid-session socket drops: SendAsync wraps the previous body as SendOnceAsync; on the first attempt, if the failure is a socket-layer error (IOException, SocketException, EndOfStreamException, ObjectDisposedException) AND autoReconnect is enabled (default true), the transport tears down the dead socket, calls ConnectAsync to re-establish, and resends the PDU exactly once. Deliberately single-retry -- further failures propagate so the driver health surface reflects the real state, no masking a dead PLC. Protocol-layer failures (e.g. ModbusException with exception code 02) are specifically NOT caught by the reconnect path -- they would just come back with the same exception code after the reconnect, so retrying is wasted wire time. Socket-level vs protocol-level is a discriminator inside IsSocketLevelFailure. Also enables SO_KEEPALIVE on the TcpClient with aggressive timing: TcpKeepAliveTime=30s, TcpKeepAliveInterval=10s, TcpKeepAliveRetryCount=3. Total time-to-detect-dead-socket = 30 + 10*3 = 60s, vs the Windows default 2-hour idle + 9 retries = 2h40min. Best-effort: older OSes that don't expose the fine-grained keepalive knobs silently skip them (catch {}). New ModbusDriverOptions.AutoReconnect bool (default true) threads through to the default transport factory in ModbusDriver -- callers wanting the old 'fail loud on drop' behavior can set AutoReconnect=false, or use a custom transportFactory that ignores the option. Unit tests: ModbusTcpReconnectTests boots a FlakeyModbusServer in-process (real TcpListener on loopback) that serves one valid FC03 response then forcibly shuts down the socket. Transport_recovers_from_mid_session_drop_and_retries_successfully issues two consecutive SendAsync calls and asserts both return valid PDUs -- the second must trigger the reconnect path transparently. Transport_without_AutoReconnect_propagates_drop_to_caller asserts the legacy behavior when the opt-out is taken. Validates real socket semantics rather than mocked exceptions. 142/142 Modbus.Tests pass (113 prior + 2 mapper + 2 reconnect + 25 accumulated across PRs 45-52); 11/11 DL205 integration tests still pass with MODBUS_SIM_PROFILE=dl205 -- no regression from the transport change. 2026-04-18 22:32:13 -04:00
Joseph Doherty
cde018aec1 Phase 3 PR 52 -- Modbus exception-code -> OPC UA StatusCode translation. Before this PR every server-side Modbus exception AND every transport-layer failure collapsed to BadInternalError (0x80020000) in the driver's Read/Write results, making field diagnosis 'is this a tag misconfig or a driver bug?' impossible from the OPC UA client side. PR 52 adds a MapModbusExceptionToStatus helper that translates per spec: 01 Illegal Function -> BadNotSupported (0x803D0000); 02 Illegal Data Address -> BadOutOfRange (0x803C0000); 03 Illegal Data Value -> BadOutOfRange; 04 Server Failure -> BadDeviceFailure (0x80550000); 05/06 Acknowledge/Busy -> BadDeviceFailure; 0A/0B Gateway -> BadCommunicationError (0x80050000); unknown -> BadInternalError fallback. Non-Modbus failures (socket drop, timeout, malformed frame) in ReadAsync are now distinguished from tag-level faults: they map to BadCommunicationError so operators check network/PLC reachability rather than tag definitions. Why per-DL205: docs/v2/dl205.md documents DL205/DL260 returning only codes 01-04 with specific triggers -- exception 04 specifically means 'CPU in PROGRAM mode during a protected write', which is operator-recoverable by switching the CPU to RUN; surfacing it as BadDeviceFailure (not BadInternalError) makes the fix obvious. Changes in ModbusDriver: Read catch-chain now ModbusException first (-> mapper), generic Exception second (-> BadCommunicationError); Write catch-chain same pattern but generic Exception stays BadInternalError because write failures can legitimately come from EncodeRegister (out-of-range value) which is a driver-layer fault. Unit tests: MapModbusExceptionToStatus theory exercising every code in the table including the 0xFF fallback; Read_surface_exception_02_as_BadOutOfRange with an ExceptionRaisingTransport that forces code 02; Write_surface_exception_04_as_BadDeviceFailure for CPU-mode faults; Read_non_modbus_failure_maps_to_BadCommunicationError with a NonModbusFailureTransport that raises EndOfStreamException. 115/115 Modbus.Tests pass. Integration test: DL205ExceptionCodeTests.DL205_FC03_at_unmapped_register_returns_BadOutOfRange reads HR[16383] which is beyond the seeded uint16 cells on the dl205.json profile; pymodbus returns exception 02 and the driver surfaces BadOutOfRange. 11/11 DL205 integration tests pass with MODBUS_SIM_PROFILE=dl205. 2026-04-18 22:28:37 -04:00
349 changed files with 43405 additions and 3284 deletions

View File

@@ -9,9 +9,16 @@
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.Modbus/ZB.MOM.WW.OtOpcUa.Driver.Modbus.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.S7/ZB.MOM.WW.OtOpcUa.Driver.S7.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/ZB.MOM.WW.OtOpcUa.Driver.AbCip.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Client.Shared/ZB.MOM.WW.OtOpcUa.Client.Shared.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Client.CLI/ZB.MOM.WW.OtOpcUa.Client.CLI.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Client.UI/ZB.MOM.WW.OtOpcUa.Client.UI.csproj"/>
<Project Path="src/ZB.MOM.WW.OtOpcUa.Analyzers/ZB.MOM.WW.OtOpcUa.Analyzers.csproj"/>
</Folder>
<Folder Name="/tests/">
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Core.Abstractions.Tests/ZB.MOM.WW.OtOpcUa.Core.Abstractions.Tests.csproj"/>
@@ -26,8 +33,16 @@
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Tests/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Client.Shared.Tests/ZB.MOM.WW.OtOpcUa.Client.Shared.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests/ZB.MOM.WW.OtOpcUa.Client.CLI.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Client.UI.Tests/ZB.MOM.WW.OtOpcUa.Client.UI.Tests.csproj"/>
<Project Path="tests/ZB.MOM.WW.OtOpcUa.Analyzers.Tests/ZB.MOM.WW.OtOpcUa.Analyzers.Tests.csproj"/>
</Folder>
</Solution>

1
_p54.json Normal file
View File

@@ -0,0 +1 @@
{"title":"Phase 3 PR 54 -- Siemens S7 Modbus TCP quirks research doc","body":"## Summary\n\nAdds `docs/v2/s7.md` (485 lines) covering Siemens SIMATIC S7 family Modbus TCP behavior. Mirrors the `docs/v2/dl205.md` template for future per-quirk implementation PRs.\n\n## Key findings for the implementation track\n\n- **No fixed memory map** — every S7 Modbus server is user-wired via `MB_SERVER`/`MODBUSCP`/`MODBUSPN` library blocks. Driver must accept per-site config, not assume a vendor layout.\n- **MB_SERVER requires non-optimized DBs** (STATUS `0x8383` if optimized). Most common field bug.\n- **Word order default = ABCD** (opposite of DL260). Driver's S7 profile default must be `ByteOrder.BigEndian`, not `WordSwap`.\n- **One port per MB_SERVER instance** — multi-client requires parallel FBs on 503/504/… Most clients assume port 502 multiplexes (wrong on S7).\n- **CP 343-1 Lean is server-only**, requires the `2XV9450-1MB00` license.\n- **FC20/21/22/23/43 all return Illegal Function** on every S7 variant — driver must not attempt FC23 bulk-read optimization for S7.\n- **STOP-mode behavior non-deterministic** across firmware bands — treat both read/write STOP-mode responses as unavailable.\n\nTwo items flagged as unconfirmed rumour (V2.0+ float byte-order claim, STOP-mode caching location).\n\nNo code, no tests — implementation lands in PRs 56+.\n\n## Test plan\n- [x] Doc renders as markdown\n- [x] 31 citations present\n- [x] Section structure matches dl205.md template","head":"phase-3-pr54-s7-research-doc","base":"v2"}

1
_p55.json Normal file
View File

@@ -0,0 +1 @@
{"title":"Phase 3 PR 55 -- Mitsubishi MELSEC Modbus TCP quirks research doc","body":"## Summary\n\nAdds `docs/v2/mitsubishi.md` (451 lines) covering MELSEC Q/L/iQ-R/iQ-F/FX3U Modbus TCP behavior. Mirrors `docs/v2/dl205.md` template for per-quirk implementation PRs.\n\n## Key findings for the implementation track\n\n- **Module naming trap** — `QJ71MB91` is SERIAL RTU, not TCP. TCP module is `QJ71MT91`. Surface clearly in driver docs.\n- **No canonical mapping** — per-site 'Modbus Device Assignment Parameter' block (up to 16 entries). Treat mapping as runtime config.\n- **X/Y hex vs octal depends on family** — Q/L/iQ-R use HEX (X20 = decimal 32); FX/iQ-F use OCTAL (X20 = decimal 16). Helper must take a family selector.\n- **Word order CDAB default** across all MELSEC families (opposite of Siemens S7). Driver Mitsubishi profile default: `ByteOrder.WordSwap`.\n- **D-registers binary by default** (opposite of DL205's BCD default). Caller opts in to `Bcd16`/`Bcd32` when ladder uses BCD.\n- **FX5U needs firmware ≥ 1.060** for Modbus TCP server — older is client-only.\n- **FX3U-ENET vs FX3U-ENET-P502 vs FX3U-ENET-ADP** — only the middle one binds port 502; the last has no Modbus at all. Common operator mis-purchase.\n- **QJ71MT91 does NOT support FC22 / FC23** — iQ-R / iQ-F do. Bulk-read optimization must gate on capability.\n- **STOP-mode writes configurable** on Q/L/iQ-R/iQ-F (default accept), always rejected on FX3U-ENET.\n\nThree unconfirmed rumours flagged separately.\n\nNo code, no tests — implementation lands in PRs 58+.\n\n## Test plan\n- [x] Doc renders as markdown\n- [x] 17 citations present\n- [x] Per-model test naming matrix included (`Mitsubishi_QJ71MT91_*`, `Mitsubishi_FX5U_*`, `Mitsubishi_FX3U_ENET_*`, shared `Mitsubishi_Common_*`)","head":"phase-3-pr55-mitsubishi-research-doc","base":"v2"}

20
ci/ab-server.lock.json Normal file
View File

@@ -0,0 +1,20 @@
{
"_comment": "Pinned libplctag release used by tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/AbServerFixture. ab_server.exe ships inside the *_tools.zip asset on every GitHub release. See docs/v2/test-data-sources.md §2.CI for the GitHub Actions step that consumes this file.",
"repo": "libplctag/libplctag",
"tag": "v2.6.16",
"published": "2026-03-29",
"assets": {
"windows-x64": {
"file": "libplctag_2.6.16_windows_x64_tools.zip",
"sha256": "9b78a3dee73d9cd28ca348c090f453dbe3ad9d07ad6bf42865a9dc3a79bc2232"
},
"windows-x86": {
"file": "libplctag_2.6.16_windows_x86_tools.zip",
"sha256": "fdfefd58b266c5da9a1ded1a430985e609289c9e67be2544da7513b668761edf"
},
"windows-arm64": {
"file": "libplctag_2.6.16_windows_arm64_tools.zip",
"sha256": "d747728e4c4958bb63b4ac23e1c820c4452e4778dfd7d58f8a0aecd5402d4944"
}
}
}

View File

@@ -1,82 +1,72 @@
# Address Space
The address space maps the Galaxy object hierarchy and attribute definitions into an OPC UA browse tree. `LmxNodeManager` builds the tree from data queried by `GalaxyRepositoryService`, while `AddressSpaceBuilder` provides a testable in-memory model of the same structure.
Each driver's browsable subtree is built by streaming nodes from the driver's `ITagDiscovery.DiscoverAsync` implementation into an `IAddressSpaceBuilder`. `GenericDriverNodeManager` (`src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs`) owns the shared orchestration; `DriverNodeManager` (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`) implements `IAddressSpaceBuilder` against the OPC Foundation stack's `CustomNodeManager2`. The same code path serves Galaxy object hierarchies, Modbus PLC registers, AB CIP tags, TwinCAT symbols, FOCAS CNC parameters, and OPC UA Client aggregations — Galaxy is one driver of seven, not the driver.
## Root ZB Folder
## Driver root folder
Every address space starts with a single root folder node named `ZB` (NodeId `ns=1;s=ZB`). This folder is added under the standard OPC UA `Objects` folder via an `Organizes` reference. The reverse reference is registered through `MasterNodeManager.AddReferences` because `BuildAddressSpace` runs after `CreateAddressSpace` has already consumed the external references dictionary.
Every driver's subtree starts with a root `FolderState` under the standard OPC UA `Objects` folder, wired with an `Organizes` reference. `DriverNodeManager.CreateAddressSpace` creates this folder with `NodeId = ns;s={DriverInstanceId}`, `BrowseName = {DriverInstanceId}`, and `EventNotifier = SubscribeToEvents | HistoryRead` so alarm and history-event subscriptions can target the root. The namespace URI is `urn:OtOpcUa:{DriverInstanceId}`.
The root folder has `EventNotifier = SubscribeToEvents` enabled so alarm events propagate up to clients subscribed at the root level.
## IAddressSpaceBuilder surface
## Area Folders vs Object Nodes
`IAddressSpaceBuilder` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAddressSpaceBuilder.cs`) offers three calls:
Galaxy objects fall into two categories based on `template_definition.category_id`:
- `Folder(browseName, displayName)` — creates a child `FolderState` and returns a child builder scoped to it.
- `Variable(browseName, displayName, DriverAttributeInfo attributeInfo)` — creates a `BaseDataVariableState` and returns an `IVariableHandle` the driver keeps for alarm wiring.
- `AddProperty(browseName, DriverDataType, value)` — attaches a `PropertyState` for static metadata (e.g. equipment identification fields).
- **Areas** (`category_id = 13`) become `FolderState` nodes with `FolderType` type definition and `Organizes` references. They represent logical groupings in the Galaxy hierarchy (e.g., production lines, cells).
- **Non-area objects** (AppEngine, Platform, UserDefined, etc.) become `BaseObjectState` nodes with `BaseObjectType` type definition and `HasComponent` references. These represent runtime automation objects that carry attributes.
Drivers drive ordering. Typical pattern: root → folder per equipment → variables per tag. `GenericDriverNodeManager` calls `DiscoverAsync` once on startup and once per rediscovery cycle.
Both node types use `contained_name` as the browse name. When `contained_name` is null or empty, `tag_name` is used as a fallback.
## DriverAttributeInfo → OPC UA variable
## Variable Nodes for Attributes
Each variable carries a `DriverAttributeInfo` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverAttributeInfo.cs`):
Each Galaxy attribute becomes a `BaseDataVariableState` node under its parent object. The variable is configured with:
| Field | OPC UA target |
|---|---|
| `FullName` | `NodeId.Identifier` — used as the driver-side lookup key for Read/Write/Subscribe |
| `DriverDataType` | mapped to a built-in `DataTypeIds.*` NodeId via `DriverNodeManager.MapDataType` |
| `IsArray` | `ValueRank = OneDimension` when true, `Scalar` otherwise |
| `ArrayDim` | declared array length, carried through as metadata |
| `SecurityClass` | stored in `_securityByFullRef` for `WriteAuthzPolicy` gating on write |
| `IsHistorized` | flips `AccessLevel.HistoryRead` + `Historizing = true` |
| `IsAlarm` | drives the `MarkAsAlarmCondition` pass (see below) |
| `WriteIdempotent` | stored in `_writeIdempotentByFullRef`; fed to `CapabilityInvoker.ExecuteWriteAsync` |
- **DataType** -- Mapped from `mx_data_type` via `MxDataTypeMapper` (see [DataTypeMapping.md](DataTypeMapping.md))
- **ValueRank** -- `OneDimension` (1) for arrays, `Scalar` (-1) for scalars
- **ArrayDimensions** -- Set to `[array_dimension]` when the attribute is an array
- **AccessLevel** -- `CurrentReadOrWrite` or `CurrentRead` based on security classification, with `HistoryRead` added for historized attributes
- **Historizing** -- Set to `true` for attributes with a `HistoryExtension` primitive
- **Initial value** -- `null` with `StatusCode = BadWaitingForInitialData` until the first MXAccess callback delivers a live value
The initial value stays `null` with `StatusCode = BadWaitingForInitialData` until the first Read or `ISubscribable.OnDataChange` push lands.
## Primitive Grouping
## CapturingBuilder + alarm sink registration
Galaxy objects can have primitive components (e.g., alarm extensions, history extensions) that attach sub-attributes to a parent attribute. The address space handles this with a two-pass approach:
`GenericDriverNodeManager.BuildAddressSpaceAsync` wraps the supplied builder in a `CapturingBuilder` before calling `DiscoverAsync`. The wrapper observes every `Variable()` call: when a returned `IVariableHandle.MarkAsAlarmCondition(AlarmConditionInfo)` fires, the sink is registered in the manager's `_alarmSinks` dictionary keyed by the variable's `FullReference`. Subsequent `IAlarmSource.OnAlarmEvent` pushes are routed to the matching sink by `SourceNodeId`. This keeps the alarm-wiring protocol declarative — drivers just flag `DriverAttributeInfo.IsAlarm = true` and the materialization of the OPC UA `AlarmConditionState` node is handled by the server layer. See `docs/AlarmTracking.md`.
### First pass: direct attributes
## NodeId scheme
Attributes with an empty `PrimitiveName` are created as direct variable children of the object node. If a direct attribute shares its name with a primitive group, the variable node reference is saved for the second pass.
### Second pass: primitive child attributes
Attributes with a non-empty `PrimitiveName` are grouped by that name. For each group:
1. If a direct attribute variable with the same name already exists, the primitive's child attributes are added as `HasComponent` children of that variable node. This merges alarm/history sub-attributes (e.g., `InAlarm`, `Priority`) under the parent variable they describe.
2. If no matching direct attribute exists, a new `BaseObjectState` node is created with NodeId `ns=1;s={TagName}.{PrimitiveName}`, and the primitive's attributes are added under it.
This structure means that browsing `TestMachine_001/SomeAlarmAttr` reveals both the process value and its alarm sub-attributes (`InAlarm`, `Priority`, `DescAttrName`) as children.
## NodeId Scheme
All node identifiers use string-based NodeIds in namespace index 1 (`ns=1`):
All nodes live in the driver's namespace (not a shared `ns=1`). Browse paths are driver-defined:
| Node type | NodeId format | Example |
|-----------|---------------|---------|
| Root folder | `ns=1;s=ZB` | `ns=1;s=ZB` |
| Area folder | `ns=1;s={tag_name}` | `ns=1;s=Area_001` |
| Object node | `ns=1;s={tag_name}` | `ns=1;s=TestMachine_001` |
| Scalar variable | `ns=1;s={tag_name}.{attr}` | `ns=1;s=TestMachine_001.MachineID` |
| Array variable | `ns=1;s={tag_name}.{attr}` | `ns=1;s=MESReceiver_001.MoveInPartNumbers` |
| Primitive sub-object | `ns=1;s={tag_name}.{prim}` | `ns=1;s=TestMachine_001.AlarmPrim` |
|---|---|---|
| Driver root | `ns;s={DriverInstanceId}` | `urn:OtOpcUa:galaxy-01;s=galaxy-01` |
| Folder | `ns;s={parent}/{browseName}` | `ns;s=galaxy-01/Area_001` |
| Variable | `ns;s={DriverAttributeInfo.FullName}` | `ns;s=DelmiaReceiver_001.DownloadPath` |
| Alarm condition | `ns;s={FullReference}.Condition` | `ns;s=DelmiaReceiver_001.Temperature.Condition` |
For array attributes, the `[]` suffix present in `full_tag_reference` is stripped from the NodeId. The `full_tag_reference` (with `[]`) is kept internally for MXAccess subscription addressing. This means `MESReceiver_001.MoveInPartNumbers[]` in the Galaxy maps to NodeId `ns=1;s=MESReceiver_001.MoveInPartNumbers`.
For Galaxy the `FullName` stays in the legacy `tag_name.AttributeName` format; Modbus uses `unit:register:type`; AB CIP uses the native `program:tag.member` path; etc. — the shape is the driver's choice.
## Topological Sort
## Per-driver hierarchy examples
The hierarchy query returns objects ordered by `parent_gobject_id, tag_name`, but this does not guarantee that a parent appears before all of its children in all cases. `LmxNodeManager.TopologicalSort` performs a depth-first traversal to produce a list where every parent is guaranteed to precede its children. This allows the build loop to look up parent nodes from `_nodeMap` without forward references.
- **Galaxy Proxy**: walks the DB-snapshot hierarchy (`GalaxyProxyDriver.DiscoverAsync`), streams Area objects as folders and non-area objects as variable-bearing folders, marks `IsAlarm = true` on attributes that have an `AlarmExtension` primitive. The v1 two-pass primitive-grouping logic is retained inside the Galaxy driver.
- **Modbus**: streams one folder per device, one variable per register range from `ModbusDriverOptions`. No alarm surface.
- **AB CIP**: uses `AbCipTemplateCache` to enumerate user-defined types, streams a folder per program with variables keyed on the native tag path.
- **OPC UA Client**: re-exposes a remote server's address space — browses the upstream and relays nodes through the builder.
## Platform Scope Filtering
See `docs/v2/driver-specs.md` for the per-driver discovery contracts.
When `GalaxyRepository.Scope` is set to `LocalPlatform`, the hierarchy and attributes passed to `BuildAddressSpace` are pre-filtered by `PlatformScopeFilter` inside `GalaxyRepositoryService`. The node manager receives only the local platform's objects and their ancestor areas, so the resulting browse tree is a subset of the full Galaxy. The filtering is transparent to `LmxNodeManager` — it builds nodes from whatever data it receives.
## Rediscovery
Clients browsing a `LocalPlatform`-scoped server will see only the areas and objects hosted by that platform. Areas that exist in the Galaxy but contain no local descendants are excluded. See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter) for the filtering algorithm and configuration.
## Incremental Sync
On address space rebuild (triggered by a Galaxy deploy change), `SyncAddressSpace` uses `AddressSpaceDiff` to identify which `gobject_id` values have changed between the old and new snapshots. Only the affected subtrees are torn down and rebuilt, preserving unchanged nodes and their active subscriptions. Affected subscriptions are snapshot before teardown and replayed after rebuild.
If no previous state is cached (first build), the full `BuildAddressSpace` path runs instead.
Drivers that implement `IRediscoverable` fire `OnRediscoveryNeeded` when their backend signals a change (Galaxy: `time_of_last_deploy` advance; TwinCAT: symbol-version-changed; OPC UA Client: server namespace change). Core re-runs `DiscoverAsync` and diffs — see `docs/IncrementalSync.md`. Static drivers (Modbus, S7) don't implement `IRediscoverable`; their address space only changes when a new generation is published from the Config DB.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/LmxNodeManager.cs` -- Node manager with `BuildAddressSpace`, `SyncAddressSpace`, and `TopologicalSort`
- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/AddressSpaceBuilder.cs` -- Testable in-memory model builder
- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs` — orchestration + `CapturingBuilder`
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — OPC UA materialization (`IAddressSpaceBuilder` impl + `NestedBuilder`)
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAddressSpaceBuilder.cs` — builder contract
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ITagDiscovery.cs` — driver discovery capability
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverAttributeInfo.cs` — per-attribute descriptor

View File

@@ -1,234 +1,76 @@
# Alarm Tracking
`LmxNodeManager` generates OPC UA alarm conditions from Galaxy attributes marked as alarms. The system detects alarm-capable attributes during address space construction, creates `AlarmConditionState` nodes, auto-subscribes to the runtime alarm tags via MXAccess, and reports state transitions as OPC UA events.
Alarm surfacing is an optional driver capability exposed via `IAlarmSource` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs`). Drivers whose backends have an alarm concept implement it — today: Galaxy (MXAccess alarms), FOCAS (CNC alarms), OPC UA Client (A&C events from the upstream server). Modbus / S7 / AB CIP / AB Legacy / TwinCAT do not implement the interface and the feature is simply absent from their subtrees.
## AlarmInfo Structure
Each tracked alarm is represented by an `AlarmInfo` instance stored in the `_alarmInAlarmTags` dictionary, keyed by the `InAlarm` tag reference:
## IAlarmSource surface
```csharp
private sealed class AlarmInfo
{
public string SourceTagReference { get; set; } // e.g., "Tag_001.Temperature"
public NodeId SourceNodeId { get; set; }
public string SourceName { get; set; } // attribute name for event messages
public bool LastInAlarm { get; set; } // tracks previous state for edge detection
public AlarmConditionState? ConditionNode { get; set; }
public string PriorityTagReference { get; set; } // e.g., "Tag_001.Temperature.Priority"
public string DescAttrNameTagReference { get; set; } // e.g., "Tag_001.Temperature.DescAttrName"
public ushort CachedSeverity { get; set; }
public string CachedMessage { get; set; }
}
Task<IAlarmSubscriptionHandle> SubscribeAlarmsAsync(
IReadOnlyList<string> sourceNodeIds, CancellationToken cancellationToken);
Task UnsubscribeAlarmsAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken);
Task AcknowledgeAsync(IReadOnlyList<AlarmAcknowledgeRequest> acknowledgements,
CancellationToken cancellationToken);
event EventHandler<AlarmEventArgs>? OnAlarmEvent;
```
`LastInAlarm` enables edge detection so only actual transitions (inactive-to-active or active-to-inactive) generate events, not repeated identical values.
The driver fires `OnAlarmEvent` for every transition (`Active`, `Acknowledged`, `Inactive`) with an `AlarmEventArgs` carrying the source node id, condition id, alarm type, message, severity (`AlarmSeverity` enum), and source timestamp.
## Alarm Detection via is_alarm Flag
## AlarmSurfaceInvoker
During `BuildAddressSpace` (and `BuildSubtree` for incremental sync), the node manager scans each non-area Galaxy object for attributes where `IsAlarm == true` and `PrimitiveName` is empty (direct attributes only, not primitive children):
`AlarmSurfaceInvoker` (`src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs`) wraps the three mutating surfaces through `CapabilityInvoker`:
```csharp
var alarmAttrs = objAttrs.Where(a => a.IsAlarm && string.IsNullOrEmpty(a.PrimitiveName)).ToList();
```
- `SubscribeAlarmsAsync` / `UnsubscribeAlarmsAsync` run through the `DriverCapability.AlarmSubscribe` pipeline — retries apply under the tier configuration.
- `AcknowledgeAsync` runs through `DriverCapability.AlarmAcknowledge` which does NOT retry per decision #143. A timed-out ack may have already registered at the plant floor; replay would silently double-acknowledge.
The `IsAlarm` flag originates from the `AlarmExtension` primitive in the Galaxy repository database. When a Galaxy attribute has an associated `AlarmExtension` primitive, the SQL query sets `is_alarm = 1` on the corresponding `GalaxyAttributeInfo`.
Multi-host fan-out: when the driver implements `IPerCallHostResolver`, each source node id is resolved individually and batches are grouped by host so a dead PLC inside a multi-device driver doesn't poison sibling breakers. Single-host drivers fall back to `IDriver.DriverInstanceId` as the pipeline-key host.
For each alarm attribute, the code verifies that a corresponding `InAlarm` sub-attribute variable node exists in `_tagToVariableNode` (constructed from `FullTagReference + ".InAlarm"`). If the variable node is missing, the alarm is skipped -- this prevents creating orphaned alarm conditions for attributes whose extension primitives were not published.
## Condition-node creation via CapturingBuilder
## Template-Based Alarm Object Filter
Alarm-condition nodes are materialized at address-space build time. During `GenericDriverNodeManager.BuildAddressSpaceAsync` the builder is wrapped in a `CapturingBuilder` that observes every `Variable()` call. When a driver calls `IVariableHandle.MarkAsAlarmCondition(AlarmConditionInfo)` on a returned handle, the server-side `DriverNodeManager.VariableHandle` creates a sibling `AlarmConditionState` node and returns an `IAlarmConditionSink`. The wrapper stores the sink in `_alarmSinks` keyed by the variable's full reference, then `GenericDriverNodeManager` registers a forwarder on `IAlarmSource.OnAlarmEvent` that routes each push to the matching sink by `SourceNodeId`. Unknown source ids are dropped silently — they may belong to another driver.
When large galaxies contain more alarm-bearing objects than clients need, `OpcUa.AlarmFilter.ObjectFilters` restricts alarm condition creation to a subset of objects selected by **template name pattern**. The filter is applied at both alarm creation sites -- the full build in `BuildAddressSpace` and the subtree rebuild path triggered by Galaxy redeployment -- so the included set is recomputed on every rebuild against the fresh hierarchy.
The `AlarmConditionState` layout matches OPC UA Part 9:
### Matching rules
- `SourceNode` → the originating variable
- `SourceName` / `ConditionName` → from `AlarmConditionInfo.SourceName`
- Initial state: enabled, inactive, acknowledged, severity per `InitialSeverity`, retain false
- `HasCondition` references wire the source variable ↔ the condition node bidirectionally
- `*` is the only wildcard (glob-style, zero or more characters). All other regex metacharacters are escaped and matched literally.
- Matching is case-insensitive.
- The leading `$` used by Galaxy template `tag_name` values is normalized away on both the stored chain entry and the operator pattern, so `TestMachine*` matches the stored `$TestMachine`.
- Each configured entry may itself be comma-separated for operator convenience (`"TestMachine*, Pump_*"`).
- An empty list disables the filter and restores the prior behavior: every alarm-bearing object is tracked when `AlarmTrackingEnabled=true`.
Drivers flag alarm-bearing variables at discovery time via `DriverAttributeInfo.IsAlarm = true`. The Galaxy driver, for example, sets this on attributes that have an `AlarmExtension` primitive in the Galaxy repository DB; FOCAS sets it on the CNC alarm register.
### What gets included
## State transitions
Every Galaxy object whose **template derivation chain** contains any template matching any pattern is included. The chain walks `gobject.derived_from_gobject_id` from the instance through its immediate template and each ancestor template, up to `$Object`. An instance of `TestCoolMachine` whose chain is `$TestCoolMachine -> $TestMachine -> $UserDefined` matches the pattern `TestMachine` via the ancestor hit.
`ConditionSink.OnTransition` runs under the node manager's `Lock` and maps the `AlarmEventArgs.AlarmType` string to Part 9 state:
Inclusion propagates down the **containment hierarchy**: if an object matches, all of its descendants are included as well, regardless of their own template chains. This lets operators target a parent and pick up all its alarm-bearing children with one pattern.
| AlarmType | Action |
|---|---|
| `Active` | `SetActiveState(true)`, `SetAcknowledgedState(false)`, `Retain = true` |
| `Acknowledged` | `SetAcknowledgedState(true)` |
| `Inactive` | `SetActiveState(false)`; `Retain = false` once both inactive and acknowledged |
Each object is evaluated exactly once. Overlapping matches (multiple patterns hit, or both an ancestor and descendant match independently) never produce duplicate alarm condition subscriptions -- the filter operates on object identity via a `HashSet<int>` of included `GobjectId` values.
Severity is remapped: `AlarmSeverity.Low/Medium/High/Critical` → OPC UA numeric 250 / 500 / 700 / 900. `Message.Value` is set from `AlarmEventArgs.Message` on every transition. `ClearChangeMasks(true)` and `ReportEvent(condition)` fire the OPC UA event notification for clients subscribed to any ancestor notifier.
### Resolution algorithm
## Acknowledge dispatch
`AlarmObjectFilter.ResolveIncludedObjects(hierarchy)` runs once per build:
Alarm acknowledgement initiated by an OPC UA client flows:
1. Compile each pattern into a regex with `IgnoreCase | CultureInvariant | Compiled`.
2. Build a `parent -> children` map from the hierarchy. Orphans (parent id not in the hierarchy) are treated as roots.
3. BFS from each root with a `(nodeId, parentIncluded)` queue and a `visited` set for cycle defense.
4. At each node: if the parent was included OR any chain entry matches any pattern, add the node and mark its subtree as included.
5. Return the `HashSet<int>` of included object IDs. When no patterns are configured the filter is disabled and the method returns `null`, which the alarm loop treats as "no filtering".
1. The SDK invokes the `AlarmConditionState.OnAcknowledge` method delegate.
2. The handler checks the session's roles for `AlarmAck` — drivers never see a request the session wasn't entitled to make.
3. `AlarmSurfaceInvoker.AcknowledgeAsync` is called with the source / condition / comment tuple. The invoker groups by host and runs each batch through the no-retry `AlarmAcknowledge` pipeline.
After each resolution, `UnmatchedPatterns` exposes any raw pattern that matched zero objects so the startup log can warn about operator typos without failing startup.
Drivers return normally for success or throw to signal the ack failed at the backend.
### How the alarm loop applies the filter
## EventNotifier propagation
```csharp
// LmxNodeManager.BuildAddressSpace (and the subtree rebuild path)
if (_alarmTrackingEnabled)
{
var includedIds = ResolveAlarmFilterIncludedIds(sorted); // null if no filter
foreach (var obj in sorted)
{
if (obj.IsArea) continue;
if (includedIds != null && !includedIds.Contains(obj.GobjectId)) continue;
// ... existing alarm-attribute collection + AlarmConditionState creation
}
}
```
Drivers that want hierarchical alarm subscriptions propagate `EventNotifier.SubscribeToEvents` up the containment chain during discovery — the Galaxy driver flips the flag on every ancestor of an alarm-bearing object up to the driver root, mirroring v1 behavior. Clients subscribed at the driver root, a mid-level folder, or the `Objects/` root see alarm events from every descendant with an `AlarmConditionState` sibling. The driver-root `FolderState` is created in `DriverNodeManager.CreateAddressSpace` with `EventNotifier = SubscribeToEvents | HistoryRead` so alarm event subscriptions and alarm history both have a single natural target.
`ResolveAlarmFilterIncludedIds` also emits a one-line summary (`Alarm filter: X of Y objects included (Z pattern(s))`) and per-pattern warnings for patterns that matched nothing. The included count is published to the dashboard via `AlarmFilterIncludedObjectCount`.
## ConditionRefresh
### Runtime telemetry
The OPC UA `ConditionRefresh` service queues the current state of every retained condition back to the requesting monitored items. `DriverNodeManager` iterates the node manager's `AlarmConditionState` collection and queues each condition whose `Retain.Value == true` — matching the Part 9 requirement.
`LmxNodeManager` exposes three read-only properties populated by the filter:
## Key source files
- `AlarmFilterEnabled` -- true when patterns are configured.
- `AlarmFilterPatternCount` -- number of compiled patterns.
- `AlarmFilterIncludedObjectCount` -- number of objects in the most recent included set.
`StatusReportService` reads these into `AlarmStatusInfo.FilterEnabled`, `FilterPatternCount`, and `FilterIncludedObjectCount`. The Alarms panel on the dashboard renders `Filter: N pattern(s), M object(s) included` only when the filter is enabled. See [Status Dashboard](StatusDashboard.md#alarms).
### Validator warning
`ConfigurationValidator.ValidateAndLog()` logs the effective filter at startup and emits a `Warning` if `AlarmFilter.ObjectFilters` is non-empty while `AlarmTrackingEnabled` is `false`, because the filter would have no effect.
## AlarmConditionState Creation
Each detected alarm attribute produces an `AlarmConditionState` node:
```csharp
var condition = new AlarmConditionState(sourceVariable);
condition.Create(SystemContext, conditionNodeId,
new QualifiedName(alarmAttr.AttributeName + "Alarm", NamespaceIndex),
new LocalizedText("en", alarmAttr.AttributeName + " Alarm"),
true);
```
Key configuration on the condition node:
- **SourceNode** -- Set to the OPC UA NodeId of the source variable, linking the condition to the attribute that triggered it.
- **SourceName / ConditionName** -- Set to the Galaxy attribute name for identification in event notifications.
- **AutoReportStateChanges** -- Set to `true` so the OPC UA framework automatically generates event notifications when condition properties change.
- **Initial state** -- Enabled, inactive, acknowledged, severity Medium, retain false.
- **HasCondition references** -- Bidirectional references are added between the source variable and the condition node.
The condition's `OnReportEvent` callback forwards events to `Server.ReportEvent` so they reach clients subscribed at the server level.
### Condition Methods
Each alarm condition supports the following OPC UA Part 9 methods:
- **Acknowledge** (`OnAcknowledge`) -- Writes the acknowledgment message to the Galaxy `AckMsg` tag. Requires the `AlarmAck` role.
- **Confirm** (`OnConfirm`) -- Confirms a previously acknowledged alarm. The SDK manages the `ConfirmedState` transition.
- **AddComment** (`OnAddComment`) -- Attaches an operator comment to the condition for audit trail purposes.
- **Enable / Disable** (`OnEnableDisable`) -- Activates or deactivates alarm monitoring for the specific condition. The SDK manages the `EnabledState` transition.
- **Shelve** (`OnShelve`) -- Supports `TimedShelve`, `OneShotShelve`, and `Unshelve` operations. The SDK manages the `ShelvedStateMachineType` state transitions including automatic timed unshelve.
- **TimedUnshelve** (`OnTimedUnshelve`) -- Automatically called by the SDK when a timed shelve period expires.
### Event Fields
Alarm events include the following fields:
- `EventId` -- Unique GUID for each event, used as reference for Acknowledge/Confirm
- `ActiveState`, `AckedState`, `ConfirmedState` -- State transitions
- `Message` -- Alarm message from Galaxy `DescAttrName` or default text
- `Severity` -- Galaxy Priority clamped to OPC UA range 1-1000
- `Retain` -- True while alarm is active or unacknowledged
- `LocalTime` -- Server timezone offset with daylight saving flag
- `Quality` -- Set to Good for alarm events
## Auto-subscription to Alarm Tags
After alarm condition nodes are created, `SubscribeAlarmTags` opens MXAccess subscriptions for three tags per alarm:
1. **InAlarm** (`Tag_001.Temperature.InAlarm`) -- The boolean trigger for alarm activation/deactivation.
2. **Priority** (`Tag_001.Temperature.Priority`) -- Numeric priority that maps to OPC UA severity.
3. **DescAttrName** (`Tag_001.Temperature.DescAttrName`) -- String description used as the alarm event message.
These subscriptions are opened unconditionally (not ref-counted) because they serve the server's own alarm tracking, not client-initiated monitoring. Tags that do not have corresponding variable nodes in `_tagToVariableNode` are skipped.
## EventNotifier Propagation
When a Galaxy object contains at least one alarm attribute, `EventNotifiers.SubscribeToEvents` is set on the object node **and all its ancestors** up to the root. This allows OPC UA clients to subscribe to events at any level in the hierarchy and receive alarm notifications from all descendants:
```csharp
if (hasAlarms && _nodeMap.TryGetValue(obj.GobjectId, out var objNode))
EnableEventNotifierUpChain(objNode);
```
For example, an alarm on `TestMachine_001.SubObject.Temperature` will be visible to clients subscribed on `SubObject`, `TestMachine_001`, or the root `ZB` folder. The root `ZB` folder also has `EventNotifiers.SubscribeToEvents` enabled during initial construction.
## InAlarm Transition Detection in DispatchLoop
Alarm state changes are detected in the dispatch loop's Phase 1 (outside `Lock`), which runs on the background dispatch thread rather than the STA thread. This placement is intentional because the detection logic reads Priority and DescAttrName values from MXAccess, which would block the STA thread if done inside the `OnMxAccessDataChange` callback.
For each pending data change, the loop checks whether the address matches a key in `_alarmInAlarmTags`:
```csharp
if (_alarmInAlarmTags.TryGetValue(address, out var alarmInfo))
{
var newInAlarm = vtq.Value is true || vtq.Value is 1
|| (vtq.Value is int intVal && intVal != 0);
if (newInAlarm != alarmInfo.LastInAlarm)
{
alarmInfo.LastInAlarm = newInAlarm;
// Read Priority and DescAttrName via MXAccess (outside Lock)
...
pendingAlarmEvents.Add((alarmInfo, newInAlarm));
}
}
```
The boolean coercion handles multiple value representations: `true`, integer `1`, or any non-zero integer. When the value changes state, Priority and DescAttrName are read synchronously from MXAccess to populate `CachedSeverity` and `CachedMessage`. These reads happen outside `Lock` because they call into the STA thread.
Priority values are clamped to the OPC UA severity range (1-1000). Both `int` and `short` types are handled.
## ReportAlarmEvent
`ReportAlarmEvent` runs inside `Lock` during Phase 2 of the dispatch loop. It updates the `AlarmConditionState` and generates an OPC UA event:
```csharp
condition.SetActiveState(SystemContext, active);
condition.Message.Value = new LocalizedText("en", message);
condition.SetSeverity(SystemContext, (EventSeverity)severity);
condition.Retain.Value = active || (condition.AckedState?.Id?.Value == false);
```
Key behaviors:
- **Active state** -- Set to `true` on activation, `false` on clearing.
- **Message** -- Uses `CachedMessage` (from DescAttrName) when available on activation. Falls back to a generated `"Alarm active: {SourceName}"` string. Cleared alarms always use `"Alarm cleared: {SourceName}"`.
- **Severity** -- Set from `CachedSeverity`, which was read from the Priority tag.
- **Retain** -- `true` while the alarm is active or unacknowledged. This keeps the condition visible in condition refresh responses.
- **Acknowledged state** -- Reset to `false` when the alarm activates, requiring explicit client acknowledgment. When role-based auth is active, alarm acknowledgment requires the `AlarmAck` role on the session (checked via `GrantedRoleIds`). Users without this role receive `BadUserAccessDenied`.
The event is reported by walking up the notifier chain from the source variable's parent through all ancestor nodes. Each ancestor with `EventNotifier` set receives the event via `ReportEvent`, so clients subscribed at any level in the Galaxy hierarchy see alarm transitions from descendant objects.
## Condition Refresh Override
The `ConditionRefresh` override iterates all tracked alarms and queues retained conditions to the requesting monitored items:
```csharp
public override ServiceResult ConditionRefresh(OperationContext context,
IList<IEventMonitoredItem> monitoredItems)
{
foreach (var kvp in _alarmInAlarmTags)
{
var info = kvp.Value;
if (info.ConditionNode == null || info.ConditionNode.Retain?.Value != true)
continue;
foreach (var item in monitoredItems)
item.QueueEvent(info.ConditionNode);
}
return ServiceResult.Good;
}
```
Only conditions where `Retain.Value == true` are included. This means only active or unacknowledged alarms appear in condition refresh responses, matching the OPC UA specification requirement that condition refresh returns the current state of all retained conditions.
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs` — capability contract + `AlarmEventArgs`
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs` — per-host fan-out + no-retry ack
- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs``CapturingBuilder` + alarm forwarder
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs``VariableHandle.MarkAsAlarmCondition` + `ConditionSink`
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Alarms/GalaxyAlarmTracker.cs` — Galaxy-specific alarm-event production

View File

@@ -2,9 +2,9 @@
## Overview
`ZB.MOM.WW.OtOpcUa.Client.CLI` is a cross-platform command-line client for the LmxOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations. Commands are routed and parsed by [CliFx](https://github.com/Tyrrrz/CliFx).
`ZB.MOM.WW.OtOpcUa.Client.CLI` is a cross-platform command-line client for the OtOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations. Commands are routed and parsed by [CliFx](https://github.com/Tyrrrz/CliFx).
The CLI is the primary tool for operators and developers to test and interact with the server from a terminal. It supports all core operations: connectivity testing, browsing, reading, writing, subscriptions, alarm monitoring, history reads, and redundancy queries.
The CLI is the primary tool for operators and developers to test and interact with the server from a terminal. It supports all core operations: connectivity testing, browsing, reading, writing, subscriptions, alarm monitoring, history reads, and redundancy queries. Any driver surface exposed by the server (Galaxy, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, OPC UA Client) is reachable through these commands — the CLI is driver-agnostic because everything below the OPC UA endpoint is.
## Build and Run
@@ -14,7 +14,7 @@ dotnet build
dotnet run -- <command> [options]
```
The executable name is `lmxopcua-cli`.
The executable name is still `lmxopcua-cli` — a residual from the pre-v2 rename (`Program.cs:SetExecutableName`). Scripts + operator muscle memory depend on the name; flipping it to `otopcua-cli` is a follow-up that also needs to move the client-side PKI store folder (<code>{LocalAppData}/LmxOpcUaClient/pki/</code> — used by the shared client for its application certificate) so trust relationships survive the rename.
## Architecture
@@ -54,7 +54,7 @@ lmxopcua-cli write -u opc.tcp://localhost:4840 -n "ns=2;s=MyNode" -v 42 -U opera
When `-F` is provided, the shared service tries the primary URL first, then each failover URL in order. For long-running commands (`subscribe`, `alarms`), the service monitors the session via keep-alive and automatically reconnects to the next available server on failure.
```bash
lmxopcua-cli connect -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa
lmxopcua-cli connect -u opc.tcp://localhost:4840/OtOpcUa -F opc.tcp://localhost:4841/OtOpcUa
```
### Transport Security
@@ -67,7 +67,7 @@ When `sign` or `encrypt` is specified, the shared service:
4. Fails with a clear error if no matching endpoint is found
```bash
lmxopcua-cli browse -u opc.tcp://localhost:4840/LmxOpcUa -S encrypt -U admin -P secret -r -d 2
lmxopcua-cli browse -u opc.tcp://localhost:4840/OtOpcUa -S encrypt -U admin -P secret -r -d 2
```
### Verbose Logging
@@ -81,14 +81,14 @@ The `--verbose` flag switches Serilog output from `Warning` to `Debug` level, sh
Tests connectivity to an OPC UA server. Creates a session, prints connection metadata, and disconnects.
```bash
lmxopcua-cli connect -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123
lmxopcua-cli connect -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123
```
Output:
```text
Connected to: opc.tcp://localhost:4840/LmxOpcUa
Server: LmxOpcUa
Connected to: opc.tcp://localhost:4840/OtOpcUa
Server: OtOpcUa Server
Security Mode: None
Security Policy: http://opcfoundation.org/UA/SecurityPolicy#None
Connection successful.
@@ -99,7 +99,7 @@ Connection successful.
Reads the current value of a single node and prints the value, status code, and timestamps.
```bash
lmxopcua-cli read -u opc.tcp://localhost:4840/LmxOpcUa -n "ns=3;s=DEV.ScanState" -U admin -P admin123
lmxopcua-cli read -u opc.tcp://localhost:4840/OtOpcUa -n "ns=3;s=DEV.ScanState" -U admin -P admin123
```
| Flag | Description |
@@ -135,10 +135,10 @@ Browses the OPC UA address space starting from the Objects folder or a specified
```bash
# Browse top-level Objects folder
lmxopcua-cli browse -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123
lmxopcua-cli browse -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123
# Browse a specific node recursively to depth 3
lmxopcua-cli browse -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123 -r -d 3 -n "ns=3;s=ZB"
lmxopcua-cli browse -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123 -r -d 3 -n "ns=3;s=ZB"
```
| Flag | Description |
@@ -166,12 +166,12 @@ Reads historical data from a node. Supports raw history reads and aggregate (pro
```bash
# Raw history
lmxopcua-cli historyread -u opc.tcp://localhost:4840/LmxOpcUa \
lmxopcua-cli historyread -u opc.tcp://localhost:4840/OtOpcUa \
-n "ns=1;s=TestMachine_001.TestHistoryValue" \
--start "2026-03-25" --end "2026-03-30"
# Aggregate: 1-hour average
lmxopcua-cli historyread -u opc.tcp://localhost:4840/LmxOpcUa \
lmxopcua-cli historyread -u opc.tcp://localhost:4840/OtOpcUa \
-n "ns=1;s=TestMachine_001.TestHistoryValue" \
--start "2026-03-25" --end "2026-03-30" \
--aggregate Average --interval 3600000
@@ -203,10 +203,10 @@ Subscribes to alarm events on a node. Prints structured alarm output including s
```bash
# Subscribe to alarm events on the Server node
lmxopcua-cli alarms -u opc.tcp://localhost:4840/LmxOpcUa
lmxopcua-cli alarms -u opc.tcp://localhost:4840/OtOpcUa
# Subscribe to a specific source node with condition refresh
lmxopcua-cli alarms -u opc.tcp://localhost:4840/LmxOpcUa \
lmxopcua-cli alarms -u opc.tcp://localhost:4840/OtOpcUa \
-n "ns=1;s=TestMachine_001" --refresh
```
@@ -221,7 +221,7 @@ lmxopcua-cli alarms -u opc.tcp://localhost:4840/LmxOpcUa \
Reads the OPC UA redundancy state from a server: redundancy mode, service level, server URIs, and application URI.
```bash
lmxopcua-cli redundancy -u opc.tcp://localhost:4840/LmxOpcUa -U admin -P admin123
lmxopcua-cli redundancy -u opc.tcp://localhost:4840/OtOpcUa -U admin -P admin123
```
Example output:
@@ -230,9 +230,9 @@ Example output:
Redundancy Mode: Warm
Service Level: 200
Server URIs:
- urn:localhost:LmxOpcUa:instance1
- urn:localhost:LmxOpcUa:instance2
Application URI: urn:localhost:LmxOpcUa:instance1
- urn:localhost:OtOpcUa:instance1
- urn:localhost:OtOpcUa:instance2
Application URI: urn:localhost:OtOpcUa:instance1
```
## Testing

View File

@@ -2,7 +2,7 @@
## Overview
`ZB.MOM.WW.OtOpcUa.Client.UI` is a cross-platform Avalonia desktop application for connecting to and interacting with the LmxOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations.
`ZB.MOM.WW.OtOpcUa.Client.UI` is a cross-platform Avalonia desktop application for connecting to and interacting with the OtOpcUa OPC UA server. It targets .NET 10 and uses the shared `IOpcUaClientService` from `Client.Shared` for all OPC UA operations.
The UI provides a single-window interface for browsing the address space, reading and writing values, monitoring live subscriptions, managing alarms, and querying historical data.
@@ -43,7 +43,7 @@ The application uses a single-window layout with five main areas:
│ │ │ ││
│ (lazy-load) │ └──────────────────────────────────────────────┘│
├──────────────┴──────────────────────────────────────────────┤
│ Connected to opc.tcp://... | LmxOpcUa | Session: ... | 3 subs│
│ Connected to opc.tcp://... | OtOpcUa Server | Session: ... | 3 subs│
└─────────────────────────────────────────────────────────────┘
```
@@ -55,7 +55,7 @@ The top bar provides the endpoint URL, Connect, and Disconnect buttons. The **Co
| Setting | Description |
|---------|-------------|
| Endpoint URL | OPC UA server endpoint (e.g., `opc.tcp://localhost:4840/LmxOpcUa`) |
| Endpoint URL | OPC UA server endpoint (e.g., `opc.tcp://localhost:4840/OtOpcUa`) |
| Username / Password | Credentials for `UserName` token authentication |
| Security Mode | Transport security: None, Sign, SignAndEncrypt |
| Failover URLs | Comma-separated backup endpoints for redundancy failover |
@@ -65,7 +65,7 @@ The top bar provides the endpoint URL, Connect, and Disconnect buttons. The **Co
### Settings Persistence
Connection settings are saved to `{LocalAppData}/LmxOpcUaClient/settings.json` after each successful connection and on window close. The settings are reloaded on next launch, including:
Connection settings are saved to `{LocalAppData}/LmxOpcUaClient/settings.json` after each successful connection and on window close. The folder name is a residual from the pre-v2 rename (the `Client.Shared` session factory still calls itself `LmxOpcUaClient` at `OpcUaClientService.cs:428`); renaming to `OtOpcUaClient` is a follow-up that needs a migration shim so existing users don't lose their settings on upgrade. The settings are reloaded on next launch, including:
- All connection parameters
- Active subscription node IDs (restored after reconnection)

View File

@@ -1,370 +1,141 @@
# Configuration
## Overview
## Two-layer model
The service loads configuration from `appsettings.json` at startup using the Microsoft.Extensions.Configuration stack. `AppConfiguration` is the root holder class that aggregates typed sections: `OpcUa`, `MxAccess`, `GalaxyRepository`, `Dashboard`, `Historian`, `Authentication`, and `Security`. Each section binds to a dedicated POCO class with sensible defaults, so the service runs with zero configuration on a standard deployment.
OtOpcUa configuration is split into two layers:
## Config Binding Pattern
| Layer | Where | Scope | Edited by |
|---|---|---|---|
| **Bootstrap** | `appsettings.json` per process | Enough to start the process and reach the Config DB | Local file edit + process restart |
| **Authoritative config** | Config DB (SQL Server) via `OtOpcUaConfigDbContext` | Clusters, namespaces, UNS hierarchy, equipment, tags, driver instances, ACLs, role grants, poll groups | Admin UI draft/publish workflow |
The production constructor in `OpcUaService` builds the configuration pipeline and binds each JSON section to its typed class:
The rule: if the setting describes *how the process connects to the rest of the world* (Config DB connection string, LDAP bind, transport security profile, node identity, logging), it lives in `appsettings.json`. If it describes *what the fleet does* (clusters, drivers, tags, UNS, ACLs), it lives in the Config DB and is edited through the Admin UI.
```csharp
var configuration = new ConfigurationBuilder()
.AddJsonFile("appsettings.json", optional: false)
.AddJsonFile($"appsettings.{Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT") ?? "Production"}.json", optional: true)
.AddEnvironmentVariables()
.Build();
---
_config = new AppConfiguration();
configuration.GetSection("OpcUa").Bind(_config.OpcUa);
configuration.GetSection("MxAccess").Bind(_config.MxAccess);
configuration.GetSection("GalaxyRepository").Bind(_config.GalaxyRepository);
configuration.GetSection("Dashboard").Bind(_config.Dashboard);
configuration.GetSection("Historian").Bind(_config.Historian);
configuration.GetSection("Authentication").Bind(_config.Authentication);
configuration.GetSection("Security").Bind(_config.Security);
```
## Bootstrap configuration (`appsettings.json`)
This pattern uses `IConfiguration.GetSection().Bind()` rather than `IOptions<T>` because the service targets .NET Framework 4.8, where the full dependency injection container is not used.
Each of the three processes (Server, Admin, Galaxy.Host) reads its own `appsettings.json` plus environment overrides.
## Environment-Specific Overrides
### OtOpcUa Server — `src/ZB.MOM.WW.OtOpcUa.Server/appsettings.json`
The configuration pipeline supports three layers of override, applied in order:
Bootstrap-only. `Program.cs` reads four top-level sections:
1. `appsettings.json` -- base configuration (required)
2. `appsettings.{DOTNET_ENVIRONMENT}.json` -- environment-specific overlay (optional)
3. Environment variables -- highest priority, useful for deployment automation
| Section | Keys | Purpose |
|---|---|---|
| `Node` | `NodeId`, `ClusterId`, `ConfigDbConnectionString`, `LocalCachePath` | Identity + path to the Config DB + LiteDB offline cache path. |
| `OpcUaServer` | `EndpointUrl`, `ApplicationName`, `ApplicationUri`, `PkiStoreRoot`, `AutoAcceptUntrustedClientCertificates`, `SecurityProfile` | OPC UA endpoint + transport security. See [`security.md`](security.md). |
| `OpcUaServer:Ldap` | `Enabled`, `Server`, `Port`, `UseTls`, `AllowInsecureLdap`, `SearchBase`, `ServiceAccountDn`, `ServiceAccountPassword`, `GroupToRole`, `UserNameAttribute`, `GroupAttribute` | LDAP auth for OPC UA UserName tokens. See [`security.md`](security.md). |
| `Serilog` | Standard Serilog keys + `WriteJson` bool | Logging verbosity + optional JSON file sink for SIEM ingest. |
| `Authorization` | `StrictMode` (bool) | Flip `true` to fail-closed on sessions lacking LDAP group metadata. Default false during ACL rollouts. |
| `Metrics:Prometheus:Enabled` | bool | Toggles the `/metrics` endpoint. |
Set the `DOTNET_ENVIRONMENT` variable to load a named overlay file. For example, setting `DOTNET_ENVIRONMENT=Staging` loads `appsettings.Staging.json` if it exists.
Environment variables follow the standard `Section__Property` naming convention. For example, `OpcUa__Port=5840` overrides the OPC UA port.
## Configuration Sections
### OpcUa
Controls the OPC UA server endpoint and session limits. Defined in `OpcUaConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `BindAddress` | `string` | `"0.0.0.0"` | IP address or hostname the server binds to. Use `0.0.0.0` for all interfaces, `localhost` for local-only, or a specific IP |
| `Port` | `int` | `4840` | TCP port the OPC UA server listens on |
| `EndpointPath` | `string` | `"/LmxOpcUa"` | Path appended to the host URI |
| `ServerName` | `string` | `"LmxOpcUa"` | Server name presented to OPC UA clients |
| `GalaxyName` | `string` | `"ZB"` | Galaxy name used as the OPC UA namespace |
| `MaxSessions` | `int` | `100` | Maximum simultaneous OPC UA sessions |
| `SessionTimeoutMinutes` | `int` | `30` | Idle session timeout in minutes |
| `AlarmTrackingEnabled` | `bool` | `false` | Enables `AlarmConditionState` nodes for alarm attributes |
| `AlarmFilter.ObjectFilters` | `List<string>` | `[]` | Wildcard template-name patterns (with `*`) that scope alarm tracking to matching objects and their descendants. Empty list disables filtering. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) |
| `ApplicationUri` | `string?` | `null` | Explicit application URI for this server instance. Required when redundancy is enabled. Defaults to `urn:{GalaxyName}:LmxOpcUa` when null |
### MxAccess
Controls the MXAccess runtime connection used for live tag reads and writes. Defined in `MxAccessConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `ClientName` | `string` | `"LmxOpcUa"` | Client name registered with MXAccess |
| `NodeName` | `string?` | `null` | Optional Galaxy node name to target |
| `GalaxyName` | `string?` | `null` | Optional Galaxy name for MXAccess reference resolution |
| `ReadTimeoutSeconds` | `int` | `5` | Maximum wait for a live tag read |
| `WriteTimeoutSeconds` | `int` | `5` | Maximum wait for a write acknowledgment |
| `MaxConcurrentOperations` | `int` | `10` | Cap on concurrent MXAccess operations |
| `MonitorIntervalSeconds` | `int` | `5` | Connectivity monitor probe interval |
| `AutoReconnect` | `bool` | `true` | Automatically re-establish dropped MXAccess sessions |
| `ProbeTag` | `string?` | `null` | Optional tag used to verify the runtime returns fresh data |
| `ProbeStaleThresholdSeconds` | `int` | `60` | Seconds a probe value may remain unchanged before the connection is considered stale |
| `RuntimeStatusProbesEnabled` | `bool` | `true` | Advises `<Host>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` to track per-host runtime state. Drives the Galaxy Runtime dashboard panel, HealthCheck Rule 2e, and the Read-path short-circuit that invalidates OPC UA variable quality when a host is Stopped. Set `false` to return to legacy behavior where host state is invisible and the bridge serves whatever quality MxAccess reports for individual tags. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) |
| `RuntimeStatusUnknownTimeoutSeconds` | `int` | `15` | Maximum seconds to wait for the initial probe callback before marking a host as Stopped. Only applies to the Unknown → Stopped transition; Running hosts never time out because `ScanState` is delivered on-change only. A value below 5s triggers a validator warning |
| `RequestTimeoutSeconds` | `int` | `30` | Outer safety timeout applied to sync-over-async MxAccess operations invoked from the OPC UA stack thread (Read, Write, address-space rebuild probe sync). Backstop for the inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds`. A timed-out operation returns `BadTimeout`. Validator rejects values < 1 and warns if set below the inner Read/Write timeouts. See [MXAccess Bridge](MxAccessBridge.md#request-timeout-safety-backstop). Stability review 2026-04-13 Finding 3 |
### GalaxyRepository
Controls the Galaxy repository database connection used to build the OPC UA address space. Defined in `GalaxyRepositoryConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `ConnectionString` | `string` | `"Server=localhost;Database=ZB;Integrated Security=true;"` | SQL Server connection string for the Galaxy database |
| `ChangeDetectionIntervalSeconds` | `int` | `30` | How often the service polls for Galaxy deploy changes |
| `CommandTimeoutSeconds` | `int` | `30` | SQL command timeout for repository queries |
| `ExtendedAttributes` | `bool` | `false` | Load extended Galaxy attribute metadata into the OPC UA model |
| `Scope` | `GalaxyScope` | `"Galaxy"` | Controls how much of the Galaxy hierarchy is loaded. `Galaxy` loads all deployed objects (default). `LocalPlatform` loads only objects hosted by the platform deployed on this machine. See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter) |
| `PlatformName` | `string?` | `null` | Explicit platform hostname for `LocalPlatform` filtering. When null, uses `Environment.MachineName`. Only used when `Scope` is `LocalPlatform` |
### Dashboard
Controls the embedded HTTP status dashboard. Defined in `DashboardConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Enabled` | `bool` | `true` | Whether the status dashboard is hosted |
| `Port` | `int` | `8081` | HTTP port for the dashboard endpoint |
| `RefreshIntervalSeconds` | `int` | `10` | HTML auto-refresh interval in seconds |
### Historian
Controls the Wonderware Historian SDK connection for OPC UA historical data access. Defined in `HistorianConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Enabled` | `bool` | `false` | Enables OPC UA historical data access |
| `ServerName` | `string` | `"localhost"` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments |
| `ServerNames` | `List<string>` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover. See [Historical Data Access](HistoricalDataAccess.md#read-only-cluster-failover) |
| `FailureCooldownSeconds` | `int` | `60` | How long a failed cluster node is skipped before being re-tried. Zero disables the cooldown |
| `IntegratedSecurity` | `bool` | `true` | Use Windows authentication |
| `UserName` | `string?` | `null` | Username when `IntegratedSecurity` is false |
| `Password` | `string?` | `null` | Password when `IntegratedSecurity` is false |
| `Port` | `int` | `32568` | Historian TCP port |
| `CommandTimeoutSeconds` | `int` | `30` | SDK packet timeout in seconds (inner async bound) |
| `RequestTimeoutSeconds` | `int` | `60` | Outer safety timeout applied to sync-over-async Historian operations invoked from the OPC UA stack thread (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`). Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Validator rejects values < 1 and warns if set below `CommandTimeoutSeconds`. Stability review 2026-04-13 Finding 3 |
| `MaxValuesPerRead` | `int` | `10000` | Maximum values returned per `HistoryRead` request |
### Authentication
Controls user authentication and write authorization for the OPC UA server. Defined in `AuthenticationConfiguration`.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `AllowAnonymous` | `bool` | `true` | Accepts anonymous client connections when `true` |
| `AnonymousCanWrite` | `bool` | `true` | Permits anonymous users to write when `true` |
#### LDAP Authentication
When `Ldap.Enabled` is `true`, credentials are validated against the configured LDAP server and group membership determines OPC UA permissions.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Ldap.Enabled` | `bool` | `false` | Enables LDAP authentication |
| `Ldap.Host` | `string` | `localhost` | LDAP server hostname |
| `Ldap.Port` | `int` | `3893` | LDAP server port |
| `Ldap.BaseDN` | `string` | `dc=lmxopcua,dc=local` | Base DN for LDAP operations |
| `Ldap.BindDnTemplate` | `string` | `cn={username},dc=lmxopcua,dc=local` | Bind DN template (`{username}` is replaced) |
| `Ldap.ServiceAccountDn` | `string` | `""` | Service account DN for group lookups |
| `Ldap.ServiceAccountPassword` | `string` | `""` | Service account password |
| `Ldap.TimeoutSeconds` | `int` | `5` | Connection timeout |
| `Ldap.ReadOnlyGroup` | `string` | `ReadOnly` | LDAP group granting read-only access |
| `Ldap.WriteOperateGroup` | `string` | `WriteOperate` | LDAP group granting write access for FreeAccess/Operate attributes |
| `Ldap.WriteTuneGroup` | `string` | `WriteTune` | LDAP group granting write access for Tune attributes |
| `Ldap.WriteConfigureGroup` | `string` | `WriteConfigure` | LDAP group granting write access for Configure attributes |
| `Ldap.AlarmAckGroup` | `string` | `AlarmAck` | LDAP group granting alarm acknowledgment |
#### Permission Model
When LDAP is enabled, LDAP group membership is mapped to OPC UA session role NodeIds during authentication. All authenticated LDAP users can browse and read nodes regardless of group membership. Groups grant additional permissions:
| LDAP Group | Permission |
|---|---|
| ReadOnly | No additional permissions (read-only access) |
| WriteOperate | Write FreeAccess and Operate attributes |
| WriteTune | Write Tune attributes |
| WriteConfigure | Write Configure attributes |
| AlarmAck | Acknowledge alarms |
Users can belong to multiple groups. The `admin` user in the default GLAuth configuration belongs to all three groups.
Write access depends on both the user's role and the Galaxy attribute's security classification. See the [Effective Permission Matrix](Security.md#effective-permission-matrix) in the Security Guide for the full breakdown.
Example configuration:
```json
"Authentication": {
"AllowAnonymous": true,
"AnonymousCanWrite": false,
"Ldap": {
"Enabled": true,
"Host": "localhost",
"Port": 3893,
"BaseDN": "dc=lmxopcua,dc=local",
"BindDnTemplate": "cn={username},dc=lmxopcua,dc=local",
"ServiceAccountDn": "cn=serviceaccount,dc=lmxopcua,dc=local",
"ServiceAccountPassword": "serviceaccount123",
"TimeoutSeconds": 5,
"ReadOnlyGroup": "ReadOnly",
"WriteOperateGroup": "WriteOperate",
"WriteTuneGroup": "WriteTune",
"WriteConfigureGroup": "WriteConfigure",
"AlarmAckGroup": "AlarmAck"
}
}
```
### Security
Controls OPC UA transport security profiles and certificate handling. Defined in `SecurityProfileConfiguration`. See [Security Guide](security.md) for detailed usage.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Profiles` | `List<string>` | `["None"]` | Security profiles to expose. Valid: `None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`, `Aes128_Sha256_RsaOaep-Sign`, `Aes128_Sha256_RsaOaep-SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign`, `Aes256_Sha256_RsaPss-SignAndEncrypt` |
| `AutoAcceptClientCertificates` | `bool` | `true` | Auto-accept untrusted client certificates. Set to `false` in production |
| `RejectSHA1Certificates` | `bool` | `true` | Reject client certificates signed with SHA-1 |
| `MinimumCertificateKeySize` | `int` | `2048` | Minimum RSA key size for client certificates |
| `PkiRootPath` | `string?` | `null` | Override for PKI root directory. Defaults to `%LOCALAPPDATA%\OPC Foundation\pki` |
| `CertificateSubject` | `string?` | `null` | Override for server certificate subject. Defaults to `CN={ServerName}, O=ZB MOM, DC=localhost` |
Example — production deployment with encrypted transport:
```json
"Security": {
"Profiles": ["Basic256Sha256-SignAndEncrypt"],
"AutoAcceptClientCertificates": false,
"RejectSHA1Certificates": true,
"MinimumCertificateKeySize": 2048
}
```
### Redundancy
Controls non-transparent OPC UA redundancy. Defined in `RedundancyConfiguration`. See [Redundancy Guide](Redundancy.md) for detailed usage.
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| `Enabled` | `bool` | `false` | Enables redundancy mode and ServiceLevel computation |
| `Mode` | `string` | `"Warm"` | Redundancy mode: `Warm` or `Hot` |
| `Role` | `string` | `"Primary"` | Instance role: `Primary` (higher ServiceLevel) or `Secondary` |
| `ServerUris` | `List<string>` | `[]` | ApplicationUri values for all servers in the redundant set |
| `ServiceLevelBase` | `int` | `200` | Base ServiceLevel when healthy (1-255). Secondary receives base - 50 |
Example — two-instance redundant pair (Primary):
```json
"Redundancy": {
"Enabled": true,
"Mode": "Warm",
"Role": "Primary",
"ServerUris": ["urn:localhost:LmxOpcUa:instance1", "urn:localhost:LmxOpcUa:instance2"],
"ServiceLevelBase": 200
}
```
## Feature Flags
Three boolean properties act as feature flags that control optional subsystems:
- **`OpcUa.AlarmTrackingEnabled`** -- When `true`, the node manager creates `AlarmConditionState` nodes for alarm attributes and monitors `InAlarm` transitions. Disabled by default because alarm tracking adds per-attribute overhead.
- **`OpcUa.AlarmFilter.ObjectFilters`** -- List of wildcard template-name patterns that scope alarm tracking to matching objects and their descendants. An empty list preserves the current unfiltered behavior; a non-empty list includes an object only when any name in its template derivation chain matches any pattern, then propagates the inclusion to every descendant in the containment hierarchy. `*` is the only wildcard, matching is case-insensitive, and the Galaxy `$` prefix on template names is normalized so operators can write `TestMachine*` instead of `$TestMachine*`. Each list entry may itself contain comma-separated patterns (`"TestMachine*, Pump_*"`) for convenience. When the list is non-empty but `AlarmTrackingEnabled` is `false`, the validator emits a warning because the filter has no effect. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) for the full matching algorithm and telemetry.
- **`Historian.Enabled`** -- When `true`, the service calls `HistorianPluginLoader.TryLoad(config)` to load the `ZB.MOM.WW.OtOpcUa.Historian.Aveva` plugin from the `Historian/` subfolder next to the host exe and registers the resulting `IHistorianDataSource` with the OPC UA server host. Disabled by default because not all deployments have a Historian instance -- when disabled the plugin is not probed and the Wonderware SDK DLLs are not required on the host. If the flag is `true` but the plugin or its SDK dependencies cannot be loaded, the server still starts and every history read returns `BadHistoryOperationUnsupported` with a warning in the log.
- **`GalaxyRepository.ExtendedAttributes`** -- When `true`, the repository loads additional Galaxy attribute metadata beyond the core set needed for the address space. Disabled by default to minimize startup query time.
- **`GalaxyRepository.Scope`** -- When set to `LocalPlatform`, the repository filters the hierarchy and attributes to only include objects hosted by the platform whose `node_name` matches this machine (or the explicit `PlatformName` override). Ancestor areas are retained to keep the browse tree connected. Default is `Galaxy` (load everything). See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter).
## Configuration Validation
`ConfigurationValidator.ValidateAndLog()` runs at the start of `OpcUaService.Start()`. It logs every resolved configuration value at `Information` level and validates required constraints:
- `OpcUa.Port` must be between 1 and 65535
- `OpcUa.GalaxyName` must not be empty
- `MxAccess.ClientName` must not be empty
- `GalaxyRepository.ConnectionString` must not be empty
- `Security.MinimumCertificateKeySize` must be at least 2048
- Unknown security profile names are logged as warnings
- `AutoAcceptClientCertificates = true` emits a warning
- Only-`None` profile configuration emits a warning
- `OpcUa.AlarmFilter.ObjectFilters` is non-empty while `OpcUa.AlarmTrackingEnabled = false` emits a warning (filter has no effect)
- `Historian.ServerName` (or `Historian.ServerNames`) must not be empty when `Historian.Enabled = true`
- `Historian.FailureCooldownSeconds` must be zero or positive
- `Historian.ServerName` is set alongside a non-empty `Historian.ServerNames` emits a warning (single ServerName is ignored)
- `MxAccess.RuntimeStatusUnknownTimeoutSeconds` below 5s emits a warning (below the reasonable floor for MxAccess initial-resolution latency)
- `OpcUa.ApplicationUri` must be set when `Redundancy.Enabled = true`
- `Redundancy.ServiceLevelBase` must be between 1 and 255
- `Redundancy.ServerUris` should contain at least 2 entries when enabled
- Local `ApplicationUri` should appear in `Redundancy.ServerUris`
If validation fails, the service throws `InvalidOperationException` and does not start.
## Test Constructor Pattern
`OpcUaService` provides an `internal` constructor that accepts pre-built dependencies instead of loading `appsettings.json`:
```csharp
internal OpcUaService(
AppConfiguration config,
IMxProxy? mxProxy,
IGalaxyRepository? galaxyRepository,
IMxAccessClient? mxAccessClientOverride = null,
bool hasMxAccessClientOverride = false)
```
Integration tests use this constructor to inject substitute implementations of `IMxProxy`, `IGalaxyRepository`, and `IMxAccessClient`, bypassing the STA thread, COM interop, and SQL Server dependencies. The `hasMxAccessClientOverride` flag tells the service to use the injected `IMxAccessClient` directly instead of creating one from the `IMxProxy` on the STA thread.
## Example appsettings.json
Minimal example:
```json
{
"OpcUa": {
"BindAddress": "0.0.0.0",
"Port": 4840,
"EndpointPath": "/LmxOpcUa",
"ServerName": "LmxOpcUa",
"GalaxyName": "ZB",
"MaxSessions": 100,
"SessionTimeoutMinutes": 30,
"AlarmTrackingEnabled": false,
"AlarmFilter": {
"ObjectFilters": []
},
"ApplicationUri": null
"Serilog": { "MinimumLevel": "Information" },
"Node": {
"NodeId": "node-dev-a",
"ClusterId": "cluster-dev",
"ConfigDbConnectionString": "Server=localhost,14330;Database=OtOpcUaConfig;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;",
"LocalCachePath": "config_cache.db"
},
"MxAccess": {
"ClientName": "LmxOpcUa",
"NodeName": null,
"GalaxyName": null,
"ReadTimeoutSeconds": 5,
"WriteTimeoutSeconds": 5,
"MaxConcurrentOperations": 10,
"MonitorIntervalSeconds": 5,
"AutoReconnect": true,
"ProbeTag": null,
"ProbeStaleThresholdSeconds": 60,
"RuntimeStatusProbesEnabled": true,
"RuntimeStatusUnknownTimeoutSeconds": 15,
"RequestTimeoutSeconds": 30
},
"GalaxyRepository": {
"ConnectionString": "Server=localhost;Database=ZB;Integrated Security=true;",
"ChangeDetectionIntervalSeconds": 30,
"CommandTimeoutSeconds": 30,
"ExtendedAttributes": false,
"Scope": "Galaxy",
"PlatformName": null
},
"Dashboard": {
"Enabled": true,
"Port": 8081,
"RefreshIntervalSeconds": 10
},
"Historian": {
"Enabled": false,
"ServerName": "localhost",
"ServerNames": [],
"FailureCooldownSeconds": 60,
"IntegratedSecurity": true,
"UserName": null,
"Password": null,
"Port": 32568,
"CommandTimeoutSeconds": 30,
"RequestTimeoutSeconds": 60,
"MaxValuesPerRead": 10000
},
"Authentication": {
"AllowAnonymous": true,
"AnonymousCanWrite": true,
"Ldap": {
"Enabled": false
}
},
"Security": {
"Profiles": ["None"],
"AutoAcceptClientCertificates": true,
"RejectSHA1Certificates": true,
"MinimumCertificateKeySize": 2048,
"PkiRootPath": null,
"CertificateSubject": null
},
"Redundancy": {
"Enabled": false,
"Mode": "Warm",
"Role": "Primary",
"ServerUris": [],
"ServiceLevelBase": 200
"OpcUaServer": {
"EndpointUrl": "opc.tcp://0.0.0.0:4840/OtOpcUa",
"ApplicationUri": "urn:node-dev-a:OtOpcUa",
"SecurityProfile": "None",
"AutoAcceptUntrustedClientCertificates": true,
"Ldap": { "Enabled": false }
}
}
```
### OtOpcUa Admin — `src/ZB.MOM.WW.OtOpcUa.Admin/appsettings.json`
| Section | Purpose |
|---|---|
| `ConnectionStrings:ConfigDb` | SQL connection string — must point at the same Config DB every Server reaches. |
| `Authentication:Ldap` | LDAP bind for the Admin login form (same options shape as the Server's `OpcUaServer:Ldap`). |
| `CertTrust` | `CertTrustOptions` — file-system path under the Server's `PkiStoreRoot` so the Admin Certificates page can promote rejected client certs. |
| `Metrics:Prometheus:Enabled` | Toggles the `/metrics` scrape endpoint (default true). |
| `Serilog` | Logging. |
### Galaxy.Host
Environment-variable driven (`OTOPCUA_GALAXY_PIPE`, `OTOPCUA_ALLOWED_SID`, `OTOPCUA_GALAXY_SECRET`, `OTOPCUA_GALAXY_BACKEND`, `OTOPCUA_GALAXY_ZB_CONN`, `OTOPCUA_HISTORIAN_*`). No `appsettings.json` — the supervisor owns the launch environment. See [`ServiceHosting.md`](ServiceHosting.md#galaxyhost-process).
### Environment overrides
Standard .NET config layering applies: `appsettings.{Environment}.json`, then environment variables with `Section__Property` naming. `DOTNET_ENVIRONMENT` (or `ASPNETCORE_ENVIRONMENT` for Admin) selects the overlay.
---
## Authoritative configuration (Config DB)
The Config DB is the single source of truth for every setting that a v1 deployment used to carry in `appsettings.json` as driver-specific state. `OtOpcUaConfigDbContext` (`src/ZB.MOM.WW.OtOpcUa.Configuration/OtOpcUaConfigDbContext.cs`) is the EF Core context used by both the Admin writer and every Server reader.
### Top-level sections operators touch
| Concept | Entity | Admin UI surface | Purpose |
|---|---|---|---|
| Cluster | `ServerCluster` | Clusters pages | Fleet unit; owns nodes, generations, UNS, ACLs. |
| Cluster node | `ClusterNode` + `ClusterNodeCredential` | RedundancyTab, Hosts page | Per-node identity, `RedundancyRole`, `ServiceLevelBase`, ApplicationUri, service-account credentials. |
| Generation | `ConfigGeneration` + `ClusterNodeGenerationState` | Generations / DiffViewer | Append-only; draft → publish workflow (`sp_PublishGeneration`). |
| Namespace | `Namespace` | Namespaces tab | Per-cluster OPC UA namespace; `Kind` = Equipment / SystemPlatform / Simulated. |
| Driver instance | `DriverInstance` | Drivers tab | Configured driver (Modbus, S7, OpcUaClient, Galaxy, …) + `DriverConfig` JSON + resilience profile. |
| Device | `Device` | Under each driver instance | Per-host settings inside a driver instance (IP, port, unit-id…). |
| UNS hierarchy | `UnsArea` + `UnsLine` | UnsTab (drag/drop) | L3 / L4 of the unified namespace. |
| Equipment | `Equipment` | Equipment pages, CSV import | L5; carries `MachineCode`, `ZTag`, `SAPID`, `EquipmentUuid`, reservation-backed external ids. |
| Tag | `Tag` | Under each equipment | Driver-specific tag address + `SecurityClassification` + poll-group assignment. |
| Poll group | `PollGroup` | Driver-scoped | Poll cadence buckets; `PollGroupEngine` in Core.Abstractions uses this at runtime. |
| ACL | `NodeAcl` | AclsTab + Probe dialog | Per-level permission grants, additive only. See [`security.md`](security.md#data-plane-authorization). |
| Role grant | `LdapGroupRoleMapping` | RoleGrants page | Maps LDAP groups → Admin roles (`ConfigViewer` / `ConfigEditor` / `FleetAdmin`). |
| External id reservation | `ExternalIdReservation` | Reservations page | Reservation-backed `ZTag` and `SAPID` uniqueness. |
| Equipment import batch | `EquipmentImportBatch` | CSV import flow | Staged bulk-add with validation preview. |
| Audit log | `ConfigAuditLog` | Audit page | Append-only record of every publish, rollback, credential rotation, role-grant change. |
### Draft → publish generation model
All edits go into a **draft** generation scoped to one cluster. `DraftValidationService` checks invariants (same-cluster FKs, reservation collisions, UNS path consistency, ACL scope validity). When the operator clicks Publish, `sp_PublishGeneration` atomically promotes the draft, records the audit event, and causes every `RedundancyCoordinator.RefreshAsync` in the affected cluster to pick up the new topology + ACL set. The Admin UI `DiffViewer` shows exactly what's changing before publish.
Old generations are retained; rollback is "publish older generation as new". `ConfigAuditLog` makes every change auditable by principal + timestamp.
### Offline cache
Each Server process caches the last-seen published generation in `Node:LocalCachePath` via LiteDB (`LiteDbConfigCache` in `src/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/`). The cache lets a node start without the central DB reachable; once the DB comes back, `NodeBootstrap` syncs to the current generation.
### Full schema reference
For table columns, indexes, stored procedures, the publish-transaction semantics, and the SQL authorization model (per-node SQL principals + `SESSION_CONTEXT` cluster binding), see [`docs/v2/config-db-schema.md`](v2/config-db-schema.md).
### Admin UI flow
For the draft editor, DiffViewer, CSV import, IdentificationFields, RedundancyTab, AclsTab + Probe-this-permission, RoleGrants, and the SignalR real-time surface, see [`docs/v2/admin-ui.md`](v2/admin-ui.md).
---
## Where did v1 appsettings sections go?
Quick index for operators coming from v1 LmxOpcUa:
| v1 appsettings section | v2 home |
|---|---|
| `OpcUa.Port` / `BindAddress` / `EndpointPath` / `ServerName` | Bootstrap `OpcUaServer:EndpointUrl` + `ApplicationName`. |
| `OpcUa.ApplicationUri` | Config DB `ClusterNode.ApplicationUri`. |
| `OpcUa.MaxSessions` / `SessionTimeoutMinutes` | Bootstrap `OpcUaServer:*` (if exposed) or stack defaults. |
| `OpcUa.AlarmTrackingEnabled` / `AlarmFilter` | Per driver instance in Config DB (alarm surface is capability-driven per `IAlarmSource`). |
| `MxAccess.*` | Galaxy driver instance `DriverConfig` JSON + Galaxy.Host env vars (see [`ServiceHosting.md`](ServiceHosting.md#galaxyhost-process)). |
| `GalaxyRepository.*` | Galaxy driver instance `DriverConfig` JSON + `OTOPCUA_GALAXY_ZB_CONN` env var. |
| `Dashboard.*` | Retired — Admin UI replaces the dashboard. See [`StatusDashboard.md`](StatusDashboard.md). |
| `Historian.*` | Galaxy driver instance `DriverConfig` JSON + `OTOPCUA_HISTORIAN_*` env vars. |
| `Authentication.Ldap.*` | Bootstrap `OpcUaServer:Ldap` (same shape) + Admin `Authentication:Ldap` for the UI login. |
| `Security.*` | Bootstrap `OpcUaServer:SecurityProfile` + `PkiStoreRoot` + `AutoAcceptUntrustedClientCertificates`. |
| `Redundancy.*` | Config DB `ClusterNode.RedundancyRole` + `ServiceLevelBase`. |
---
## Validation
- **Bootstrap**: the process fails fast on missing required keys in `Program.cs` (e.g. `Node:NodeId`, `Node:ClusterId`, `Node:ConfigDbConnectionString` all throw `InvalidOperationException` if unset).
- **Authoritative**: `DraftValidationService` runs on every save; `sp_ValidateDraft` runs as part of `sp_PublishGeneration` so an invalid draft cannot reach any node.

View File

@@ -1,84 +1,65 @@
# Data Type Mapping
`MxDataTypeMapper` and `SecurityClassificationMapper` translate Galaxy attribute metadata into OPC UA variable node properties. These mappings determine how Galaxy runtime values are represented to OPC UA clients and whether clients can write to them.
Data-type mapping is driver-defined. Each driver translates its native attribute metadata into two driver-agnostic enums from `Core.Abstractions``DriverDataType` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverDataType.cs`) and `SecurityClassification` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/SecurityClassification.cs`) — and populates the `DriverAttributeInfo` record it hands to `IAddressSpaceBuilder.Variable(...)`. Core doesn't interpret the native types; it trusts the driver's translation.
## mx_data_type to OPC UA Type Mapping
## DriverDataType OPC UA built-in type
Each Galaxy attribute carries an `mx_data_type` integer that identifies its data type. `MxDataTypeMapper.MapToOpcUaDataType` maps these to OPC UA built-in type NodeIds:
`DriverNodeManager.MapDataType` (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`) is the single translation table for every driver:
| mx_data_type | Galaxy type | OPC UA type | NodeId | CLR type |
|:---:|-------------|-------------|:------:|----------|
| 1 | Boolean | Boolean | i=1 | `bool` |
| 2 | Integer | Int32 | i=6 | `int` |
| 3 | Float | Float | i=10 | `float` |
| 4 | Double | Double | i=11 | `double` |
| 5 | String | String | i=12 | `string` |
| 6 | Time | DateTime | i=13 | `DateTime` |
| 7 | ElapsedTime | Double | i=11 | `double` |
| 8 | Reference | String | i=12 | `string` |
| 13 | Enumeration | Int32 | i=6 | `int` |
| 14 | Custom | String | i=12 | `string` |
| 15 | InternationalizedString | LocalizedText | i=21 | `string` |
| 16 | Custom | String | i=12 | `string` |
| other | Unknown | String | i=12 | `string` |
| DriverDataType | OPC UA NodeId |
|---|---|
| `Boolean` | `DataTypeIds.Boolean` (i=1) |
| `Int32` | `DataTypeIds.Int32` (i=6) |
| `Float32` | `DataTypeIds.Float` (i=10) |
| `Float64` | `DataTypeIds.Double` (i=11) |
| `String` | `DataTypeIds.String` (i=12) |
| `DateTime` | `DataTypeIds.DateTime` (i=13) |
| anything else | `DataTypeIds.BaseDataType` |
Unknown types default to String. This is a safe fallback because MXAccess delivers values as COM `VARIANT` objects, and string serialization preserves any value that does not have a direct OPC UA counterpart.
The enum also carries `Int16 / Int64 / UInt16 / UInt32 / UInt64 / Reference` members for drivers that need them; the mapping table is extended as those types surface in actual drivers. `Reference` is the Galaxy-style attribute reference — it's encoded as an OPC UA `String` on the wire.
### Why ElapsedTime maps to Double
## Per-driver mappers
Galaxy `ElapsedTime` (mx_data_type 7) represents a duration/timespan. OPC UA has no native `TimeSpan` type. The OPC UA specification defines a `Duration` type alias (NodeId i=290) that is semantically a `Double` representing milliseconds, but the simpler approach is to map directly to `Double` (i=11) representing seconds. This avoids ambiguity about whether the value is in seconds or milliseconds and matches how the Galaxy runtime exposes elapsed time values through MXAccess.
Each driver owns its native → `DriverDataType` translation:
## Array Handling
- **Galaxy Proxy** — `GalaxyProxyDriver.MapDataType(int mxDataType)` and `MapSecurity(int mxSec)` (inline in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/GalaxyProxyDriver.cs`). The Galaxy `mx_data_type` integer is sent across the Host↔Proxy pipe and mapped on the Proxy side. Galaxy's full classic 16-entry table (Boolean / Integer / Float / Double / String / Time / ElapsedTime / Reference / Enumeration / Custom / InternationalizedString) is preserved but compressed into the seven-entry `DriverDataType` enum — `ElapsedTime``Float64`, `InternationalizedString``String`, `Reference``Reference`, enumerations → `Int32`.
- **AB CIP** — `src/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipDataType.cs` maps CIP tag type codes.
- **Modbus** — `src/ZB.MOM.WW.OtOpcUa.Driver.Modbus/ModbusDriver.cs` maps register shapes (16-bit signed, 16-bit unsigned, 32-bit float, etc.) including the DirectLogic quirk table in `DirectLogicAddress.cs`.
- **S7 / AB Legacy / TwinCAT / FOCAS / OPC UA Client** — each has its own inline mapper or `*DataType.cs` file per the same pattern.
Galaxy attributes with `is_array = 1` in the repository are exposed as one-dimensional OPC UA array variables.
The driver's mapping is authoritative — when a field type is ambiguous (a `LREAL` that could be bit-reinterpreted, a BCD counter, a string of a particular encoding), the driver decides the exposed OPC UA shape.
### ValueRank
## Array handling
The `ValueRank` property on the OPC UA variable node indicates the array dimensionality:
`DriverAttributeInfo.IsArray = true` flips `ValueRank = OneDimension` on the generated `BaseDataVariableState`; scalars stay at `ValueRank.Scalar`. `DriverAttributeInfo.ArrayDim` carries the declared length. Writing element-by-element (OPC UA `IndexRange`) is a driver-level decision — see `docs/ReadWriteOperations.md`.
| `is_array` | ValueRank | Constant |
|:---:|:---------:|----------|
| 0 | -1 | `ValueRanks.Scalar` |
| 1 | 1 | `ValueRanks.OneDimension` |
## SecurityClassification — metadata, not ACL
### ArrayDimensions
`SecurityClassification` is driver-reported metadata only. Drivers never enforce write permissions themselves — the classification flows into the Server project where `WriteAuthzPolicy.IsAllowed(classification, userRoles)` (`src/ZB.MOM.WW.OtOpcUa.Server/Security/WriteAuthzPolicy.cs`) gates the write against the session's LDAP-derived roles, and (Phase 6.2) the `AuthorizationGate` + permission trie apply on top. This is the "ACL at server layer" invariant recorded in `feedback_acl_at_server_layer.md`.
When `ValueRank = 1`, the `ArrayDimensions` property is set to a single-element `ReadOnlyList<uint>` containing the declared array length from `array_dimension`:
The classification values mirror the v1 Galaxy model so existing Galaxy galaxies keep their published semantics:
```csharp
if (attr.IsArray && attr.ArrayDimension.HasValue)
{
variable.ArrayDimensions = new ReadOnlyList<uint>(
new List<uint> { (uint)attr.ArrayDimension.Value });
}
```
| SecurityClassification | Required role | Write-from-OPC-UA |
|---|---|---|
| `FreeAccess` | — | yes (even anonymous) |
| `Operate` | `WriteOperate` | yes |
| `Tune` | `WriteTune` | yes |
| `Configure` | `WriteConfigure` | yes |
| `SecuredWrite` | `WriteOperate` | yes |
| `VerifiedWrite` | `WriteConfigure` | yes |
| `ViewOnly` | — | no |
The `array_dimension` value is extracted from the `mx_value` binary column in the Galaxy database (bytes 13-16, little-endian int32).
Drivers whose backend has no notion of classification (Modbus, most PLCs) default every tag to `FreeAccess` or `Operate`; drivers whose backend does carry the notion (Galaxy, OPC UA Client relaying `UserAccessLevel`) translate it directly.
### NodeId for array variables
## Historization
Array variables use a NodeId without the `[]` suffix. The `full_tag_reference` stored internally for MXAccess addressing retains the `[]` (e.g., `MESReceiver_001.MoveInPartNumbers[]`), but the OPC UA NodeId strips it to `ns=1;s=MESReceiver_001.MoveInPartNumbers`.
## Security Classification to AccessLevel Mapping
Galaxy attributes carry a `security_classification` value that controls write permissions. `SecurityClassificationMapper.IsWritable` determines the OPC UA `AccessLevel`:
| security_classification | Galaxy level | OPC UA AccessLevel | Writable |
|:---:|--------------|-------------------|:--------:|
| 0 | FreeAccess | CurrentReadOrWrite | Yes |
| 1 | Operate | CurrentReadOrWrite | Yes |
| 2 | SecuredWrite | CurrentRead | No |
| 3 | VerifiedWrite | CurrentRead | No |
| 4 | Tune | CurrentReadOrWrite | Yes |
| 5 | Configure | CurrentReadOrWrite | Yes |
| 6 | ViewOnly | CurrentRead | No |
Most attributes default to Operate (1). The mapper treats SecuredWrite, VerifiedWrite, and ViewOnly as read-only because the OPC UA server does not implement the Galaxy's multi-level authentication model. Allowing writes to SecuredWrite or VerifiedWrite attributes without proper verification would bypass Galaxy security.
For historized attributes, `AccessLevels.HistoryRead` is added to the access level via bitwise OR, enabling OPC UA history read requests when an `IHistorianDataSource` is configured via the runtime-loaded historian plugin.
`DriverAttributeInfo.IsHistorized = true` flips `AccessLevel.HistoryRead` and `Historizing = true` on the variable. The driver must then implement `IHistoryProvider` for HistoryRead service calls to succeed; otherwise the node manager surfaces `BadHistoryOperationUnsupported` per request.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/MxDataTypeMapper.cs` -- Type and CLR mapping
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/SecurityClassificationMapper.cs` -- Write access mapping
- `gr/data_type_mapping.md` -- Reference documentation for the full mapping table
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverDataType.cs` — driver-agnostic type enum
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/SecurityClassification.cs` — write-authz tier metadata
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverAttributeInfo.cs` — per-attribute descriptor
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs``MapDataType` translation
- `src/ZB.MOM.WW.OtOpcUa.Server/Security/WriteAuthzPolicy.cs` — classification-to-role policy
- Per-driver mappers in each `Driver.*` project

View File

@@ -1,228 +1,109 @@
# Historical Data Access
`LmxNodeManager` exposes OPC UA historical data access (HDA) through an abstract `IHistorianDataSource` interface (`Historian/IHistorianDataSource.cs`). The Wonderware Historian implementation lives in a separate assembly, `ZB.MOM.WW.OtOpcUa.Historian.Aveva`, which is loaded at runtime only when `Historian.Enabled=true`. This keeps the `aahClientManaged` SDK out of the core Host so deployments that do not need history do not need the SDK installed.
OPC UA HistoryRead is a **per-driver optional capability** in OtOpcUa. The Core dispatches HistoryRead service calls to the owning driver through the `IHistoryProvider` capability interface (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IHistoryProvider.cs`). Drivers that don't implement the interface return `BadHistoryOperationUnsupported` for every history call on their nodes; that is the expected behavior for protocol drivers (Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS) whose wire protocols carry no time-series data.
## Plugin Architecture
Historian integration is no longer a separate bolt-on assembly, as it was in v1 (`ZB.MOM.WW.LmxOpcUa.Historian.Aveva` plugin). It is now one optional capability any driver can implement. The first implementation is the Galaxy driver's Wonderware Historian integration; OPC UA Client forwards HistoryRead to the upstream server. Every other driver leaves the capability unimplemented and the Core short-circuits history calls on nodes that belong to those drivers.
The historian surface is split across two assemblies:
## `IHistoryProvider`
- **`ZB.MOM.WW.OtOpcUa.Host`** (core) owns only OPC UA / BCL types:
- `IHistorianDataSource` -- the interface `LmxNodeManager` depends on
- `HistorianEventDto` -- SDK-free representation of a historian event record
- `HistorianAggregateMap` -- maps OPC UA aggregate NodeIds to AnalogSummary column names
- `HistorianPluginLoader` -- loads the plugin via `Assembly.LoadFrom` at startup
- `HistoryContinuationPointManager` -- paginates HistoryRead results
- **`ZB.MOM.WW.OtOpcUa.Historian.Aveva`** (plugin) owns everything SDK-bound:
- `HistorianDataSource` -- implements `IHistorianDataSource`, wraps `aahClientManaged`
- `IHistorianConnectionFactory` / `SdkHistorianConnectionFactory` -- opens and polls `ArchestrA.HistorianAccess` connections
- `AvevaHistorianPluginEntry.Create(HistorianConfiguration)` -- the static factory invoked by the loader
Four methods, mapping onto the four OPC UA HistoryRead service variants:
The plugin assembly and its SDK dependencies (`aahClientManaged.dll`, `aahClient.dll`, `aahClientCommon.dll`, `Historian.CBE.dll`, `Historian.DPAPI.dll`, `ArchestrA.CloudHistorian.Contract.dll`) deploy to a `Historian/` subfolder next to `ZB.MOM.WW.OtOpcUa.Host.exe`. See [Service Hosting](ServiceHosting.md#required-runtime-assemblies) for the full layout and deployment matrix.
| Method | OPC UA service | Notes |
|--------|----------------|-------|
| `ReadRawAsync` | HistoryReadRawModified (raw subset) | Returns `HistoryReadResult { Samples, ContinuationPoint? }`. The Core handles `ContinuationPoint` pagination. |
| `ReadProcessedAsync` | HistoryReadProcessed | Takes a `HistoryAggregateType` (Average / Minimum / Maximum / Total / Count) and a bucket `interval`. Drivers that can't express an aggregate throw `NotSupportedException`; the Core translates that into `BadAggregateNotSupported`. |
| `ReadAtTimeAsync` | HistoryReadAtTime | Default implementation throws `NotSupportedException` — drivers without interpolation / prior-boundary support leave the default. |
| `ReadEventsAsync` | HistoryReadEvents | Historical alarm/event rows, distinct from the live `IAlarmSource` stream. Default throws; only drivers with an event historian (Galaxy's A&E log) override. |
## Plugin Loading
Supporting DTOs live alongside the interface in `Core.Abstractions`:
When the service starts with `Historian.Enabled=true`, `OpcUaService` calls `HistorianPluginLoader.TryLoad(config)`. The loader:
- `HistoryReadResult(IReadOnlyList<DataValueSnapshot> Samples, byte[]? ContinuationPoint)`
- `HistoryAggregateType` — enum `{ Average, Minimum, Maximum, Total, Count }`
- `HistoricalEvent(EventId, SourceName?, EventTimeUtc, ReceivedTimeUtc, Message?, Severity)`
- `HistoricalEventsResult(IReadOnlyList<HistoricalEvent> Events, byte[]? ContinuationPoint)`
1. Probes `AppDomain.CurrentDomain.BaseDirectory\Historian\ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll`.
2. Installs a one-shot `AppDomain.AssemblyResolve` handler that redirects any `aahClientManaged`/`aahClientCommon`/`Historian.*` lookups to the same subfolder, so the CLR can resolve SDK dependencies when the plugin first JITs.
3. Calls the plugin's `AvevaHistorianPluginEntry.Create(HistorianConfiguration)` via reflection and returns the resulting `IHistorianDataSource`.
4. On any failure (plugin missing, entry type not found, SDK assembly unresolvable, bad image), logs a warning with the expected plugin path and returns `null`. The server starts normally and `LmxNodeManager` returns `BadHistoryOperationUnsupported` for every history call.
## Dispatch through `CapabilityInvoker`
## Wonderware Historian SDK
All four HistoryRead surfaces are wrapped by `CapabilityInvoker` (`Core/Resilience/CapabilityInvoker.cs`) with `DriverCapability.HistoryRead`. The Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability.HistoryRead)` provides timeout, circuit-breaker, and bulkhead defaults per the driver's stability tier (see [docs/v2/driver-stability.md](v2/driver-stability.md)).
The plugin uses the AVEVA Historian managed SDK (`aahClientManaged.dll`) to query historical data. The SDK provides a cursor-based query API through `ArchestrA.HistorianAccess`, replacing direct SQL queries against the Historian Runtime database. Two query types are used:
The dispatch point is `DriverNodeManager` in `ZB.MOM.WW.OtOpcUa.Server`. When the OPC UA stack calls `HistoryRead`, the node manager:
- **`HistoryQuery`** -- Raw historical samples with timestamp, value (numeric or string), and OPC quality.
- **`AnalogSummaryQuery`** -- Pre-computed aggregates with properties for Average, Minimum, Maximum, ValueCount, First, Last, StdDev, and more.
1. Resolves the target `NodeHandle` to a `(DriverInstanceId, fullReference)` pair.
2. Checks the owning driver's `DriverTypeMetadata` to see if the type may advertise history at all (fast reject for types that never implement `IHistoryProvider`).
3. If the driver instance implements `IHistoryProvider`, wraps the `ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` call in `CapabilityInvoker.InvokeAsync(... DriverCapability.HistoryRead ...)`.
4. Translates the `HistoryReadResult` into an OPC UA `HistoryData` + `ExtensionObject`.
5. Manages the continuation point via `HistoryContinuationPointManager` so clients can page through large result sets.
The SDK DLLs are located in `lib/` and originate from `C:\Program Files (x86)\Wonderware\Historian\`. Only the plugin project (`src/ZB.MOM.WW.OtOpcUa.Historian.Aveva/`) references them at build time; the core Host project does not.
Driver-level history code never sees the continuation-point protocol or the OPC UA stack types — those stay in the Core.
## Configuration
## Driver coverage
`HistorianConfiguration` controls the SDK connection:
| Driver | Implements `IHistoryProvider`? | Source |
|--------|:------------------------------:|--------|
| Galaxy | Yes — raw, processed, at-time, events | `aahClientManaged` SDK (Wonderware Historian) on the Host side, forwarded through the Proxy's IPC |
| OPC UA Client | Yes — raw, processed, at-time, events (forwarded to upstream) | `Opc.Ua.Client.Session.HistoryRead` against the remote server |
| Modbus | No | Wire protocol has no time-series concept |
| Siemens S7 | No | S7comm has no time-series concept |
| AB CIP | No | CIP has no time-series concept |
| AB Legacy | No | PCCC has no time-series concept |
| TwinCAT | No | ADS symbol reads are point-in-time; archiving is an external concern |
| FOCAS | No | Default — FOCAS has no general-purpose historian API |
```csharp
public class HistorianConfiguration
{
public bool Enabled { get; set; } = false;
public string ServerName { get; set; } = "localhost";
public List<string> ServerNames { get; set; } = new();
public int FailureCooldownSeconds { get; set; } = 60;
public bool IntegratedSecurity { get; set; } = true;
public string? UserName { get; set; }
public string? Password { get; set; }
public int Port { get; set; } = 32568;
public int CommandTimeoutSeconds { get; set; } = 30;
public int MaxValuesPerRead { get; set; } = 10000;
public int RequestTimeoutSeconds { get; set; } = 60;
}
```
## Galaxy — Wonderware Historian (`aahClientManaged`)
When `Enabled` is `false`, `HistorianPluginLoader.TryLoad` is not called, no plugin is loaded, and the node manager returns `BadHistoryOperationUnsupported` for history read requests. When `Enabled` is `true` but the plugin cannot be loaded (missing `Historian/` subfolder, SDK assembly resolve failure, etc.), the server still starts and returns the same `BadHistoryOperationUnsupported` status with a warning in the log.
The Galaxy driver's `IHistoryProvider` implementation lives on the Host side (`.NET 4.8 x86`) in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Historian/`. The Proxy's `GalaxyProxyDriver.ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` each serializes a `HistoryRead*Request` and awaits the matching `HistoryRead*Response` over the named pipe (see [drivers/Galaxy.md](drivers/Galaxy.md#ipc-transport)).
### Connection Properties
Host-side, `HistorianDataSource` uses the AVEVA Historian managed SDK (`aahClientManaged.dll`) to query historical data via a cursor-based API through `ArchestrA.HistorianAccess`:
| Property | Default | Description |
|---|---|---|
| `ServerName` | `localhost` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments |
| `ServerNames` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover (see [Cluster Failover](#read-only-cluster-failover)) |
| `FailureCooldownSeconds` | `60` | How long a failed cluster node is skipped before being re-tried. Zero means no cooldown (retry on every request) |
| `IntegratedSecurity` | `true` | Use Windows authentication |
| `UserName` | `null` | Username when `IntegratedSecurity` is false |
| `Password` | `null` | Password when `IntegratedSecurity` is false |
| `Port` | `32568` | Historian TCP port |
| `CommandTimeoutSeconds` | `30` | SDK packet timeout in seconds (inner async bound) |
| `RequestTimeoutSeconds` | `60` | Outer safety timeout applied to sync-over-async history reads on the OPC UA stack thread. Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Should be greater than `CommandTimeoutSeconds`. Stability review 2026-04-13 Finding 3 |
| `MaxValuesPerRead` | `10000` | Maximum values per history read request |
- **`HistoryQuery`** — raw historical samples (timestamp, value, OPC quality)
- **`AnalogSummaryQuery`** — pre-computed aggregates (Average, Minimum, Maximum, ValueCount, First, Last, StdDev)
## Connection Lifecycle
The SDK DLLs are pulled into the Galaxy.Host project at build time; the Server and every other driver project remain SDK-free.
`HistorianDataSource` (in the plugin assembly) maintains a persistent connection to the Historian server via `ArchestrA.HistorianAccess`:
> **Gap / status note.** The raw SDK wrapper (`HistorianDataSource`, `HistorianClusterEndpointPicker`, `HistorianHealthSnapshot`, etc.) has been ported from the v1 `ZB.MOM.WW.LmxOpcUa.Historian.Aveva` plugin into `Driver.Galaxy.Host/Backend/Historian/`. The **IPC wire-up** — `HistoryReadRequest` / `HistoryReadResponse` message kinds, Proxy-side `ReadRawAsync` / `ReadProcessedAsync` / `ReadAtTimeAsync` / `ReadEventsAsync` forwarding — is in place on `GalaxyProxyDriver`. What remains to close on a given branch is Host-side **mapping of `HistoryAggregateType` onto the `AnalogSummaryQuery` column names** (done in `GalaxyProxyDriver.MapAggregateToColumn`; the Host side must mirror it) and the **end-to-end integration test** that was held by the v1 plugin suite. Until those land on a given driver branch, history calls against Galaxy may surface `GalaxyIpcException { Code = "not-implemented" }` or backend-specific errors rather than populated `HistoryReadResult`s. Track the remaining work against the Phase 2 Galaxy out-of-process gate in `docs/v2/plan.md`.
1. **Lazy connect** -- The connection is established on the first query via `EnsureConnected()`. When a cluster is configured, the data source iterates `HistorianClusterEndpointPicker.GetHealthyNodes()` in order and returns the first node that successfully connects.
2. **Connection reuse** -- Subsequent queries reuse the same connection. The active node is tracked in `_activeProcessNode` / `_activeEventNode` and surfaced on the dashboard.
3. **Auto-reconnect** -- On connection failure, the connection is disposed, the active node is marked failed in the picker, and the next query re-enters the picker loop to try the next eligible candidate.
4. **Clean shutdown** -- `Dispose()` closes the connection when the service stops.
### Aggregate function mapping
The connection is opened with `ReadOnly = true` and `ConnectionType = Process`. The event (alarm history) path uses a separate connection with `ConnectionType = Event`, but both silos share the same cluster picker so a node that fails on one silo is immediately skipped on the other.
`GalaxyProxyDriver.MapAggregateToColumn` (Proxy-side) translates the OPC UA Part 13 standard aggregate enum onto `AnalogSummaryQuery` column names consumed by `HistorianDataSource.ReadAggregateAsync`:
## Read-Only Cluster Failover
| `HistoryAggregateType` | Result Property |
|------------------------|-----------------|
| `Average` | `Average` |
| `Minimum` | `Minimum` |
| `Maximum` | `Maximum` |
| `Count` | `ValueCount` |
When `HistorianConfiguration.ServerNames` is non-empty, the plugin picks from an ordered list of cluster nodes instead of a single `ServerName`. Each connection attempt tries candidates in configuration order until one succeeds. Failed nodes are placed into a timed cooldown and re-admitted when the cooldown elapses.
`HistoryAggregateType.Total` is **not supported** by Wonderware `AnalogSummary` and raises `NotSupportedException`, which the Core translates to `BadAggregateNotSupported`. Additional OPC UA aggregates (`Start`, `End`, `StandardDeviationPopulation`) sit on the Historian columns `First`, `Last`, `StdDev` and can be exposed by extending the enum + mapping together.
### HistorianClusterEndpointPicker
### Read-only cluster failover
The picker (in the plugin assembly, internal) is pure logic with no SDK dependency — all cluster behavior is unit-testable with a fake clock and scripted factory. Key characteristics:
`HistorianConfiguration.ServerNames` accepts an ordered list of cluster nodes. `HistorianClusterEndpointPicker` iterates the list in configuration order, marks failed nodes with a `FailureCooldownSeconds` window, and re-admits them when the cooldown elapses. One picker instance is shared by the process-values connection and the event-history connection (two SDK silos), so a node failure on one silo immediately benches it for the other. `FailureCooldownSeconds = 0` disables the cooldown — the SDK's own retry semantics are the sole gate.
- **Ordered iteration**: nodes are tried in the exact order they appear in `ServerNames`. Operators can express a preference ("primary first, fallback second") by ordering the list.
- **Per-node cooldown**: `MarkFailed(node, error)` starts a `FailureCooldownSeconds` window during which the node is skipped from `GetHealthyNodes()`. `MarkHealthy(node)` clears the window immediately (used on successful connect).
- **Automatic re-admission**: when a node's cooldown elapses, the next call to `GetHealthyNodes()` includes it automatically — no background probe, no manual reset. The cumulative `FailureCount` and `LastError` are retained for operator diagnostics.
- **Thread-safe**: a single lock guards the per-node state. Operations are microsecond-scale so contention is a non-issue.
- **Shared across silos**: one picker instance is shared by the process-values connection and the event-history connection, so a node failure on one path immediately benches it for the other.
- **Zero cooldown mode**: `FailureCooldownSeconds = 0` disables the cooldown entirely — the node is never benched. Useful for tests or for operators who want the SDK's own retry semantics to be the sole gate.
Host-side cluster health is surfaced via `HistorianHealthSnapshot { NodeCount, HealthyNodeCount, ActiveProcessNode, ActiveEventNode, Nodes }` and forwarded to the Proxy so the Admin UI Historian panel can render a per-node table. `HealthCheckService` flips overall service health to `Degraded` when `HealthyNodeCount < NodeCount`.
### Connection attempt flow
### Runtime health counters
`HistorianDataSource.ConnectToAnyHealthyNode(HistorianConnectionType)` performs the actual iteration:
`HistorianDataSource` maintains per-read counters — `TotalQueries`, `TotalSuccesses`, `TotalFailures`, `ConsecutiveFailures`, `LastSuccessTime`, `LastFailureTime`, `LastError`, `ProcessConnectionOpen`, `EventConnectionOpen` — so the dashboard can distinguish "backend loaded but never queried" from "backend loaded and queries are failing". `LastError` is prefixed with the read path (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which silo is broken. `HealthCheckService` degrades at `ConsecutiveFailures >= 3`.
1. Snapshot healthy nodes from the picker. If empty, throw `InvalidOperationException` with either "No historian nodes configured" or "All N historian nodes are in cooldown".
2. For each candidate, clone `HistorianConfiguration` with the candidate as `ServerName` and pass it to the factory. On success: `MarkHealthy(node)` and return the `(Connection, Node)` tuple. On exception: `MarkFailed(node, ex.Message)`, log a warning, continue.
3. If all candidates fail, wrap the last inner exception in an `InvalidOperationException` with the cumulative failure count so the existing read-method catch blocks surface a meaningful error through the health counters.
### Quality mapping
The wrapping exception intentionally includes the last inner error message in the outer `Message` so the health snapshot's `LastError` field is still human-readable when the cluster exhausts every candidate.
### Single-node backward compatibility
When `ServerNames` is empty, the picker is seeded with a single entry from `ServerName` and the iteration loop still runs — it just has one candidate. Legacy deployments see no behavior change: the picker marks the single node healthy on success, runs the same cooldown logic on failure, and the dashboard renders a compact `Node: <hostname>` line instead of the cluster table.
### Cluster health surface
Runtime cluster state is exposed on `HistorianHealthSnapshot`:
- `NodeCount` / `HealthyNodeCount` -- size of the configured cluster and how many are currently eligible.
- `ActiveProcessNode` / `ActiveEventNode` -- which nodes are currently serving the two connection silos, or `null` when a silo has no open connection.
- `Nodes: List<HistorianClusterNodeState>` -- per-node state with `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime`.
The dashboard renders this as a cluster table when `NodeCount > 1`. See [Status Dashboard](StatusDashboard.md#historian). `HealthCheckService` flips the overall service health to `Degraded` when `HealthyNodeCount < NodeCount` so operators can alert on a partially-failed cluster even while queries are still succeeding via the remaining nodes.
## Runtime Health Counters
`HistorianDataSource` maintains runtime query counters updated on every read method exit — success or failure — so the dashboard can distinguish "plugin loaded but never queried" from "plugin loaded and queries are failing". The load-time `HistorianPluginLoader.LastOutcome` only reports whether the assembly resolved at startup; it cannot catch a connection that succeeds at boot and degrades later.
### Counters
- `TotalQueries` / `TotalSuccesses` / `TotalFailures` — cumulative since startup. Every call to `RecordSuccess` or `RecordFailure` in the read methods updates these under `_healthLock`. Empty result sets count as successes — the counter reflects "the SDK call returned" rather than "the SDK call returned data".
- `ConsecutiveFailures` — latches while queries are failing; reset to zero by the first success. Drives `HealthCheckService` degradation at threshold 3.
- `LastSuccessTime` / `LastFailureTime` — UTC timestamps of the most recent success or failure, or `null` when no query of that outcome has occurred yet.
- `LastError` — exception message from the most recent failure, prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call is broken. Cleared on the next success.
- `ProcessConnectionOpen` / `EventConnectionOpen` — whether the plugin currently holds an open SDK connection on each silo. Read from the data source's `_connection` / `_eventConnection` fields via a `Volatile.Read`.
These fields are read once per dashboard refresh via `IHistorianDataSource.GetHealthSnapshot()` and serialized into `HistorianStatusInfo`. See [Status Dashboard](StatusDashboard.md#historian) for the HTML/JSON surface.
### Two SDK connection silos
The plugin maintains two independent `ArchestrA.HistorianAccess` connections, one per `HistorianConnectionType`:
- **Process connection** (`ConnectionType = Process`) — serves historical *value* queries: `ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`. This is the SDK's query channel for tags stored in the Historian runtime.
- **Event connection** (`ConnectionType = Event`) — serves historical *event/alarm* queries: `ReadEventsAsync`. The SDK requires a separately opened connection for its event store because the query API and wire schema are distinct from value queries.
Both connections are lazy: they open on the first query that needs them. Either can be open, closed, or open against a different cluster node than the other. The dashboard renders both independently in the Historian panel (`Process Conn: open (host-a) | Event Conn: closed`) so operators can tell which silos are active and which node is serving each. When cluster support is configured, both silos share the same `HistorianClusterEndpointPicker`, so a failure on one silo marks the node unhealthy for the other as well.
## Raw Reads
`IHistorianDataSource.ReadRawAsync` (plugin implementation) uses a `HistoryQuery` to retrieve individual samples within a time range:
1. Create a `HistoryQuery` via `_connection.CreateHistoryQuery()`
2. Configure `HistoryQueryArgs` with `TagNames`, `StartDateTime`, `EndDateTime`, and `RetrievalMode = Full`
3. Iterate: `StartQuery` -> `MoveNext` loop -> `EndQuery`
Each result row is converted to an OPC UA `DataValue`:
- `QueryResult.Value` (double) takes priority; `QueryResult.StringValue` is used as fallback for string-typed tags.
- `SourceTimestamp` and `ServerTimestamp` are both set to `QueryResult.StartDateTime`.
- `StatusCode` is mapped from the `QueryResult.OpcQuality` (UInt16) via `QualityMapper` (the same OPC DA quality byte mapping used for live MXAccess data).
## Aggregate Reads
`IHistorianDataSource.ReadAggregateAsync` (plugin implementation) uses an `AnalogSummaryQuery` to retrieve pre-computed aggregates:
1. Create an `AnalogSummaryQuery` via `_connection.CreateAnalogSummaryQuery()`
2. Configure `AnalogSummaryQueryArgs` with `TagNames`, `StartDateTime`, `EndDateTime`, and `Resolution` (milliseconds)
3. Iterate the same `StartQuery` -> `MoveNext` -> `EndQuery` pattern
4. Extract the requested aggregate from named properties on `AnalogSummaryQueryResult`
Null aggregate values return `BadNoData` status rather than `Good` with a null variant.
## Quality Mapping
The Historian SDK returns standard OPC DA quality values in `QueryResult.OpcQuality` (UInt16). The low byte is passed through the shared `QualityMapper` pipeline (`MapFromMxAccessQuality` -> `MapToOpcUaStatusCode`), which maps the OPC DA quality families to OPC UA status codes:
The Historian SDK returns standard OPC DA quality values in `QueryResult.OpcQuality` (UInt16). The low byte flows through the shared `QualityMapper` pipeline (`MapFromMxAccessQuality``MapToOpcUaStatusCode`):
| OPC Quality Byte | OPC DA Family | OPC UA StatusCode |
|---|---|---|
|------------------|---------------|-------------------|
| 0-63 | Bad | `Bad` (with sub-code when an exact enum match exists) |
| 64-191 | Uncertain | `Uncertain` (with sub-code when an exact enum match exists) |
| 192+ | Good | `Good` (with sub-code when an exact enum match exists) |
See `Domain/QualityMapper.cs` and `Domain/Quality.cs` for the full mapping table and sub-code definitions.
See `Domain/QualityMapper.cs` and `Domain/Quality.cs` in `Driver.Galaxy.Host` for the full table.
## Aggregate Function Mapping
## OPC UA Client — upstream forwarding
`HistorianAggregateMap.MapAggregateToColumn` (in the core Host assembly, so the node manager can validate aggregate support without requiring the plugin to be loaded) translates OPC UA aggregate NodeIds to `AnalogSummaryQueryResult` property names:
The OPC UA Client driver (`Driver.OpcUaClient`) implements `IHistoryProvider` by forwarding each call to the upstream server via `Session.HistoryRead`. Raw / processed / at-time / events map onto the stack's native HistoryRead details types. Continuation points are passed through — the Core's `HistoryContinuationPointManager` treats the driver as an opaque pager.
| OPC UA Aggregate | Result Property |
|---|---|
| `AggregateFunction_Average` | `Average` |
| `AggregateFunction_Minimum` | `Minimum` |
| `AggregateFunction_Maximum` | `Maximum` |
| `AggregateFunction_Count` | `ValueCount` |
| `AggregateFunction_Start` | `First` |
| `AggregateFunction_End` | `Last` |
| `AggregateFunction_StandardDeviationPopulation` | `StdDev` |
## Historizing flag and AccessLevel
Unsupported aggregates return `null`, which causes the node manager to return `BadAggregateNotSupported`.
## HistoryReadRawModified Override
`LmxNodeManager` overrides `HistoryReadRawModified` to handle raw history read requests:
1. Resolve the `NodeHandle` to a tag reference via `_nodeIdToTagReference`. Return `BadNodeIdUnknown` if not found.
2. Check that `_historianDataSource` is not null. Return `BadHistoryOperationUnsupported` if historian is disabled.
3. Call `ReadRawAsync` with the time range and `NumValuesPerNode` from the `ReadRawModifiedDetails`.
4. Pack the resulting `DataValue` list into a `HistoryData` object and wrap it in an `ExtensionObject` for the `HistoryReadResult`.
## HistoryReadProcessed Override
`HistoryReadProcessed` handles aggregate history requests with additional validation:
1. Resolve the node and check historian availability (same as raw).
2. Validate that `AggregateType` is present in the `ReadProcessedDetails`. Return `BadAggregateListMismatch` if empty.
3. Map the requested aggregate to a result property via `MapAggregateToColumn`. Return `BadAggregateNotSupported` if unmapped.
4. Call `ReadAggregateAsync` with the time range, `ProcessingInterval`, and property name.
5. Return results in the same `HistoryData` / `ExtensionObject` format.
## Historizing Flag and AccessLevel
During variable node creation in `CreateAttributeVariable`, attributes with `IsHistorized == true` receive two additional settings:
During variable node creation, drivers that advertise history set:
```csharp
if (attr.IsHistorized)
@@ -230,7 +111,13 @@ if (attr.IsHistorized)
variable.Historizing = attr.IsHistorized;
```
- **`Historizing = true`** -- Tells OPC UA clients that this node has historical data available.
- **`AccessLevels.HistoryRead`** -- Enables the `HistoryRead` access bit on the node, which the OPC UA stack checks before routing history requests to the node manager override. Nodes without this bit set will be rejected by the framework before reaching `HistoryReadRawModified` or `HistoryReadProcessed`.
- **`Historizing = true`** — tells OPC UA clients that the node has historical data available.
- **`AccessLevels.HistoryRead`** — enables the `HistoryRead` access bit. The OPC UA stack checks this bit before routing history requests to the Core dispatcher; nodes without it are rejected before reaching `IHistoryProvider`.
The `IsHistorized` flag originates from the Galaxy repository database query, which checks whether the attribute has Historian logging configured.
The `IsHistorized` flag originates in the driver's discovery output. For Galaxy it comes from the repository query detecting a `HistoryExtension` primitive (see [drivers/Galaxy-Repository.md](drivers/Galaxy-Repository.md)). For OPC UA Client it is copied from the upstream server's `Historizing` property.
## Configuration
Driver-specific historian config lives in each driver's `DriverConfig` JSON blob, validated against the driver type's `DriverConfigJsonSchema` in `DriverTypeRegistry`. The Galaxy driver's historian section carries the fields exercised by `HistorianConfiguration``ServerName` / `ServerNames`, `FailureCooldownSeconds`, `IntegratedSecurity` / `UserName` / `Password`, `Port` (default `32568`), `CommandTimeoutSeconds`, `RequestTimeoutSeconds`, `MaxValuesPerRead`. The OPC UA Client driver inherits its timeouts from the upstream session.
See [Configuration.md](Configuration.md) for the schema shape and validation path.

View File

@@ -1,121 +1,65 @@
# Incremental Sync
When a Galaxy redeployment is detected, the OPC UA address space must be updated to reflect the new hierarchy and attributes. Rather than tearing down the entire address space and rebuilding from scratch (which disconnects all clients and drops all subscriptions), `LmxNodeManager` performs an incremental sync that identifies changed objects and rebuilds only the affected subtrees.
Two distinct change-detection paths feed the running server: driver-backend rediscovery (Galaxy's `time_of_last_deploy`, TwinCAT's symbol-version-changed, OPC UA Client's upstream namespace change) and generation-level config publishes from the Admin UI. Both flow into re-runs of `ITagDiscovery.DiscoverAsync`, but they originate differently.
## Cached State
## Driver-backend rediscovery — IRediscoverable
`LmxNodeManager` retains shallow copies of the last-published hierarchy and attributes:
Drivers whose backend has a native change signal implement `IRediscoverable` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IRediscoverable.cs`):
```csharp
private List<GalaxyObjectInfo>? _lastHierarchy;
private List<GalaxyAttributeInfo>? _lastAttributes;
```
These are updated at the end of every `BuildAddressSpace` or `SyncAddressSpace` call via `new List<T>(source)` to create independent copies. The copies serve as the baseline for the next diff comparison.
On the first call (when `_lastHierarchy` is null), `SyncAddressSpace` falls through to a full `BuildAddressSpace` since there is no baseline to diff against.
## AddressSpaceDiff
`AddressSpaceDiff` is a static helper class that computes the set of changed Galaxy object IDs between two snapshots.
### FindChangedGobjectIds
This method compares old and new hierarchy+attributes and returns a `HashSet<int>` of gobject IDs that have any difference. It detects three categories of changes:
**Added objects** -- Present in new hierarchy but not in old:
```csharp
foreach (var id in newObjects.Keys)
if (!oldObjects.ContainsKey(id))
changed.Add(id);
```
**Removed objects** -- Present in old hierarchy but not in new:
```csharp
foreach (var id in oldObjects.Keys)
if (!newObjects.ContainsKey(id))
changed.Add(id);
```
**Modified objects** -- Present in both but with different properties. `ObjectsEqual` compares `TagName`, `BrowseName`, `ContainedName`, `ParentGobjectId`, and `IsArea`.
**Attribute set changes** -- For objects that exist in both snapshots, attributes are grouped by `GobjectId` and compared pairwise. `AttributeSetsEqual` sorts both lists by `FullTagReference` and `PrimitiveName`, then checks each pair via `AttributesEqual`, which compares `AttributeName`, `FullTagReference`, `MxDataType`, `IsArray`, `ArrayDimension`, `PrimitiveName`, `SecurityClassification`, `IsHistorized`, and `IsAlarm`. A difference in count or any field mismatch marks the owning gobject as changed.
Objects already marked as changed by hierarchy comparison are skipped during attribute comparison to avoid redundant work.
### ExpandToSubtrees
When a Galaxy object changes, its children must also be rebuilt because they may reference the parent's node or have inherited attribute changes. `ExpandToSubtrees` performs a BFS traversal from each changed ID, adding all descendants:
```csharp
public static HashSet<int> ExpandToSubtrees(HashSet<int> changed,
List<GalaxyObjectInfo> hierarchy)
public interface IRediscoverable
{
var childrenByParent = hierarchy.GroupBy(h => h.ParentGobjectId)
.ToDictionary(g => g.Key, g => g.Select(h => h.GobjectId).ToList());
var expanded = new HashSet<int>(changed);
var queue = new Queue<int>(changed);
while (queue.Count > 0)
{
var id = queue.Dequeue();
if (childrenByParent.TryGetValue(id, out var children))
foreach (var childId in children)
if (expanded.Add(childId))
queue.Enqueue(childId);
}
return expanded;
event EventHandler<RediscoveryEventArgs>? OnRediscoveryNeeded;
}
public sealed record RediscoveryEventArgs(string Reason, string? ScopeHint);
```
The expansion runs against both the old and new hierarchy. This is necessary because a removed parent's children appear in the old hierarchy (for teardown) while an added parent's children appear in the new hierarchy (for construction).
The driver fires the event with a reason string (for the diagnostic log) and an optional scope hint — a non-null hint lets Core scope the rebuild surgically to that subtree; null means "the whole address space may have changed".
## SyncAddressSpace Flow
Drivers that implement the capability today:
`SyncAddressSpace` orchestrates the incremental update inside the OPC UA framework `Lock`:
- **Galaxy** — polls `galaxy.time_of_last_deploy` in the Galaxy repository DB and fires on change. This is Galaxy-internal change detection, not the platform-wide mechanism.
- **TwinCAT** — observes ADS symbol-version-changed notifications (`0x0702`).
- **OPC UA Client** — subscribes to the upstream server's `Server/NamespaceArray` change notifications.
1. **Diff** -- Call `FindChangedGobjectIds` with the cached and new snapshots. If no changes are detected, update the cached snapshots and return early.
Static drivers (Modbus, S7, AB CIP, AB Legacy, FOCAS) do not implement `IRediscoverable` — their tags only change when a new generation is published from the Config DB. Core sees absence of the interface and skips change-detection wiring for those drivers (decision #54).
2. **Expand** -- Call `ExpandToSubtrees` on both old and new hierarchies to include descendant objects.
## Config-DB generation publishes
3. **Snapshot subscriptions** -- Before teardown, iterate `_gobjectToTagRefs` for each changed gobject ID and record the current MXAccess subscription ref-counts. These are needed to restore subscriptions after rebuild.
Tag-set changes authored in the Admin UI (UNS edits, CSV imports, driver-config edits) accumulate in a draft generation and commit via `sp_PublishGeneration`. The delta between the currently-published generation and the proposed next one is computed by `sp_ComputeGenerationDiff`, which drives:
4. **Teardown** -- Call `TearDownGobjects` to remove the old nodes and clean up tracking state.
- The **DiffViewer** in Admin (`src/ZB.MOM.WW.OtOpcUa.Admin/Components/Pages/Clusters/DiffViewer.razor`) so operators can preview what will change before clicking Publish.
- The 409-on-stale-draft flow (decision #161) — a UNS drag-reorder preview carries a `DraftRevisionToken` so Confirm returns `409 Conflict / refresh-required` if the draft advanced between preview and commit.
5. **Rebuild** -- Filter the new hierarchy and attributes to only the changed gobject IDs, then call `BuildSubtree` to create the replacement nodes.
After publish, the server's generation applier invokes `IDriver.ReinitializeAsync(driverConfigJson, ct)` on every driver whose `DriverInstance.DriverConfig` row changed in the new generation. Reinitialize is the in-process recovery path for Tier A/B drivers; if it fails the driver is marked `DriverState.Faulted` and its nodes go Bad quality — but the server process stays running. See `docs/v2/driver-stability.md`.
6. **Restore subscriptions** -- For each previously subscribed tag reference that still exists in `_tagToVariableNode` after rebuild, re-open the MXAccess subscription and restore the original ref-count.
Drivers whose discovery depends on Config DB state (Modbus register maps, S7 DBs, AB CIP tag lists) re-run their discovery inside `ReinitializeAsync`; Core then diffs the new node set against the current address space.
7. **Update cache** -- Replace `_lastHierarchy` and `_lastAttributes` with shallow copies of the new data.
## Rebuild flow
## TearDownGobjects
When a rediscovery is triggered (by either source), `GenericDriverNodeManager` re-runs `ITagDiscovery.DiscoverAsync` into the same `CapturingBuilder` it used at first build. The new node set is diffed against the current:
`TearDownGobjects` removes all OPC UA nodes and tracking state for a set of gobject IDs:
1. **Diff** — full-name comparison of the new `DriverAttributeInfo` set against the existing `_variablesByFullRef` map. Added / removed / modified references are partitioned.
2. **Snapshot subscriptions** — before teardown, Core captures the current monitored-item ref-counts for every affected reference so subscriptions can be replayed after rebuild.
3. **Teardown** — removed / modified variable nodes are deleted via `CustomNodeManager2.DeleteNode`. Driver-side subscriptions for those references are unwound via `ISubscribable.UnsubscribeAsync`.
4. **Rebuild** — added / modified references get fresh `BaseDataVariableState` nodes via the standard `IAddressSpaceBuilder.Variable(...)` path. Alarm-flagged references re-register their `IAlarmConditionSink` through `CapturingBuilder`.
5. **Restore subscriptions** — for every captured reference that still exists after rebuild, Core re-opens the driver subscription and restores the original ref-count.
For each gobject ID, it processes the associated tag references from `_gobjectToTagRefs`:
Exceptions during teardown are swallowed per decision #12 — a driver throw must not leave the node tree half-deleted.
1. **Unsubscribe** -- If the tag has an active MXAccess subscription (entry in `_subscriptionRefCounts`), call `UnsubscribeAsync` and remove the ref-count entry.
## Scope hint
2. **Remove alarm tracking** -- Find any `_alarmInAlarmTags` entries whose `SourceTagReference` matches the tag. For each, unsubscribe the InAlarm, Priority, and DescAttrName tags, then remove the alarm entry.
When `RediscoveryEventArgs.ScopeHint` is non-null (e.g. a folder path), Core restricts the diff to that subtree. This matters for Galaxy Platform-scoped deployments where a `time_of_last_deploy` advance may only affect one platform's subtree, and for OPC UA Client where an upstream change may be localized. Null scope falls back to a full-tree diff.
3. **Delete variable node** -- Call `DeleteNode` on the variable's `NodeId`, remove from `_tagToVariableNode`, clean up `_nodeIdToTagReference` and `_tagMetadata`, and decrement `VariableNodeCount`.
## Active subscriptions survive rebuild
4. **Delete object/folder node** -- Remove the gobject's entry from `_nodeMap` and call `DeleteNode`. Non-folder nodes decrement `ObjectNodeCount`.
Subscriptions for unchanged references stay live across rebuilds — their ref-count map is not disturbed. Clients monitoring a stable tag never see a data-change gap during a deploy, only clients monitoring a tag that was genuinely removed see the subscription drop.
All MXAccess calls and `DeleteNode` calls are wrapped in try/catch with ignored exceptions, since teardown must complete even if individual cleanup steps fail.
## Key source files
## BuildSubtree
`BuildSubtree` creates OPC UA nodes for a subset of the Galaxy hierarchy, reusing existing parent nodes from `_nodeMap`.
The method first topologically sorts the input hierarchy (same `TopologicalSort` used by `BuildAddressSpace`) to ensure parents are created before children. For each object:
1. **Find parent** -- Look up `ParentGobjectId` in `_nodeMap`. If the parent was not part of the changed set, it already exists from the previous build. If no parent is found, fall back to the root `ZB` folder. This is the key difference from `BuildAddressSpace` -- subtree builds reuse the existing node tree rather than starting from the root.
2. **Create node** -- Areas become `FolderState` with `Organizes` reference; non-areas become `BaseObjectState` with `HasComponent` reference. The node is added to `_nodeMap`.
3. **Create variable nodes** -- Attributes are processed with the same primitive-grouping logic as `BuildAddressSpace`, creating `BaseDataVariableState` nodes via `CreateAttributeVariable`.
4. **Alarm tracking** -- If `_alarmTrackingEnabled` is set, alarm attributes are detected and `AlarmConditionState` nodes are created using the same logic as the full build. EventNotifier flags are set on parent nodes, and alarm tags are auto-subscribed.
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IRediscoverable.cs` — backend-change capability
- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs` — discovery orchestration
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriver.cs``ReinitializeAsync` contract
- `src/ZB.MOM.WW.OtOpcUa.Admin/Services/GenerationService.cs` — publish-flow driver
- `docs/v2/config-db-schema.md``sp_PublishGeneration` + `sp_ComputeGenerationDiff`
- `docs/v2/admin-ui.md` — DiffViewer + draft-revision-token flow

View File

@@ -1,166 +0,0 @@
# MXAccess Bridge
The MXAccess bridge connects the OPC UA server to the AVEVA System Platform runtime through the `ArchestrA.MxAccess` COM API. It handles all COM threading requirements, translates between OPC UA read/write requests and MXAccess operations, and manages connection health.
## STA Thread Requirement
MXAccess is a COM-based API that requires a Single-Threaded Apartment (STA). All COM objects -- `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls -- must execute on the same STA thread. Calling COM objects from the wrong thread causes marshalling failures or silent data corruption.
`StaComThread` provides a dedicated STA thread with the apartment state set before the thread starts:
```csharp
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
```
Work items are queued via `RunAsync(Action)` or `RunAsync<T>(Func<T>)`, which enqueue the work to a `ConcurrentQueue<Action>` and post a `WM_APP` message to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread.
## Win32 Message Pump
COM callbacks (like `OnDataChange`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump using P/Invoke:
1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
2. `GetMessage` blocks until a message arrives
3. `WM_APP` messages drain the work queue
4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
5. All other messages are passed through `TranslateMessage`/`DispatchMessage` for COM callback delivery
Without this message pump, MXAccess COM callbacks would never fire and the server would receive no live data.
## LMXProxyServer COM Object
`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface. This abstraction allows unit tests to substitute a fake proxy without requiring the ArchestrA runtime.
The COM object lifecycle:
1. **`Register(clientName)`** -- Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, and calls `Register` to obtain a connection handle
2. **`Unregister(handle)`** -- Unwires event handlers, calls `Unregister`, and releases the COM object via `Marshal.ReleaseComObject`
## Register/AddItem/AdviseSupervisory Pattern
Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:
1. **`AddItem(handle, address)`** -- Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
2. **`AdviseSupervisory(handle, itemHandle)`** -- Subscribes the item for supervisory data change callbacks
3. The runtime begins delivering `OnDataChange` events for the item
For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value to the runtime. The `OnWriteComplete` callback confirms or rejects the write.
Cleanup reverses the pattern: `UnAdviseSupervisory` then `RemoveItem`.
## OnDataChange and OnWriteComplete Callbacks
### OnDataChange
Fired by the COM runtime on the STA thread when a subscribed tag value changes. The handler in `MxAccessClient.EventHandlers.cs`:
1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
2. Maps the MXAccess quality code to the internal `Quality` enum
3. Checks `MXSTATUS_PROXY` for error details and adjusts quality accordingly
4. Converts the timestamp to UTC
5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
- The stored per-tag subscription callback
- Any pending one-shot read completions
- The global `OnTagValueChanged` event (consumed by `LmxNodeManager`)
### OnWriteComplete
Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource<bool>` for the item handle. If `MXSTATUS_PROXY.success == 0`, the write is considered failed and the error detail is logged.
## Reconnection Logic
`MxAccessClient` implements automatic reconnection through two mechanisms:
### Monitor loop
`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle:
- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
- If connected and a probe tag is configured, it checks the probe staleness threshold
### Reconnect sequence
`ReconnectAsync` performs a full disconnect-then-connect cycle:
1. Increment the reconnect counter
2. `DisconnectAsync` -- Tears down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detaches COM event handlers, calls `Unregister`, and clears all handle mappings
3. `ConnectAsync` -- Creates a fresh `LMXProxyServer`, registers, replays all stored subscriptions, and re-subscribes the probe tag
Stored subscriptions (`_storedSubscriptions`) persist across reconnects. When `ConnectAsync` succeeds, `ReplayStoredSubscriptionsAsync` iterates all stored entries and calls `AddItem` + `AdviseSupervisory` for each.
## Probe Tag Health Monitoring
A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange` callback.
The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`. If the probe value has not updated within the threshold, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
## Per-Host Runtime Status Probes (`<Host>.ScanState`)
Separate from the connection-level probe above, the bridge advises `<HostName>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the dashboard can report "this specific Platform / AppEngine is off scan" and the bridge can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MxAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](Configuration.md#mxaccess) for the two config fields.
### How it works
`GalaxyRuntimeProbeManager` is owned by `LmxNodeManager` and operates on a simple three-state machine per host (Unknown / Running / Stopped):
1. **Discovery** — After `BuildAddressSpace` completes, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are bridge-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means **Stopped**.
3. **On-change-only delivery**`ScanState` is delivered **only when the value actually changes**. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped."
5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Without this rollback, a failed subscribe would leave the entry in `Unknown` forever, and `Tick()` would later transition it to `Stopped` after the unknown-resolution timeout, fanning out a **false-negative** host-down signal that invalidates the subtree of a host that was never actually advised. Stability review 2026-04-13 Finding 1.
### Subtree quality invalidation on transition
When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. The reverse happens on **Stopped → Running**: `ClearHostVariablesBadQuality` resets each to `Good` and lets subsequent on-change MxAccess updates repopulate the values.
The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform ends up in **both** the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
### Read-path short-circuit (`IsTagUnderStoppedHost`)
`LmxNodeManager.Read` override is called by the OPC UA SDK for both direct Read requests and monitored-item sampling. It previously called `_mxAccessClient.ReadAsync(tagRef)` unconditionally and returned whatever VTQ the runtime reported. That created a gap: MxAccess happily serves the last cached value as Good on a tag whose hosting Engine has gone off scan.
The Read override now checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]``GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MxAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MxAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
### Deferred dispatch: the STA deadlock
**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the `LmxNodeManager.Lock`, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern.
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — **outside** any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural `Lock` acquisition. No circular wait, no STA dispatch involvement.
See the `runtimestatus.md` plan file and the `service_info.md` entry for the in-flight debugging that led to this pattern.
### Dashboard + health surface
- Dashboard **Galaxy Runtime** panel between Galaxy Info and Historian shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MxAccess transport disconnected).
- Subscriptions panel gains a `Probes: N (bridge-owned runtime status)` line when at least one probe is active, so operators can distinguish bridge-owned probe count from client-driven subscriptions.
- `HealthCheckService.CheckHealth` Rule 2e rolls overall health to `Degraded` when any host is Stopped, ordered after the MxAccess-transport check (Rule 1) so a transport outage stays `Unhealthy` without double-messaging.
See [Status Dashboard](StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](Configuration.md#mxaccess) for the two new config fields.
## Request Timeout Safety Backstop
Every sync-over-async site on the OPC UA stack thread that calls into MxAccess (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). This is a backstop: `MxAccessClient.Read/Write` already enforce inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path. The outer wrapper exists so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because MxAccess clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.
`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
## Why Marshal.ReleaseComObject Is Needed
The .NET runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately release the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/StaComThread.cs` -- STA thread and Win32 message pump
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.cs` -- Core client class (partial)
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Connection.cs` -- Connect, disconnect, reconnect
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Subscription.cs` -- Subscribe, unsubscribe, replay
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs` -- Read and write operations
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.EventHandlers.cs` -- OnDataChange and OnWriteComplete handlers
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs` -- Background health monitor
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxProxyAdapter.cs` -- COM object wrapper
- `src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` -- Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/GalaxyRuntimeStatus.cs` -- Per-host DTO
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/GalaxyRuntimeState.cs` -- `Unknown` / `Running` / `Stopped` enum
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/IMxAccessClient.cs` -- Client interface

View File

@@ -1,137 +1,88 @@
# OPC UA Server
The OPC UA server component hosts the Galaxy-backed namespace on a configurable TCP endpoint and exposes deployed System Platform objects and attributes to OPC UA clients.
The OPC UA server component (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs`) hosts the OPC UA stack and exposes one browsable subtree per registered driver. The server itself is driver-agnostic — Galaxy/MXAccess, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, and OPC UA Client are all plugged in as `IDriver` implementations via the capability interfaces in `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/`.
## Composition
`OtOpcUaServer` subclasses the OPC Foundation `StandardServer` and wires:
- A `DriverHost` (`src/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs`) which registers drivers and holds the per-instance `IDriver` references.
- One `DriverNodeManager` per registered driver (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`), constructed in `CreateMasterNodeManager`. Each manager owns its own namespace URI (`urn:OtOpcUa:{DriverInstanceId}`) and exposes the driver as a subtree under the standard `Objects` folder.
- A `CapabilityInvoker` (`src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs`) per driver instance, keyed on `(DriverInstanceId, HostName, DriverCapability)` against the shared `DriverResiliencePipelineBuilder`. Every Read/Write/Discovery/Subscribe/HistoryRead/AlarmSubscribe call on the driver flows through this invoker so the Polly pipeline (retry / timeout / breaker / bulkhead) applies. The OTOPCUA0001 Roslyn analyzer enforces the wrapping at compile time.
- An `IUserAuthenticator` (LDAP in production, injected stub in tests) for `UserName` token validation in the `ImpersonateUser` hook.
- Optional `AuthorizationGate` + `NodeScopeResolver` (Phase 6.2) that sit in front of every dispatch call. In lax mode the gate passes through when the identity lacks LDAP groups so existing integration tests keep working; strict mode (`Authorization:StrictMode = true`) denies those cases.
`OtOpcUaServer.DriverNodeManagers` exposes the materialized list so the hosting layer can walk each one post-start and call `GenericDriverNodeManager.BuildAddressSpaceAsync(manager)` — the manager is passed as its own `IAddressSpaceBuilder`.
## Configuration
`OpcUaConfiguration` defines the server endpoint and session settings. All properties have sensible defaults:
Server wiring used to live in `appsettings.json`. It now flows from the SQL Server **Config DB**: `ServerInstance` + `DriverInstance` + `Tag` + `NodeAcl` rows are published as a *generation* via `sp_PublishGeneration` and loaded into the running process by the generation applier. The Admin UI (Blazor Server, `docs/v2/admin-ui.md`) is the operator surface — drafts accumulate edits; `sp_ComputeGenerationDiff` drives the DiffViewer preview; a UNS drag-reorder carries a `DraftRevisionToken` so Confirm re-checks against the current draft and returns 409 if it advanced (decision #161). See `docs/v2/config-db-schema.md` for the schema.
| Property | Default | Description |
|----------|---------|-------------|
| `BindAddress` | `0.0.0.0` | IP address or hostname the server binds to |
| `Port` | `4840` | TCP port the server listens on |
| `EndpointPath` | `/LmxOpcUa` | URI path appended to the base address |
| `ServerName` | `LmxOpcUa` | Application name presented to clients |
| `GalaxyName` | `ZB` | Galaxy name used in the namespace URI |
| `MaxSessions` | `100` | Maximum concurrent client sessions |
| `SessionTimeoutMinutes` | `30` | Idle session timeout |
| `AlarmTrackingEnabled` | `false` | Enables `AlarmConditionState` nodes for alarm attributes |
| `AlarmFilter.ObjectFilters` | `[]` | Wildcard template-name patterns that scope alarm tracking to matching objects and their descendants (see [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter)) |
Environmental knobs that aren't per-tenant (bind address, port, PKI path) still live in `appsettings.json` on the Server project; everything tenant-scoped moved to the Config DB.
The resulting endpoint URL is `opc.tcp://{BindAddress}:{Port}{EndpointPath}`, e.g., `opc.tcp://0.0.0.0:4840/LmxOpcUa`.
## Transport
The namespace URI follows the pattern `urn:{GalaxyName}:LmxOpcUa` and is used as the `ProductUri`. The `ApplicationUri` can be set independently via `OpcUa.ApplicationUri` to support redundant deployments where each instance needs a unique identity. When `ApplicationUri` is null, it defaults to the namespace URI.
The server binds one TCP endpoint per `ServerInstance` (default `opc.tcp://0.0.0.0:4840`). The `ApplicationConfiguration` is built programmatically in the `OpcUaApplicationHost` — there are no UA XML files. Security profiles (`None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`) are resolved from the `ServerInstance.Security` JSON at startup; the default profile is still `None` for backward compatibility. User token policies (`Anonymous`, `UserName`) are attached based on whether LDAP is configured. See `docs/security.md` for hardening.
## Programmatic ApplicationConfiguration
## Session impersonation
`OpcUaServerHost` builds the entire `ApplicationConfiguration` in code. There are no XML configuration files. This keeps deployment simple on factory floor machines where editing XML is error-prone.
`OtOpcUaServer.OnImpersonateUser` handles the three token types:
The configuration covers:
- `AnonymousIdentityToken` → default anonymous `UserIdentity`.
- `UserNameIdentityToken``IUserAuthenticator.AuthenticateAsync` validates the credential (`LdapUserAuthenticator` in production). On success, the resolved display name + LDAP-derived roles are wrapped in a `RoleBasedIdentity` that implements `IRoleBearer`. `DriverNodeManager.OnWriteValue` reads these roles via `context.UserIdentity is IRoleBearer` and applies `WriteAuthzPolicy` per write.
- Anything else → `BadIdentityTokenInvalid`.
- **ServerConfiguration** -- base address, session limits, security policies, and user token policies
- **SecurityConfiguration** -- certificate store paths under `%LOCALAPPDATA%\OPC Foundation\pki\`, auto-accept enabled
- **TransportQuotas** -- 4 MB max message/string/byte-string size, 120-second operation timeout, 1-hour security token lifetime
- **TraceConfiguration** -- OPC Foundation SDK tracing is disabled (output path `null`, trace masks `0`); all logging goes through Serilog instead
The Phase 6.2 `AuthorizationGate` runs on top of this baseline: when configured it consults the cluster's permission trie (loaded from `NodeAcl` rows) using the session's `UserAuthorizationState` and can deny Read / HistoryRead / Write / Browse independently per tag. See `docs/v2/acl-design.md`.
## Security Profiles
## Dispatch
The server supports configurable transport security profiles controlled by the `Security` section in `appsettings.json`. The default configuration exposes only `MessageSecurityMode.None` for backward compatibility.
Every service call the stack hands to `DriverNodeManager` is translated to the driver's capability interface and routed through `CapabilityInvoker`:
Supported Phase 1 profiles:
| Profile Name | SecurityPolicy URI | MessageSecurityMode |
| Service | Capability | Invoker method |
|---|---|---|
| `None` | `SecurityPolicy#None` | `None` |
| `Basic256Sha256-Sign` | `SecurityPolicy#Basic256Sha256` | `Sign` |
| `Basic256Sha256-SignAndEncrypt` | `SecurityPolicy#Basic256Sha256` | `SignAndEncrypt` |
| Read | `IReadable.ReadAsync` | `ExecuteAsync(DriverCapability.Read, host, …)` |
| Write | `IWritable.WriteAsync` | `ExecuteWriteAsync(host, isIdempotent, …)` — honors `WriteIdempotentAttribute` (#143) |
| CreateMonitoredItems / DeleteMonitoredItems | `ISubscribable.SubscribeAsync/UnsubscribeAsync` | `ExecuteAsync(DriverCapability.Subscribe, host, …)` |
| HistoryRead (raw / processed / at-time / events) | `IHistoryProvider.*Async` | `ExecuteAsync(DriverCapability.HistoryRead, host, …)` |
| ConditionRefresh / Acknowledge | `IAlarmSource.*Async` | via `AlarmSurfaceInvoker` (fans out per host) |
`SecurityProfileResolver` maps configured profile names to `ServerSecurityPolicy` instances at startup. Unknown names are skipped with a warning, and an empty or invalid list falls back to `None`.
For production deployments, configure `["Basic256Sha256-SignAndEncrypt"]` or `["None", "Basic256Sha256-SignAndEncrypt"]` and set `AutoAcceptClientCertificates` to `false`. See the [Security Guide](security.md) for hardening details.
The host name fed to the invoker comes from `IPerCallHostResolver.ResolveHost(fullReference)` when the driver implements it (multi-host drivers: AB CIP, Modbus with per-device options). Single-host drivers fall back to `DriverInstanceId`, preserving pre-Phase-6.1 pipeline-key semantics (decision #144).
## Redundancy
When `Redundancy.Enabled = true`, `LmxOpcUaServer` exposes the standard OPC UA redundancy nodes on startup:
`Redundancy.Enabled = true` on the `ServerInstance` activates the `RedundancyCoordinator` + `ServiceLevelCalculator` (`src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`). Standard OPC UA redundancy nodes (`Server/ServerRedundancy/RedundancySupport`, `ServerUriArray`, `Server/ServiceLevel`) are populated on startup; `ServiceLevel` recomputes whenever any driver's `DriverHealth` changes. The apply-lease mechanism prevents two instances from concurrently applying a generation. See `docs/Redundancy.md`.
- `Server/ServerRedundancy/RedundancySupport` — set to `Warm` or `Hot` based on configuration
- `Server/ServerRedundancy/ServerUriArray` — populated with the configured `ServerUris`
- `Server/ServiceLevel` — computed dynamically from role and runtime health
## Server class hierarchy
The `ServiceLevel` is updated whenever MXAccess connection state changes or Galaxy DB health changes. See [Redundancy Guide](Redundancy.md) for full details.
### OtOpcUaServer extends StandardServer
### User token policies
- **`CreateMasterNodeManager`** — Iterates `_driverHost.RegisteredDriverIds`, builds one `DriverNodeManager` per driver with its own `CapabilityInvoker` + resilience options (tier from `DriverTypeRegistry`, per-instance JSON overrides from `DriverInstance.ResilienceConfig` via `DriverResilienceOptionsParser`). The managers are wrapped in a `MasterNodeManager` with no additional core managers.
- **`OnServerStarted`** — Hooks `SessionManager.ImpersonateUser` for LDAP auth. Redundancy + server-capability population happens via `OpcUaApplicationHost`.
- **`LoadServerProperties`** — Manufacturer `OtOpcUa`, Product `OtOpcUa.Server`, ProductUri `urn:OtOpcUa:Server`.
`UserTokenPolicies` are dynamically configured based on the `Authentication` settings in `appsettings.json`:
### ServerCapabilities
- An `Anonymous` user token policy is added when `AllowAnonymous` is `true` (the default).
- A `UserName` user token policy is added when an authentication provider is configured (LDAP or injected).
Both policies can be active simultaneously, allowing clients to connect with or without credentials.
### Session impersonation
When a client presents `UserName` credentials, the server validates them through `IUserAuthenticationProvider`. If the provider also implements `IRoleProvider` (as `LdapAuthenticationProvider` does), LDAP group membership is resolved once during authentication and mapped to custom OPC UA role `NodeId`s in a dedicated `urn:zbmom:lmxopcua:roles` namespace. These role NodeIds are added to the session's `RoleBasedIdentity.GrantedRoleIds`.
Anonymous sessions receive `WellKnownRole_Anonymous`. Authenticated sessions receive `WellKnownRole_AuthenticatedUser` plus any LDAP-derived role NodeIds. Permission checks in `LmxNodeManager` inspect `GrantedRoleIds` directly — no username extraction or side-channel cache is needed.
`AnonymousCanWrite` controls whether anonymous sessions can write, regardless of whether LDAP is enabled.
`OpcUaApplicationHost` populates `Server/ServerCapabilities` with `StandardUA2017`, `en` locale, 100 ms `MinSupportedSampleRate`, 4 MB message caps, and per-operation limits (1000 per Read/Write/Browse/TranslateBrowsePaths/MonitoredItems/HistoryRead; 0 for MethodCall/NodeManagement/HistoryUpdate).
## Certificate handling
On startup, `OpcUaServerHost.StartAsync` calls `CheckApplicationInstanceCertificate(false, minKeySize)` to locate or create a self-signed certificate meeting the configured minimum key size (default 2048). The certificate subject defaults to `CN={ServerName}, O=ZB MOM, DC=localhost` but can be overridden via `Security.CertificateSubject`. Certificate stores use the directory-based store type under the configured `Security.PkiRootPath` (default `%LOCALAPPDATA%\OPC Foundation\pki\`):
Certificate stores default to `%LOCALAPPDATA%\OPC Foundation\pki\` (directory-based):
| Store | Path suffix |
|-------|-------------|
|---|---|
| Own | `pki/own` |
| Trusted issuers | `pki/issuer` |
| Trusted peers | `pki/trusted` |
| Rejected | `pki/rejected` |
`AutoAcceptUntrustedCertificates` is controlled by `Security.AutoAcceptClientCertificates` (default `true`). Set to `false` in production to enforce client certificate trust. When `RejectSHA1Certificates` is `true` (default), client certificates signed with SHA-1 are rejected. Certificate validation events are logged for visibility into accepted and rejected client connections.
## Server class hierarchy
### LmxOpcUaServer extends StandardServer
`LmxOpcUaServer` inherits from the OPC Foundation `StandardServer` base class and overrides two methods:
- **`CreateMasterNodeManager`** -- Instantiates `LmxNodeManager` with the Galaxy namespace URI, the `IMxAccessClient` for runtime I/O, performance metrics, and an optional `IHistorianDataSource` (supplied by the runtime-loaded historian plugin, see [Historical Data Access](HistoricalDataAccess.md)). The node manager is wrapped in a `MasterNodeManager` with no additional core node managers.
- **`OnServerStarted`** -- Configures redundancy, history capabilities, and server capabilities at startup. Called after the server is fully initialized.
- **`LoadServerProperties`** -- Returns server metadata: manufacturer `ZB MOM`, product `LmxOpcUa Server`, and the assembly version as the software version.
### ServerCapabilities
`ConfigureServerCapabilities` populates the `ServerCapabilities` node at startup:
- **ServerProfileArray** -- `StandardUA2017`
- **LocaleIdArray** -- `en`
- **MinSupportedSampleRate** -- 100ms
- **MaxBrowseContinuationPoints** -- 100
- **MaxHistoryContinuationPoints** -- 100
- **MaxArrayLength** -- 65535
- **MaxStringLength / MaxByteStringLength** -- 4MB
- **OperationLimits** -- 1000 nodes per Read/Write/Browse/RegisterNodes/TranslateBrowsePaths/MonitoredItems/HistoryRead; 0 for MethodCall/NodeManagement/HistoryUpdate (not supported)
- **ServerDiagnostics.EnabledFlag** -- `true` (SDK tracks session/subscription counts automatically)
### Session tracking
`LmxOpcUaServer` exposes `ActiveSessionCount` by querying `ServerInternal.SessionManager.GetSessions().Count`. `OpcUaServerHost` surfaces this for status reporting.
## Startup and Shutdown
`OpcUaServerHost.StartAsync` performs the following sequence:
1. Build `ApplicationConfiguration` programmatically
2. Validate the configuration via `appConfig.Validate(ApplicationType.Server)`
3. Create `ApplicationInstance` and check/create the application certificate
4. Instantiate `LmxOpcUaServer` and start it via `ApplicationInstance.Start`
`OpcUaServerHost.Stop` calls `_server.Stop()` and nulls both the server and application instance references. The class implements `IDisposable`, delegating to `Stop`.
`Security.AutoAcceptClientCertificates` (default `true`) and `RejectSHA1Certificates` (default `true`) are honored. The server certificate is always created — even for `None`-only deployments — because `UserName` token encryption needs it.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/OpcUaServerHost.cs` -- Application lifecycle and programmatic configuration
- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/LmxOpcUaServer.cs` -- StandardServer subclass and node manager creation
- `src/ZB.MOM.WW.OtOpcUa.Host/OpcUa/SecurityProfileResolver.cs` -- Profile-name to ServerSecurityPolicy mapping
- `src/ZB.MOM.WW.OtOpcUa.Host/Configuration/OpcUaConfiguration.cs` -- Configuration POCO
- `src/ZB.MOM.WW.OtOpcUa.Host/Configuration/SecurityProfileConfiguration.cs` -- Security configuration POCO
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs` `StandardServer` subclass + `ImpersonateUser` hook
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — per-driver `CustomNodeManager2` + dispatch surface
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs` — programmatic `ApplicationConfiguration` + lifecycle
- `src/ZB.MOM.WW.OtOpcUa.Core/Hosting/DriverHost.cs` — driver registration
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs` — Polly pipeline entry point
- `src/ZB.MOM.WW.OtOpcUa.Core/Authorization/` — Phase 6.2 permission trie + evaluator
- `src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs` — stack-to-evaluator bridge

83
docs/README.md Normal file
View File

@@ -0,0 +1,83 @@
# OtOpcUa documentation
Two tiers of documentation live here:
- **Current reference** at the top level (`docs/*.md`) — describes what's shipped today. Start here for operator + integrator reference.
- **Implementation history + design notes** at `docs/v2/*.md` — the authoritative plan + decision log the current reference is built from. Start here when you need the *why* behind an architectural choice, or when a top-level doc says "see plan.md § X".
The project was originally called **LmxOpcUa** (a single-driver Galaxy/MXAccess OPC UA server) and has since become **OtOpcUa**, a multi-driver OPC UA server platform. Any lingering `LmxOpcUa`-string in a path you see in docs is a deliberate residual (executable name `lmxopcua-cli`, client PKI folder `{LocalAppData}/LmxOpcUaClient/`) — fixing those requires migration shims + is tracked as follow-ups.
## Platform overview
- **Core** owns the OPC UA stack, address space, session/security/subscription machinery.
- **Drivers** plug in via capability interfaces in `ZB.MOM.WW.OtOpcUa.Core.Abstractions`: `IDriver`, `IReadable`, `IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`, `IAlarmSource`, `IHistoryProvider`, `IPerCallHostResolver`. Each driver opts into whichever it supports.
- **Server** is the OPC UA endpoint process (net10, x64). Hosts every driver except Galaxy in-process; talks to Galaxy via a named pipe because MXAccess COM is 32-bit-only.
- **Admin** is the Blazor Server operator UI (net10, x64). Owns the Config DB draft/publish flow, ACL + role-grant authoring, fleet status + `/metrics` scrape endpoint.
- **Galaxy.Host** is a .NET Framework 4.8 x86 Windows service that wraps MXAccess COM on an STA thread for the Galaxy driver.
## Where to find what
### Architecture + data-path reference
| Doc | Covers |
|-----|--------|
| [OpcUaServer.md](OpcUaServer.md) | Top-level server architecture — Core, driver dispatch, Config DB, generations |
| [AddressSpace.md](AddressSpace.md) | `GenericDriverNodeManager` + `ITagDiscovery` + `IAddressSpaceBuilder` |
| [ReadWriteOperations.md](ReadWriteOperations.md) | OPC UA Read/Write → `CapabilityInvoker``IReadable`/`IWritable` |
| [Subscriptions.md](Subscriptions.md) | Monitored items → `ISubscribable` + per-driver subscription refcount |
| [AlarmTracking.md](AlarmTracking.md) | `IAlarmSource` + `AlarmSurfaceInvoker` + OPC UA alarm conditions |
| [DataTypeMapping.md](DataTypeMapping.md) | Per-driver `DriverAttributeInfo` → OPC UA variable types |
| [IncrementalSync.md](IncrementalSync.md) | Address-space rebuild on redeploy + `sp_ComputeGenerationDiff` |
| [HistoricalDataAccess.md](HistoricalDataAccess.md) | `IHistoryProvider` as a per-driver optional capability |
### Drivers
| Doc | Covers |
|-----|--------|
| [drivers/README.md](drivers/README.md) | Index of the seven shipped drivers + capability matrix |
| [drivers/Galaxy.md](drivers/Galaxy.md) | Galaxy driver — MXAccess bridge, Host/Proxy split, named-pipe IPC |
| [drivers/Galaxy-Repository.md](drivers/Galaxy-Repository.md) | Galaxy-specific discovery via the ZB SQL database |
For Modbus / S7 / AB CIP / AB Legacy / TwinCAT / FOCAS / OPC UA Client specifics, see [v2/driver-specs.md](v2/driver-specs.md).
### Operational
| Doc | Covers |
|-----|--------|
| [Configuration.md](Configuration.md) | appsettings bootstrap + Config DB + Admin UI draft/publish |
| [security.md](security.md) | Transport security profiles, LDAP auth, ACL trie, role grants, OTOPCUA0001 analyzer |
| [Redundancy.md](Redundancy.md) | `RedundancyCoordinator`, `ServiceLevelCalculator`, apply-lease, Prometheus metrics |
| [ServiceHosting.md](ServiceHosting.md) | Three-process deploy (Server + Admin + Galaxy.Host) install/uninstall |
| [StatusDashboard.md](StatusDashboard.md) | Pointer — superseded by [v2/admin-ui.md](v2/admin-ui.md) |
### Client tooling
| Doc | Covers |
|-----|--------|
| [Client.CLI.md](Client.CLI.md) | `lmxopcua-cli` — command-line client |
| [Client.UI.md](Client.UI.md) | Avalonia desktop client |
### Requirements
| Doc | Covers |
|-----|--------|
| [reqs/HighLevelReqs.md](reqs/HighLevelReqs.md) | HLRs — numbered system-level requirements |
| [reqs/OpcUaServerReqs.md](reqs/OpcUaServerReqs.md) | OPC UA server-layer reqs |
| [reqs/ServiceHostReqs.md](reqs/ServiceHostReqs.md) | Per-process hosting reqs |
| [reqs/ClientRequirements.md](reqs/ClientRequirements.md) | Client CLI + UI reqs |
| [reqs/GalaxyRepositoryReqs.md](reqs/GalaxyRepositoryReqs.md) | Galaxy-scoped repository reqs |
| [reqs/MxAccessClientReqs.md](reqs/MxAccessClientReqs.md) | Galaxy-scoped MXAccess reqs |
| [reqs/StatusDashboardReqs.md](reqs/StatusDashboardReqs.md) | Pointer — superseded by Admin UI |
## Implementation history (`docs/v2/`)
Design decisions + phase plans + execution notes. Load-bearing cross-references from the top-level docs:
- [v2/plan.md](v2/plan.md) — authoritative v2 vision doc + numbered decision log (referenced as "decision #N" elsewhere)
- [v2/admin-ui.md](v2/admin-ui.md) — Admin UI spec
- [v2/acl-design.md](v2/acl-design.md) — data-plane ACL + permission-trie design (Phase 6.2)
- [v2/config-db-schema.md](v2/config-db-schema.md) — Config DB schema reference
- [v2/driver-specs.md](v2/driver-specs.md) — per-driver addressing + quirks for every shipped protocol
- [v2/dev-environment.md](v2/dev-environment.md) — dev-box bootstrap
- [v2/test-data-sources.md](v2/test-data-sources.md) — integration-test simulator matrix (includes the pinned libplctag `ab_server` version for AB CIP tests)
- [v2/implementation/phase-*-*.md](v2/implementation/) — per-phase execution plans with exit-gate evidence

View File

@@ -1,99 +1,57 @@
# Read/Write Operations
`LmxNodeManager` overrides the OPC UA `Read` and `Write` methods to translate client requests into MXAccess runtime calls. Each override resolves the OPC UA `NodeId` to a Galaxy tag reference, performs the I/O through `IMxAccessClient`, and returns the result with appropriate status codes.
`DriverNodeManager` (`src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs`) wires the OPC UA stack's per-variable `OnReadValue` and `OnWriteValue` hooks to each driver's `IReadable` and `IWritable` capabilities. Every dispatch flows through `CapabilityInvoker` so the Polly pipeline (retry / timeout / breaker / bulkhead) applies uniformly across Galaxy, Modbus, S7, AB CIP, AB Legacy, TwinCAT, FOCAS, and OPC UA Client drivers.
## Read Override
## OnReadValue
The `Read` override in `LmxNodeManager` intercepts value attribute reads for nodes in the Galaxy namespace.
The hook is registered on every `BaseDataVariableState` created by the `IAddressSpaceBuilder.Variable(...)` call during discovery. When the stack dispatches a Read for a node in this namespace:
### Resolution flow
1. If the driver does not implement `IReadable`, the hook returns `BadNotReadable`.
2. The node's `NodeId.Identifier` is used directly as the driver-side full reference — it matches `DriverAttributeInfo.FullName` registered at discovery time.
3. (Phase 6.2) If an `AuthorizationGate` + `NodeScopeResolver` are wired, the gate is consulted first via `IsAllowed(identity, OpcUaOperation.Read, scope)`. A denied read never hits the driver.
4. The call is wrapped by `_invoker.ExecuteAsync(DriverCapability.Read, ResolveHostFor(fullRef), …)`. The resolved host is `IPerCallHostResolver.ResolveHost(fullRef)` for multi-host drivers; single-host drivers fall back to `DriverInstanceId` (decision #144).
5. The first `DataValueSnapshot` from the batch populates the outgoing `value` / `statusCode` / `timestamp`. An empty batch surfaces `BadNoData`; any exception surfaces `BadInternalError`.
1. The base class `Read` runs first, handling non-value attributes (DisplayName, DataType, etc.) through the standard node manager.
2. For each `ReadValueId` where `AttributeId == Attributes.Value`, the override checks whether the node belongs to this namespace (`NamespaceIndex` match).
3. The string-typed `NodeId.Identifier` is looked up in `_nodeIdToTagReference` to find the corresponding `FullTagReference` (e.g., `DelmiaReceiver_001.DownloadPath`).
4. `_mxAccessClient.ReadAsync(tagRef)` retrieves the current value, timestamp, and quality from MXAccess. The async call is synchronously awaited because the OPC UA SDK `Read` override is synchronous.
5. The returned `Vtq` is converted to a `DataValue` via `CreatePublishedDataValue`, which normalizes array values through `NormalizePublishedValue` (substituting a default typed array when the value is null for array nodes).
6. On success, `errors[i]` is set to `ServiceResult.Good`. On exception, the error is set to `BadInternalError`.
The hook is synchronous — the async invoker call is bridged with `AsTask().GetAwaiter().GetResult()` because the OPC UA SDK's value-hook signature is sync. Idempotent-by-construction reads mean this bridge is safe to retry inside the Polly pipeline.
```csharp
if (_nodeIdToTagReference.TryGetValue(nodeIdStr, out var tagRef))
{
var vtq = _mxAccessClient.ReadAsync(tagRef).GetAwaiter().GetResult();
results[i] = CreatePublishedDataValue(tagRef, vtq);
errors[i] = ServiceResult.Good;
}
```
## OnWriteValue
## Write Override
`OnWriteValue` follows the same shape with two additional concerns: authorization and idempotence.
The `Write` override follows a similar pattern but includes access-level enforcement and array element write support.
### Authorization (two layers)
### Access level check
1. **SecurityClassification gate.** Every variable stores its `SecurityClassification` in `_securityByFullRef` at registration time (populated from `DriverAttributeInfo.SecurityClass`). `WriteAuthzPolicy.IsAllowed(classification, userRoles)` runs first, consulting the session's roles via `context.UserIdentity is IRoleBearer`. `FreeAccess` passes anonymously, `ViewOnly` denies everyone, and `Operate / Tune / Configure / SecuredWrite / VerifiedWrite` require `WriteOperate / WriteTune / WriteConfigure` roles respectively. Denial returns `BadUserAccessDenied` without consulting the driver — drivers never enforce ACLs themselves; they only report classification as discovery metadata (feedback `feedback_acl_at_server_layer.md`).
2. **Phase 6.2 permission-trie gate.** When `AuthorizationGate` is wired, it re-runs with the operation derived from `WriteAuthzPolicy.ToOpcUaOperation(classification)`. The gate consults the per-cluster permission trie loaded from `NodeAcl` rows, enforcing fine-grained per-tag ACLs on top of the role-based classification policy. See `docs/v2/acl-design.md`.
The base class `Write` runs first and sets `BadNotWritable` for nodes whose `AccessLevel` does not include `CurrentWrite`. The override skips these nodes:
### Dispatch
```csharp
if (errors[i] != null && errors[i].StatusCode == StatusCodes.BadNotWritable)
continue;
```
`_invoker.ExecuteWriteAsync(host, isIdempotent, callSite, …)` honors the `WriteIdempotentAttribute` semantics per decisions #44-45 and #143:
The `AccessLevel` is set during node creation based on `SecurityClassificationMapper.IsWritable(attr.SecurityClassification)`. Read-only Galaxy attributes (e.g., security classification `FreeRead`) get `AccessLevels.CurrentRead` only.
- `isIdempotent = true` (tag flagged `WriteIdempotent` in the Config DB) → runs through the standard `DriverCapability.Write` pipeline; retry may apply per the tier configuration.
- `isIdempotent = false` (default) → the invoker builds a one-off pipeline with `RetryCount = 0`. A timeout may fire after the device already accepted the pulse / alarm-ack / counter-increment; replay is the caller's decision, not the server's.
### Write flow
The `_writeIdempotentByFullRef` lookup is populated at discovery time from the `DriverAttributeInfo.WriteIdempotent` field.
1. The `NodeId` is resolved to a tag reference via `_nodeIdToTagReference`.
2. The raw value is extracted from `writeValue.Value.WrappedValue.Value`.
3. If the write includes an `IndexRange` (array element write), `TryApplyArrayElementWrite` handles the merge before sending the full array to MXAccess.
4. `_mxAccessClient.WriteAsync(tagRef, value)` sends the value to the Galaxy runtime.
5. On success, `PublishLocalWrite` updates the in-memory node immediately so subscribed clients see the change without waiting for the next MXAccess data change callback.
### Per-write status
### Array element writes via IndexRange
`IWritable.WriteAsync` returns `IReadOnlyList<WriteResult>` — one numeric `StatusCode` per requested write. A non-zero code is surfaced directly to the client; exceptions become `BadInternalError`. The OPC UA stack's pattern of batching per-service is preserved through the full chain.
`TryApplyArrayElementWrite` supports writing individual elements of an array attribute. MXAccess does not support element-level writes, so the method performs a read-modify-write:
## Array element writes
1. Parse the `IndexRange` string as a zero-based integer index. Return `BadIndexRangeInvalid` if parsing fails or the index is negative.
2. Read the current array value from MXAccess via `ReadAsync`.
3. Clone the array and set the element at the target index.
4. `NormalizeIndexedWriteValue` unwraps single-element arrays (OPC UA clients sometimes wrap a scalar in a one-element array).
5. `ConvertArrayElementValue` coerces the value to the array's element type using `Convert.ChangeType`, handling null values by substituting the type's default.
6. The full modified array is written back to MXAccess as a single `WriteAsync` call.
Array-element writes via OPC UA `IndexRange` are driver-specific. The OPC UA stack hands the dispatch an unwrapped `NumericRange` on the `indexRange` parameter of `OnWriteValue`; `DriverNodeManager` passes the full `value` object to `IWritable.WriteAsync` and the driver decides whether to support partial writes. Galaxy performs a read-modify-write inside the Galaxy driver (MXAccess has no element-level writes); other drivers generally accept only full-array writes today.
```csharp
var nextArray = (Array)currentArray.Clone();
nextArray.SetValue(ConvertArrayElementValue(normalizedValue, elementType), index);
updatedArray = nextArray;
```
## HistoryRead
### Role-based write enforcement
`DriverNodeManager.HistoryReadRawModified`, `HistoryReadProcessed`, `HistoryReadAtTime`, and `HistoryReadEvents` route through the driver's `IHistoryProvider` capability with `DriverCapability.HistoryRead`. Drivers without `IHistoryProvider` surface `BadHistoryOperationUnsupported` per node. See `docs/HistoricalDataAccess.md`.
When `AnonymousCanWrite` is `false` in the `Authentication` configuration, the write override enforces role-based access control before dispatching to MXAccess. The check order is:
## Failure isolation
1. The base class `Write` runs first, enforcing `AccessLevel`. Nodes without `CurrentWrite` get `BadNotWritable` and the override skips them.
2. The override checks whether the node is in the Galaxy namespace. Non-namespace nodes are skipped.
3. If `AnonymousCanWrite` is `false`, the override inspects `context.OperationContext.Session` for `GrantedRoleIds`. If the session does not hold `WellKnownRole_AuthenticatedUser`, the error is set to `BadUserAccessDenied` and the write is rejected.
4. If the role check passes (or `AnonymousCanWrite` is `true`), the write proceeds to MXAccess.
Per decision #12, exceptions in the driver's capability call are logged and converted to a per-node `BadInternalError` — they never unwind into the master node manager. This keeps one driver's outage from disrupting sibling drivers in the same server process.
The existing security classification enforcement (ReadOnly nodes getting `BadNotWritable` via `AccessLevel`) still applies first and takes precedence over the role check.
## Key source files
## Value Type Conversion
`CreatePublishedDataValue` wraps the conversion pipeline. `NormalizePublishedValue` checks whether the tag is an array type with a declared `ArrayDimension` and substitutes a default typed array (via `CreateDefaultArrayValue`) when the raw value is null. This prevents OPC UA clients from receiving a null variant for array nodes, which violates the specification for nodes declared with `ValueRank.OneDimension`.
`CreateDefaultArrayValue` uses `MxDataTypeMapper.MapToClrType` to determine the CLR element type, then creates an `Array.CreateInstance` of the declared length. String arrays are initialized with `string.Empty` elements rather than null.
## PublishLocalWrite
After a successful write, `PublishLocalWrite` updates the variable node in memory without waiting for the MXAccess `OnDataChange` callback to arrive:
```csharp
private void PublishLocalWrite(string tagRef, object? value)
{
var dataValue = CreatePublishedDataValue(tagRef, Vtq.Good(value));
variable.Value = dataValue.Value;
variable.StatusCode = dataValue.StatusCode;
variable.Timestamp = dataValue.SourceTimestamp;
variable.ClearChangeMasks(SystemContext, false);
}
```
`ClearChangeMasks` notifies the OPC UA framework that the node value has changed, which triggers data change notifications to any active monitored items. Without this call, subscribed clients would only see the update when the next MXAccess data change event arrives, which could be delayed depending on the subscription interval.
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs``OnReadValue` / `OnWriteValue` hooks
- `src/ZB.MOM.WW.OtOpcUa.Server/Security/WriteAuthzPolicy.cs` — classification-to-role policy
- `src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs` — Phase 6.2 trie gate
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs``ExecuteAsync` / `ExecuteWriteAsync`
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IReadable.cs`, `IWritable.cs`, `WriteIdempotentAttribute.cs`

View File

@@ -2,189 +2,102 @@
## Overview
LmxOpcUa supports OPC UA **non-transparent redundancy** in Warm or Hot mode. In a non-transparent redundancy deployment, two independent server instances run side by side. Both connect to the same Galaxy repository database and the same MXAccess runtime, but each maintains its own OPC UA sessions and subscriptions. Clients discover the redundant set through the `ServerUriArray` exposed in each server's address space and are responsible for managing failover between the two endpoints.
OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two (or more) OtOpcUa Server processes run side-by-side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` that each server publishes.
When redundancy is disabled (the default), the server reports `RedundancySupport.None` and a fixed `ServiceLevel` of 255.
The redundancy surface lives in `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`:
## Namespace vs Application Identity
Both servers in the redundant set share the same **namespace URI** so that clients see identical node IDs regardless of which instance they are connected to. The namespace URI follows the pattern `urn:{GalaxyName}:LmxOpcUa` (e.g., `urn:ZB:LmxOpcUa`).
The **ApplicationUri**, on the other hand, must be unique per instance. This is how the OPC UA stack and clients distinguish one server from the other within the redundant set. Each instance sets its own ApplicationUri via the `OpcUa.ApplicationUri` configuration property (e.g., `urn:localhost:LmxOpcUa:instance1` and `urn:localhost:LmxOpcUa:instance2`).
When redundancy is disabled, `ApplicationUri` defaults to `urn:{GalaxyName}:LmxOpcUa` if left null.
## Configuration
### Redundancy Section
| Property | Type | Default | Description |
|---|---|---|---|
| `Enabled` | bool | `false` | Enables non-transparent redundancy. When false, the server reports `RedundancySupport.None` and `ServiceLevel = 255`. |
| `Mode` | string | `"Warm"` | The redundancy mode advertised to clients. Valid values: `Warm`, `Hot`. |
| `Role` | string | `"Primary"` | This instance's role in the redundant pair. Valid values: `Primary`, `Secondary`. The Primary advertises a higher ServiceLevel than the Secondary when both are healthy. |
| `ServerUris` | string[] | `[]` | The ApplicationUri values of all servers in the redundant set. Must include this instance's own `OpcUa.ApplicationUri`. Should contain at least 2 entries. |
| `ServiceLevelBase` | int | `200` | The base ServiceLevel when the server is fully healthy. Valid range: 1-255. The Secondary automatically receives `ServiceLevelBase - 50`. |
### OpcUa.ApplicationUri
| Property | Type | Default | Description |
|---|---|---|---|
| `ApplicationUri` | string | `null` | Explicit application URI for this server instance. When null, defaults to `urn:{GalaxyName}:LmxOpcUa`. **Required when redundancy is enabled** -- each instance needs a unique identity. |
## ServiceLevel Computation
ServiceLevel is a standard OPC UA diagnostic value (0-255) that indicates server health. Clients in a redundant deployment should prefer the server advertising the highest ServiceLevel.
**Baseline values:**
| Role | Baseline |
| Class | Role |
|---|---|
| Primary | `ServiceLevelBase` (default 200) |
| Secondary | `ServiceLevelBase - 50` (default 150) |
| `RedundancyCoordinator` | Process-singleton; owns the current `RedundancyTopology` loaded from the `ClusterNode` table. `RefreshAsync` re-reads after `sp_PublishGeneration` so operator role swaps take effect without a process restart. CAS-style swap (`Interlocked.Exchange`) means readers always see a coherent snapshot. |
| `RedundancyTopology` | Immutable `(ClusterId, Self, Peers, ServerUriArray, ValidityFlags)` snapshot. |
| `ApplyLeaseRegistry` | Tracks in-progress `sp_PublishGeneration` apply leases keyed on `(ConfigGenerationId, PublishRequestId)`. `await using` the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than `ApplyMaxDuration` (default 10 minutes) so a crashed publisher can't pin the node at `PrimaryMidApply`. |
| `PeerReachabilityTracker` | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP `/healthz`. Both must succeed for `peerReachable = true`. |
| `RecoveryStateManager` | Gates transitions out of the `Recovering*` bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. |
| `ServiceLevelCalculator` | Pure function `(role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte`. |
| `RedundancyStatePublisher` | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA `ServiceLevel` variable via an edge-triggered `OnStateChanged` event, and fires `OnServerUriArrayChanged` when the topology's `ServerUriArray` shifts. |
**Penalties applied to the baseline:**
## Data model
| Condition | Penalty |
Per-node redundancy state lives in the Config DB `ClusterNode` table (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs`):
| Column | Role |
|---|---|
| MXAccess disconnected | -100 |
| Galaxy DB unreachable | -50 |
| Both MXAccess and DB down | ServiceLevel forced to 0 |
| `NodeId` | Unique node identity; matches `Node:NodeId` in the server's bootstrap `appsettings.json`. |
| `ClusterId` | Foreign key into `ServerCluster`. |
| `RedundancyRole` | `Primary`, `Secondary`, or `Standalone` (`RedundancyRole` enum in `Configuration/Enums`). |
| `ServiceLevelBase` | Per-node base value used to bias nominal ServiceLevel output. |
| `ApplicationUri` | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. |
The final value is clamped to the range 0-255.
`ServerUriArray` is derived from the set of peer `ApplicationUri` values at topology-load time and republished when the topology changes.
**Examples (with default ServiceLevelBase = 200):**
## ServiceLevel matrix
| Scenario | Primary | Secondary |
`ServiceLevelCalculator` produces one of the following bands (see `ServiceLevelBand` enum in the same file):
| Band | Byte | Meaning |
|---|---|---|
| Both healthy | 200 | 150 |
| MXAccess down | 100 | 50 |
| DB down | 150 | 100 |
| Both down | 0 | 0 |
| `Maintenance` | 0 | Operator-declared maintenance. |
| `NoData` | 1 | Self-reported unhealthy (`/healthz` fails). |
| `InvalidTopology` | 2 | More than one Primary detected; both nodes self-demote. |
| `RecoveringBackup` | 30 | Backup post-fault, dwell not met. |
| `BackupMidApply` | 50 | Backup inside a publish-apply window. |
| `IsolatedBackup` | 80 | Primary unreachable; Backup says "take over if asked" — does **not** auto-promote (non-transparent model). |
| `AuthoritativeBackup` | 100 | Backup nominal. |
| `RecoveringPrimary` | 180 | Primary post-fault, dwell not met. |
| `PrimaryMidApply` | 200 | Primary inside a publish-apply window. |
| `IsolatedPrimary` | 230 | Primary with unreachable peer, retains authority. |
| `AuthoritativePrimary` | 255 | Primary nominal. |
## Two-Instance Deployment
The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 2..255 so spec-compliant clients that treat "<3 = unhealthy" keep working.
When deploying a redundant pair, the following configuration properties must differ between the two instances. All other settings (GalaxyName, ConnectionString, etc.) are shared.
Standalone nodes (single-instance deployments) report `AuthoritativePrimary` when healthy and `PrimaryMidApply` during publish.
| Property | Instance 1 (Primary) | Instance 2 (Secondary) |
|---|---|---|
| `OpcUa.Port` | 4840 | 4841 |
| `OpcUa.ServerName` | `LmxOpcUa-1` | `LmxOpcUa-2` |
| `OpcUa.ApplicationUri` | `urn:localhost:LmxOpcUa:instance1` | `urn:localhost:LmxOpcUa:instance2` |
| `Dashboard.Port` | 8081 | 8082 |
| `MxAccess.ClientName` | `LmxOpcUa-1` | `LmxOpcUa-2` |
| `Redundancy.Role` | `Primary` | `Secondary` |
## Publish fencing and split-brain prevention
### Instance 1 -- Primary (appsettings.json)
Any Admin-triggered `sp_PublishGeneration` acquires an apply lease through `ApplyLeaseRegistry.BeginApplyLease`. While the lease is held:
```json
{
"OpcUa": {
"Port": 4840,
"ServerName": "LmxOpcUa-1",
"GalaxyName": "ZB",
"ApplicationUri": "urn:localhost:LmxOpcUa:instance1"
},
"MxAccess": {
"ClientName": "LmxOpcUa-1"
},
"Dashboard": {
"Port": 8081
},
"Redundancy": {
"Enabled": true,
"Mode": "Warm",
"Role": "Primary",
"ServerUris": [
"urn:localhost:LmxOpcUa:instance1",
"urn:localhost:LmxOpcUa:instance2"
],
"ServiceLevelBase": 200
}
}
```
- The calculator reports `PrimaryMidApply` / `BackupMidApply` — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation.
- `RedundancyCoordinator.RefreshAsync` is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically.
- The watchdog force-closes any lease older than `ApplyMaxDuration`; a stuck publisher therefore cannot strand a node at `PrimaryMidApply`.
### Instance 2 -- Secondary (appsettings.json)
Because role transitions are **operator-driven** (write `RedundancyRole` in the Config DB + publish), the Backup never auto-promotes. An `IsolatedBackup` at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154).
```json
{
"OpcUa": {
"Port": 4841,
"ServerName": "LmxOpcUa-2",
"GalaxyName": "ZB",
"ApplicationUri": "urn:localhost:LmxOpcUa:instance2"
},
"MxAccess": {
"ClientName": "LmxOpcUa-2"
},
"Dashboard": {
"Port": 8082
},
"Redundancy": {
"Enabled": true,
"Mode": "Warm",
"Role": "Secondary",
"ServerUris": [
"urn:localhost:LmxOpcUa:instance1",
"urn:localhost:LmxOpcUa:instance2"
],
"ServiceLevelBase": 200
}
}
```
## Metrics
## CLI `redundancy` Command
`RedundancyMetrics` in `src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs` registers the `ZB.MOM.WW.OtOpcUa.Redundancy` meter on the Admin process. Instruments:
The Client CLI includes a `redundancy` command that reads the redundancy state from a running server.
| Name | Kind | Tags | Description |
|---|---|---|---|
| `otopcua.redundancy.role_transition` | Counter<long> | `cluster.id`, `node.id`, `from_role`, `to_role` | Incremented every time `FleetStatusPoller` observes a `RedundancyRole` change on a `ClusterNode` row. |
| `otopcua.redundancy.primary_count` | ObservableGauge<long> | `cluster.id` | Primary-role nodes per cluster — should be exactly 1 in nominal state. |
| `otopcua.redundancy.secondary_count` | ObservableGauge<long> | `cluster.id` | Secondary-role nodes per cluster. |
| `otopcua.redundancy.stale_count` | ObservableGauge<long> | `cluster.id` | Nodes whose `LastSeenAt` exceeded the stale threshold. |
```bash
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u opc.tcp://localhost:4840/LmxOpcUa
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u opc.tcp://localhost:4841/LmxOpcUa
```
Admin `Program.cs` wires OpenTelemetry to the Prometheus exporter when `Metrics:Prometheus:Enabled=true` (default), exposing the meter under `/metrics`. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed.
The command reads the following standard OPC UA nodes and displays their values:
## Real-time notifications (Admin UI)
- **Redundancy Mode** -- from `Server_ServerRedundancy_RedundancySupport` (None, Warm, or Hot)
- **Service Level** -- from `Server_ServiceLevel` (0-255)
- **Server URIs** -- from `Server_ServerRedundancy_ServerUriArray` (list of ApplicationUri values in the redundant set)
- **Application URI** -- from `Server_ServerArray` (this instance's ApplicationUri)
`FleetStatusPoller` in `src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/` polls the `ClusterNode` table, records role transitions, updates `RedundancyMetrics.SetClusterCounts`, and pushes a `RoleChanged` SignalR event onto `FleetStatusHub` when a transition is observed. `RedundancyTab.razor` subscribes with `_hub.On<RoleChangedMessage>("RoleChanged", …)` so connected Admin sessions see role swaps the moment they happen.
Example output for a healthy Primary:
## Configuring a redundant pair
```
Redundancy Mode: Warm
Service Level: 200
Server URIs:
- urn:localhost:LmxOpcUa:instance1
- urn:localhost:LmxOpcUa:instance2
Application URI: urn:localhost:LmxOpcUa:instance1
```
Redundancy is configured **in the Config DB, not appsettings.json**. The fields that must differ between the two instances:
The command also supports `--username`/`--password` and `--security` options for authenticated or encrypted connections.
| Field | Location | Instance 1 | Instance 2 |
|---|---|---|---|
| `NodeId` | `appsettings.json` `Node:NodeId` (bootstrap) | `node-a` | `node-b` |
| `ClusterNode.ApplicationUri` | Config DB | `urn:node-a:OtOpcUa` | `urn:node-b:OtOpcUa` |
| `ClusterNode.RedundancyRole` | Config DB | `Primary` | `Secondary` |
| `ClusterNode.ServiceLevelBase` | Config DB | typically 255 | typically 100 |
### Client Failover with `-F`
Shared between instances: `ClusterId`, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances.
All CLI commands support the `-F` / `--failover-urls` flag for automatic client-side failover. When provided, the CLI tries the primary endpoint first and falls back to the listed URLs if the primary is unreachable.
Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI `RedundancyTab` — the operator edits the `ClusterNode` row in a draft generation and publishes. `RedundancyCoordinator.RefreshAsync` picks up the new topology without a process restart.
```bash
# Connect with failover — uses secondary if primary is down
dotnet run -- connect -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa
## Client-side failover
# Subscribe with live failover — reconnects to secondary if primary drops mid-stream
dotnet run -- subscribe -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa \
-n "ns=1;s=TestMachine_001.MachineID"
```
The OtOpcUa Client CLI at `src/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md) for the command reference.
For long-running commands (`subscribe`), the CLI monitors the session KeepAlive and automatically reconnects to the next available server when the current session drops. The subscription is re-created on the new server.
## Depth reference
## Troubleshooting
**Mismatched ServerUris between instances** -- Both instances must list the exact same set of ApplicationUri values in `Redundancy.ServerUris`. If they differ, clients may not discover the full redundant set. Check the startup log for the `Redundancy.ServerUris` line on each instance.
**ServiceLevel stuck at 255** -- This indicates redundancy is not enabled. When `Redundancy.Enabled` is false (the default), the server always reports `ServiceLevel = 255` and `RedundancySupport.None`. Verify that `Redundancy.Enabled` is set to `true` in the configuration and that the configuration section is correctly bound.
**ApplicationUri not set** -- The configuration validator rejects startup when redundancy is enabled but `OpcUa.ApplicationUri` is null or empty. Each instance must have a unique ApplicationUri. Check the error log for: `OpcUa.ApplicationUri must be set when redundancy is enabled`.
**Both servers report the same ServiceLevel** -- Verify that one instance has `Redundancy.Role` set to `Primary` and the other to `Secondary`. Both set to `Primary` (or both to `Secondary`) will produce identical baseline values, preventing clients from distinguishing the preferred server.
**ServerUriArray not readable** -- When `RedundancySupport` is `None` (redundancy disabled), the OPC UA SDK may not expose the `ServerUriArray` node or it may return an empty value. The CLI `redundancy` command handles this gracefully by catching the read error. Enable redundancy to populate this array.
For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see `docs/v2/plan.md` §Phase 6.3.

View File

@@ -2,189 +2,132 @@
## Overview
The service runs as a Windows service or console application using TopShelf for lifecycle management. It targets .NET Framework 4.8 with an x86 (32-bit) platform target, which is required for MXAccess COM interop with the ArchestrA runtime DLLs.
A production OtOpcUa deployment runs **three processes**, each with a distinct runtime, platform target, and install surface:
## TopShelf Configuration
| Process | Project | Runtime | Platform | Responsibility |
|---|---|---|---|---|
| **OtOpcUa Server** | `src/ZB.MOM.WW.OtOpcUa.Server` | .NET 10 | x64 | Hosts the OPC UA endpoint; loads every non-Galaxy driver in-process; exposes `/healthz`. |
| **OtOpcUa Admin** | `src/ZB.MOM.WW.OtOpcUa.Admin` | .NET 10 (ASP.NET Core / Blazor Server) | x64 | Operator UI for Config DB editing + fleet status, SignalR hubs (`FleetStatusHub`, `AlertHub`), Prometheus `/metrics`. |
| **OtOpcUa Galaxy.Host** | `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host` | .NET Framework 4.8 | x86 (32-bit) | Hosts MXAccess COM on a dedicated STA thread with a Win32 message pump; exposes a named-pipe IPC surface consumed by `Driver.Galaxy.Proxy` inside the Server process. |
`Program.Main()` configures TopShelf to manage the `OpcUaService` lifecycle:
The x86 / .NET Framework 4.8 constraint applies **only** to Galaxy.Host because the MXAccess toolkit DLLs (`Program Files (x86)\ArchestrA\Framework\bin`) are 32-bit-only COM. Every other driver (Modbus, S7, OpcUaClient, AbCip, AbLegacy, TwinCAT, FOCAS) runs in-process in the 64-bit Server.
## Server process
`src/ZB.MOM.WW.OtOpcUa.Server/Program.cs` uses the generic host:
```csharp
var exitCode = HostFactory.Run(host =>
{
host.UseSerilog();
host.Service<OpcUaService>(svc =>
{
svc.ConstructUsing(() => new OpcUaService());
svc.WhenStarted(s => s.Start());
svc.WhenStopped(s => s.Stop());
});
host.SetServiceName("LmxOpcUa");
host.SetDisplayName("LMX OPC UA Server");
host.SetDescription("OPC UA server exposing System Platform Galaxy tags via MXAccess.");
host.RunAsLocalSystem();
host.StartAutomatically();
});
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddSerilog();
builder.Services.AddWindowsService(o => o.ServiceName = "OtOpcUa");
builder.Services.AddHostedService<OpcUaServerService>();
builder.Services.AddHostedService<HostStatusPublisher>();
```
TopShelf provides these deployment modes from the same executable:
`OpcUaServerService` is a `BackgroundService` (decision #30 — TopShelf from v1 was replaced by the generic-host `AddWindowsService` wrapper; no TopShelf dependency remains in any csproj). It owns:
| Command | Description |
|---------|-------------|
| `OtOpcUa.Host.exe` | Run as a console application (foreground) |
| `OtOpcUa.Host.exe install` | Install as a Windows service |
| `OtOpcUa.Host.exe uninstall` | Remove the Windows service |
| `OtOpcUa.Host.exe start` | Start the installed service |
| `OtOpcUa.Host.exe stop` | Stop the installed service |
1. Config bootstrap — reads `Node:NodeId`, `Node:ClusterId`, `Node:ConfigDbConnectionString`, `Node:LocalCachePath` from `appsettings.json`.
2. `NodeBootstrap` — pulls the latest published generation from the Config DB into the LiteDB local cache (`LiteDbConfigCache`) so the node starts even if the central DB is briefly unreachable.
3. `DriverHost` — instantiates configured driver instances from the generation, wires each through `CapabilityInvoker` resilience pipelines.
4. `OpcUaApplicationHost` — builds the OPC UA endpoint, applies `OpcUaServerOptions` + `LdapOptions`, registers `AuthorizationGate` at dispatch.
5. `HostStatusPublisher` — a second hosted service that heartbeats `DriverHostStatus` rows so the Admin UI Fleet view sees the node.
The service is configured to run as `LocalSystem` and start automatically on boot.
### Installation
## Working Directory
Same executable, different modes driven by the .NET generic-host `AddWindowsService` wrapper:
Before configuring Serilog, `Program.Main()` sets the working directory to the executable's location:
| Mode | Invocation |
|---|---|
| Console | `ZB.MOM.WW.OtOpcUa.Server.exe` |
| Install as Windows service | `sc create OtOpcUa binPath="C:\Program Files\OtOpcUa\Server\ZB.MOM.WW.OtOpcUa.Server.exe" start=auto` |
| Start | `sc start OtOpcUa` |
| Stop | `sc stop OtOpcUa` |
| Uninstall | `sc delete OtOpcUa` |
```csharp
Environment.CurrentDirectory = AppDomain.CurrentDomain.BaseDirectory;
```
### Health endpoints
This is necessary because Windows services default their working directory to `System32`, which would cause relative log paths and `appsettings.json` to resolve incorrectly.
The Server exposes `/healthz` + `/readyz` used by (a) the Admin `FleetStatusPoller` as input to Fleet status and (b) `PeerReachabilityTracker` in a peer Server process as the HTTP side of the peer-reachability probe.
## Startup Sequence
## Admin process
`OpcUaService.Start()` executes the following steps in order. If any required step fails, the service logs the error and throws, preventing a partially initialized state.
`src/ZB.MOM.WW.OtOpcUa.Admin/Program.cs` is a stock `WebApplication`. Highlights:
1. **Load configuration** -- The production constructor reads `appsettings.json`, optional environment overlay, and environment variables, then binds each section to its typed configuration class.
2. **Validate configuration** -- `ConfigurationValidator.ValidateAndLog()` logs all resolved values and checks required constraints (port range, non-empty names and connection strings). If validation fails, the service throws `InvalidOperationException`.
3. **Register exception handler** -- Registers `AppDomain.CurrentDomain.UnhandledException` to log fatal unhandled exceptions with `IsTerminating` context.
4. **Create performance metrics** -- Creates the `PerformanceMetrics` instance and a `CancellationTokenSource` for coordinating shutdown.
5. **Create and connect MXAccess client** -- Starts the STA COM thread, creates the `MxAccessClient`, and attempts an initial connection. If the connection fails, the service logs a warning and continues -- the monitor loop will retry in the background.
6. **Start MXAccess monitor** -- Starts the connectivity monitor loop that probes the runtime connection at the configured interval and handles auto-reconnect.
7. **Test Galaxy repository connection** -- Calls `TestConnectionAsync()` on the Galaxy repository to verify the SQL Server database is reachable. If it fails, the service continues without initial address-space data.
8. **Create OPC UA server host** -- Creates `OpcUaServerHost` with the effective MXAccess client (real, override, or null fallback), performance metrics, and an optional `IHistorianDataSource` obtained from `HistorianPluginLoader.TryLoad` when `Historian.Enabled=true` (returns `null` if the plugin is absent or fails to load).
9. **Query Galaxy hierarchy** -- Fetches the object hierarchy and attribute definitions from the Galaxy repository database, recording object and attribute counts.
10. **Start server and build address space** -- Starts the OPC UA server, retrieves the `LmxNodeManager`, and calls `BuildAddressSpace()` with the queried hierarchy and attributes. If the query or build fails, the server still starts with an empty address space.
11. **Start change detection** -- Creates and starts `ChangeDetectionService`, which polls `galaxy.time_of_last_deploy` at the configured interval. When a change is detected, it triggers an address-space rebuild via the `OnGalaxyChanged` event.
12. **Start status dashboard** -- Creates the `HealthCheckService` and `StatusReportService`, wires in all live components, and starts the `StatusWebServer` HTTP listener if the dashboard is enabled. If `StatusWebServer.Start()` returns `false` (port already bound, insufficient permissions, etc.), the service logs a warning, disposes the unstarted instance, sets `OpcUaService.DashboardStartFailed = true`, and continues in degraded mode. Matches the warning-continue policy applied to MxAccess connect, Galaxy DB connect, and initial address space build. Stability review 2026-04-13 Finding 2.
13. **Log startup complete** -- Logs "LmxOpcUa service started successfully" at `Information` level.
- Cookie auth (`CookieAuthenticationDefaults`, scheme name `OtOpcUa.Admin`) + Blazor Server (`AddInteractiveServerComponents`) + SignalR.
- Authorization policies gated by `AdminRoles`: `ConfigViewer`, `ConfigEditor`, `FleetAdmin` (see `Services/AdminRoles.cs`). `CanEdit` policy requires `ConfigEditor` or `FleetAdmin`; `CanPublish` requires `FleetAdmin`.
- `OtOpcUaConfigDbContext` registered against `ConnectionStrings:ConfigDb`.
- Scoped services: `ClusterService`, `GenerationService`, `EquipmentService`, `UnsService`, `NamespaceService`, `DriverInstanceService`, `NodeAclService`, `PermissionProbeService`, `AclChangeNotifier`, `ReservationService`, `DraftValidationService`, `AuditLogService`, `HostStatusService`, `ClusterNodeService`, `EquipmentImportBatchService`, `ILdapGroupRoleMappingService`.
- Singleton `RedundancyMetrics` (meter name `ZB.MOM.WW.OtOpcUa.Redundancy`) + `CertTrustService` (promotes rejected client certs in the Server's PKI store to trusted via the Admin Certificates page).
- `LdapAuthService` bound to `Authentication:Ldap` — same LDAP flow as ScadaLink CentralUI for visual parity.
- SignalR hubs mapped at `/hubs/fleet` and `/hubs/alerts`; `FleetStatusPoller` runs as a hosted service and pushes `RoleChanged`, host status, and alert events.
- OpenTelemetry → Prometheus exporter at `/metrics` when `Metrics:Prometheus:Enabled=true` (default). Pull-based means no Collector required in the common K8s deploy.
## Shutdown Sequence
### Installation
`OpcUaService.Stop()` tears down components in reverse dependency order:
Deployed as an ASP.NET Core service; the generic-host `AddWindowsService` wrapper (or IIS reverse-proxy for multi-node fleets) provides install/uninstall. Listens on whatever `ASPNETCORE_URLS` specifies.
1. **Cancel operations** -- Signals the `CancellationTokenSource` to stop all background loops.
2. **Stop change detection** -- Stops the Galaxy deploy polling loop.
3. **Stop OPC UA server** -- Shuts down the OPC UA server host, disconnecting all client sessions.
4. **Stop MXAccess monitor** -- Stops the connectivity monitor loop.
5. **Disconnect MXAccess** -- Disconnects the MXAccess client and releases COM resources.
6. **Dispose STA thread** -- Shuts down the dedicated STA COM thread and its message pump.
7. **Stop dashboard** -- Disposes the `StatusWebServer` HTTP listener.
8. **Dispose metrics** -- Releases the performance metrics collector.
9. **Dispose change detection** -- Releases the change detection service.
10. **Unregister exception handler** -- Removes the `AppDomain.UnhandledException` handler.
## Galaxy.Host process
The entire shutdown is wrapped in a `try/catch` that logs warnings for errors during cleanup, ensuring the service exits even if a component fails to dispose cleanly.
`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Program.cs` is a .NET Framework 4.8 x86 console executable. Configuration comes from environment variables supplied by the supervisor (`Driver.Galaxy.Proxy.Supervisor`):
## Error Handling
| Env var | Purpose |
|---|---|
| `OTOPCUA_GALAXY_PIPE` | Pipe name the host listens on (default `OtOpcUaGalaxy`). |
| `OTOPCUA_ALLOWED_SID` | SID of the Server process's principal; anyone else is refused during the handshake. |
| `OTOPCUA_GALAXY_SECRET` | Per-spawn shared secret the client must present in the Hello frame. |
| `OTOPCUA_GALAXY_BACKEND` | `mxaccess` (default), `db` (ZB-only, no COM), `stub` (in-memory; for tests). |
| `OTOPCUA_GALAXY_ZB_CONN` | SQL connection string to the ZB Galaxy repository. |
| `OTOPCUA_HISTORIAN_*` | Optional Wonderware Historian SDK config if Historian is enabled for this node. |
### Unhandled exceptions
The host spins up `StaPump` (the STA thread with message pump), creates the MXAccess `LMXProxyServer` COM object on that thread, and handles all COM calls there; the IPC layer marshals work items via `PostThreadMessage`.
`AppDomain.CurrentDomain.UnhandledException` is registered at startup and removed at shutdown. The handler logs the exception at `Fatal` level with the `IsTerminating` flag:
### Pipe security
```csharp
Log.Fatal(e.ExceptionObject as Exception,
"Unhandled exception (IsTerminating={IsTerminating})", e.IsTerminating);
```
`PipeServer` builds a `PipeAcl` from the provided `SecurityIdentifier` + uses `NamedPipeServerStream` with `maxNumberOfServerInstances: 1`. The handshake requires a matching shared secret in the first Hello frame; callers whose SID doesn't match `OTOPCUA_ALLOWED_SID` are rejected before any frame is processed. **By design the pipe ACL denies BUILTIN\Administrators** — live smoke tests must therefore run from a non-elevated shell that matches the allowed principal. The installed dev host (`OtOpcUaGalaxyHost`) runs as `dohertj2` with the secret at `.local/galaxy-host-secret.txt`.
### Startup resilience
### Installation
The startup sequence is designed to degrade gracefully rather than fail entirely:
- If MXAccess connection fails, the service continues with a `NullMxAccessClient` that returns bad-quality values for all reads.
- If the Galaxy repository database is unreachable, the OPC UA server starts with an empty address space.
- If the status dashboard port is in use, the dashboard logs a warning and does not start, but the OPC UA server continues.
### Fatal startup failure
If a critical step (configuration validation, OPC UA server start) throws, `Start()` catches the exception, logs it at `Fatal`, and re-throws to let TopShelf report the failure.
## Logging
The service uses Serilog with two sinks configured in `Program.Main()`:
```csharp
Log.Logger = new LoggerConfiguration()
.MinimumLevel.Information()
.WriteTo.Console()
.WriteTo.File(
path: "logs/lmxopcua-.log",
rollingInterval: RollingInterval.Day,
retainedFileCountLimit: 31)
.CreateLogger();
```
| Sink | Details |
|------|---------|
| Console | Writes to stdout, useful when running as a console application |
| Rolling file | Writes to `logs/lmxopcua-{date}.log`, rolls daily, retains 31 days of history |
Log files are written relative to the executable directory (see Working Directory above). Each component creates its own contextual logger using `Log.ForContext<T>()` or `Log.ForContext(typeof(T))`.
`Log.CloseAndFlush()` is called in the `finally` block of `Program.Main()` to ensure all buffered log entries are written before process exit.
## Multi-Instance Deployment
The service supports running multiple instances for redundancy. Each instance requires:
- A unique Windows service name (e.g., `LmxOpcUa`, `LmxOpcUa2`)
- A unique OPC UA port and dashboard port
- A unique `OpcUa.ApplicationUri` and `OpcUa.ServerName`
- A unique `MxAccess.ClientName`
- Matching `Redundancy.ServerUris` arrays on all instances
Install additional instances using TopShelf's `-servicename` flag:
NSSM-wrapped (the Non-Sucking Service Manager) because the executable itself is a plain console app, not a `ServiceBase` Windows service. The supervisor then adopts the child process over the pipe after install. Install/uninstall commands follow the NSSM pattern:
```bash
cd C:\publish\lmxopcua\instance2
ZB.MOM.WW.OtOpcUa.Host.exe install -servicename "LmxOpcUa2" -displayname "LMX OPC UA Server (Instance 2)"
nssm install OtOpcUaGalaxyHost "C:\Program Files (x86)\OtOpcUa\Galaxy.Host\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe"
nssm set OtOpcUaGalaxyHost ObjectName .\dohertj2 <password>
nssm set OtOpcUaGalaxyHost AppEnvironmentExtra OTOPCUA_GALAXY_BACKEND=mxaccess OTOPCUA_GALAXY_SECRET=OTOPCUA_ALLOWED_SID=
nssm start OtOpcUaGalaxyHost
```
See [Redundancy Guide](Redundancy.md) for full deployment details.
(Exact values for the environment block are generated by the Admin UI + committed alongside `.local/galaxy-host-secret.txt` on the dev box.)
## Required Runtime Assemblies
The build uses Costura.Fody to embed all NuGet dependencies into the single `ZB.MOM.WW.OtOpcUa.Host.exe`. The only native dependency that must sit alongside the executable in every deployment is the MXAccess COM toolkit:
| Assembly | Purpose |
|----------|---------|
| `ArchestrA.MxAccess.dll` | MXAccess COM interop — runtime data access to Galaxy tags |
The Wonderware Historian SDK is packaged as a **runtime-loaded plugin** so hosts that will not use historical data access do not need the SDK installed. The plugin lives in a `Historian/` subfolder next to `ZB.MOM.WW.OtOpcUa.Host.exe`:
## Inter-process communication
```
ZB.MOM.WW.OtOpcUa.Host.exe
ArchestrA.MxAccess.dll
Historian/
ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll
aahClientManaged.dll
aahClientCommon.dll
aahClient.dll
Historian.CBE.dll
Historian.DPAPI.dll
ArchestrA.CloudHistorian.Contract.dll
┌──────────────────────────┐ LDAP bind (Authentication:Ldap) ┌──────────────────────────┐
│ OtOpcUa Admin (x64) │ ─────────────────────────────────────────────▶│ LDAP / AD │
│ Blazor Server + SignalR │ └──────────────────────────┘
/metrics (Prometheus) │ FleetStatusPoller → ClusterNode poll
│ ─────────────────────────────────────────────▶┌──────────────────────────┐
│ Cluster/Generation/ACL writes │ Config DB (SQL Server) │
└──────────────────────────┘ ─────────────────────────────────────────────▶│ OtOpcUaConfigDbContext │
▲ └──────────────────────────┘
│ SignalR ▲
│ (role change, │ sp_GetCurrentGenerationForCluster
│ host status, │ sp_PublishGeneration
│ alerts) │
┌──────────────────────────┐ │
│ OtOpcUa Server (x64) │ ──────────────────────────────────────────────────────────┘
│ OPC UA endpoint │
│ Non-Galaxy drivers │ Named pipe (OtOpcUaGalaxy) ┌──────────────────────────┐
│ Driver.Galaxy.Proxy │ ─────────────────────────────────────────────▶│ Galaxy.Host (x86 .NFx) │
│ │ SID + shared-secret handshake │ STA + message pump │
│ /healthz /readyz │ │ MXAccess COM │
└──────────────────────────┘ │ Historian SDK (opt) │
└──────────────────────────┘
```
At startup, if `Historian.Enabled=true` in `appsettings.json`, `HistorianPluginLoader` probes `Historian/ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll` via `Assembly.LoadFrom` and instantiates the plugin's entry point. An `AppDomain.AssemblyResolve` handler redirects the SDK assembly lookups (`aahClientManaged`, `aahClientCommon`, …) to the same subfolder so the CLR can resolve them when the plugin first JITs. If the plugin directory is absent or any SDK dependency fails to load, the loader logs a warning and the server continues to run with history support disabled — `LmxNodeManager` returns `BadHistoryOperationUnsupported` for every history call.
## appsettings.json boundary
Deployment matrix:
Each process reads its own `appsettings.json` for **bootstrap only** — connection strings, LDAP bind config, transport security profile, redundancy node id, logging. The authoritative configuration tree (drivers, UNS, tags, ACLs) lives in the Config DB and is edited through the Admin UI. See [`Configuration.md`](Configuration.md) for the split.
| Scenario | Host exe | `ArchestrA.MxAccess.dll` | `Historian/` subfolder |
|----------|----------|--------------------------|------------------------|
| `Historian.Enabled=false` | required | required | **omit** |
| `Historian.Enabled=true` | required | required | required |
## Development bootstrap
`ArchestrA.MxAccess.dll` and the historian SDK DLLs are not redistributable — they are provided by the AVEVA System Platform and Historian installations on the target machine. The copies in `lib/` are taken from `Program Files (x86)\ArchestrA\Framework\bin` on a machine with the platform installed.
## Platform Target
The service must be compiled and run as x86 (32-bit). The MXAccess COM toolkit DLLs in `Program Files (x86)\ArchestrA\Framework\bin` are 32-bit only. Running the service as x64 or AnyCPU (64-bit preferred) causes COM interop failures when creating the `LMXProxyServer` object on the STA thread.
For the Windows install steps (SQL Server in Docker, .NET 10 SDK, .NET Framework 4.8 SDK, Docker Desktop WSL 2 backend, EF Core CLI, first-run migration), see [`docs/v2/dev-environment.md`](v2/dev-environment.md).

View File

@@ -1,274 +1,16 @@
# Status Dashboard
# Status Dashboard — Superseded
## Overview
This document has been superseded.
The service hosts an embedded HTTP status dashboard that surfaces real-time health, connection state, subscription counts, data change throughput, and Galaxy metadata. Operators access it through a browser to verify the bridge is functioning without needing an OPC UA client. The dashboard is enabled by default on port 8081 and can be disabled via configuration.
The single-process, HTTP-listener "Status Dashboard" (`StatusWebServer` bound to port 8081) belonged to v1 LmxOpcUa, where one process owned the OPC UA endpoint, the MXAccess bridge, and the operator surface. In the multi-process OtOpcUa platform the operator surface has moved into the **OtOpcUa Admin** app — a Blazor Server UI that talks to the shared Config DB and to every deployed node over SignalR (`FleetStatusHub`, `AlertHub`). Prometheus scraping lives on the Admin app's `/metrics` endpoint via OpenTelemetry (`Metrics:Prometheus:Enabled`).
## HTTP Server
Operator surfaces now covered by the Admin UI:
`StatusWebServer` wraps a `System.Net.HttpListener` bound to `http://+:{port}/`. It starts a background task that accepts requests in a loop and dispatches them by path. Only `GET` requests are accepted; all other methods return `405 Method Not Allowed`. Responses include `Cache-Control: no-cache` headers to prevent stale data in the browser.
- Fleet health, per-node role/ServiceLevel, crash-loop detection (`Fleet.razor`, `Hosts.razor`, `FleetStatusPoller`)
- Redundancy state + role transitions (`RedundancyMetrics`, `otopcua.redundancy.*`)
- Cluster + node + credential management (`ClusterService`, `ClusterNodeService`)
- Draft/publish generation editor, diff viewer, CSV import, UnsTab, IdentificationFields, RedundancyTab, AclsTab with Probe-this-permission
- Certificate trust management (`CertTrustService` promotes rejected client certs to trusted)
- Audit log viewer (`AuditLogService`)
### Endpoints
| Path | Content-Type | Description |
|------|-------------|-------------|
| `/` | `text/html` | Operator dashboard with auto-refresh |
| `/health` | `text/html` | Focused health page with service-level badge and component cards |
| `/api/status` | `application/json` | Full status snapshot as JSON (`StatusData`) |
| `/api/health` | `application/json` | Health endpoint (`HealthEndpointData`) -- returns `503` when status is `Unhealthy`, `200` otherwise |
Any other path returns `404 Not Found`.
## Health Check Logic
`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, 2d, and 2e only fire when the corresponding integration is enabled and a non-null snapshot is passed:
1. **Rule 1 -- Unhealthy**: MXAccess connection state is not `Connected`. Returns a red banner with the current state.
2. **Rule 2b -- Degraded**: `Historian.Enabled=true` but the plugin load outcome is not `Loaded`. Returns a yellow banner citing the plugin status (`NotFound`, `LoadFailed`) and the error message if one is available.
3. **Rule 2 / 2c -- Degraded**: Any recorded operation has a low success rate. The sample threshold depends on the operation category:
- Regular operations (`Read`, `Write`, `Subscribe`, `AlarmAcknowledge`): >100 invocations and <50% success rate.
- Historian operations (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads.
4. **Rule 2d -- Degraded (latched)**: `AlarmTrackingEnabled=true` and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts.
5. **Rule 2e -- Degraded**: `RuntimeStatus.StoppedCount > 0` -- at least one Galaxy runtime host (`$WinPlatform` / `$AppEngine`) is currently reported Stopped by the runtime probe manager. The rule names the stopped hosts in the message. Ordered after Rule 1 so an MxAccess transport outage stays `Unhealthy` via Rule 1 and this rule never double-messages; the probe manager also forces every entry to `Unknown` when the transport is disconnected, so the `StoppedCount` is always 0 in that case.
6. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational."
The `/api/health` endpoint returns `200` for both Healthy and Degraded states, and `503` only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection.
## Status Data Model
`StatusReportService` aggregates data from all bridge components into a `StatusData` DTO, which is then rendered as HTML or serialized to JSON. The DTO contains the following sections:
### Connection
| Field | Type | Description |
|-------|------|-------------|
| `State` | `string` | Current MXAccess connection state (Connected, Disconnected, Connecting) |
| `ReconnectCount` | `int` | Number of reconnect attempts since startup |
| `ActiveSessions` | `int` | Number of active OPC UA client sessions |
### Health
| Field | Type | Description |
|-------|------|-------------|
| `Status` | `string` | Healthy, Degraded, or Unhealthy |
| `Message` | `string` | Operator-facing explanation |
| `Color` | `string` | CSS color token (green, yellow, red, gray) |
### Subscriptions
| Field | Type | Description |
|-------|------|-------------|
| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions (includes bridge-owned runtime status probes — see `ProbeCount`) |
| `ProbeCount` | `int` | Subset of `ActiveCount` attributable to bridge-owned runtime status probes (`<Host>.ScanState` per deployed `$WinPlatform` / `$AppEngine`). Rendered as a separate `Probes: N (bridge-owned runtime status)` line on the dashboard so operators can distinguish probe overhead from client-driven subscription load |
### Galaxy
| Field | Type | Description |
|-------|------|-------------|
| `GalaxyName` | `string` | Name of the Galaxy being bridged |
| `DbConnected` | `bool` | Whether the Galaxy repository database is reachable |
| `LastDeployTime` | `DateTime?` | Most recent deploy timestamp from the Galaxy |
| `ObjectCount` | `int` | Number of Galaxy objects in the address space |
| `AttributeCount` | `int` | Number of Galaxy attributes as OPC UA variables |
| `LastRebuildTime` | `DateTime?` | UTC timestamp of the last completed address-space rebuild |
### Data change
| Field | Type | Description |
|-------|------|-------------|
| `EventsPerSecond` | `double` | Rate of MXAccess data change events per second |
| `AvgBatchSize` | `double` | Average items processed per dispatch cycle |
| `PendingItems` | `int` | Items waiting in the dispatch queue |
| `TotalEvents` | `long` | Total MXAccess data change events since startup |
### Galaxy Runtime
Populated from the `GalaxyRuntimeProbeManager` that advises `<Host>.ScanState` on every deployed `$WinPlatform` and `$AppEngine`. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) for the probe machinery, state machine, and the subtree quality invalidation that fires on transitions. Disabled when `MxAccess.RuntimeStatusProbesEnabled = false`; the panel is suppressed entirely from the HTML when `Total == 0`.
| Field | Type | Description |
|-------|------|-------------|
| `Total` | `int` | Number of runtime hosts tracked (Platforms + AppEngines) |
| `RunningCount` | `int` | Hosts whose last probe callback reported `ScanState = true` with Good quality |
| `StoppedCount` | `int` | Hosts whose last probe callback reported `ScanState != true` or a failed item status, or whose initial probe timed out in Unknown state |
| `UnknownCount` | `int` | Hosts still awaiting initial probe resolution, or rewritten to Unknown when the MxAccess transport is Disconnected |
| `Hosts` | `List<GalaxyRuntimeStatus>` | Per-host detail rows, sorted alphabetically by `ObjectName` |
Each `GalaxyRuntimeStatus` entry:
| Field | Type | Description |
|-------|------|-------------|
| `ObjectName` | `string` | Galaxy `tag_name` of the host (e.g., `DevPlatform`, `DevAppEngine`) |
| `GobjectId` | `int` | Galaxy `gobject_id` of the host |
| `Kind` | `string` | `$WinPlatform` or `$AppEngine` |
| `State` | `enum` | `Unknown`, `Running`, or `Stopped` |
| `LastStateCallbackTime` | `DateTime?` | UTC time of the most recent probe callback, whether good or bad |
| `LastStateChangeTime` | `DateTime?` | UTC time of the most recent Running↔Stopped transition; backs the dashboard "Since" column |
| `LastScanState` | `bool?` | Last `ScanState` value received; `null` before the first callback |
| `LastError` | `string?` | Detail message from the most recent failure callback (e.g., `"ScanState = false (OffScan)"`); cleared on successful recovery |
| `GoodUpdateCount` | `long` | Cumulative count of `ScanState = true` callbacks |
| `FailureCount` | `long` | Cumulative count of `ScanState != true` callbacks or failed item statuses |
The HTML panel renders a per-host table with Name / Kind / State / Since / Last Error columns. Panel color reflects aggregate state: green when every host is `Running`, yellow when any host is `Unknown` with zero `Stopped`, red when any host is `Stopped`, gray when the MxAccess transport is disconnected (the Connection panel is the primary signal in that case and every row is force-rewritten to `Unknown`).
### Operations
A dictionary of `MetricsStatistics` keyed by operation name. Each entry contains:
- `TotalCount` -- total invocations
- `SuccessRate` -- fraction of successful operations
- `AverageMilliseconds`, `MinMilliseconds`, `MaxMilliseconds`, `Percentile95Milliseconds` -- latency distribution
The instrumented operation names are:
| Name | Source |
|---|---|
| `Read` | MXAccess live tag reads (`MxAccessClient.ReadWrite.cs`) |
| `Write` | MXAccess live tag writes |
| `Subscribe` | MXAccess subscription attach |
| `HistoryReadRaw` | `LmxNodeManager.HistoryReadRawModified` -> historian plugin |
| `HistoryReadProcessed` | `LmxNodeManager.HistoryReadProcessed` -> historian plugin (aggregates) |
| `HistoryReadAtTime` | `LmxNodeManager.HistoryReadAtTime` -> historian plugin (interpolated) |
| `HistoryReadEvents` | `LmxNodeManager.HistoryReadEvents` -> historian plugin (alarm/event history) |
| `AlarmAcknowledge` | `LmxNodeManager.OnAlarmAcknowledge` -> MXAccess AckMsg write |
New operation names are auto-registered on first use, so the `Operations` dictionary only contains entries for features that have actually been exercised since startup.
### Historian
`HistorianStatusInfo` -- reflects the outcome of the runtime-loaded historian plugin and the runtime query-health counters. See [Historical Data Access](HistoricalDataAccess.md) for the plugin architecture and the [Runtime Health Counters](HistoricalDataAccess.md#runtime-health-counters) section for the data source instrumentation.
| Field | Type | Description |
|-------|------|-------------|
| `Enabled` | `bool` | Whether `Historian.Enabled` is set in configuration |
| `PluginStatus` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` — load-time outcome from `HistorianPluginLoader.LastOutcome` |
| `PluginError` | `string?` | Exception message from the last load attempt when `PluginStatus=LoadFailed`; otherwise `null` |
| `PluginPath` | `string` | Absolute path the loader probed for the plugin assembly |
| `ServerName` | `string` | Legacy single-node hostname from `Historian.ServerName`; ignored when `ServerNames` is non-empty |
| `Port` | `int` | Configured historian TCP port |
| `QueryTotal` | `long` | Total historian read queries attempted since startup (raw + aggregate + at-time + events) |
| `QuerySuccesses` | `long` | Queries that completed without an exception |
| `QueryFailures` | `long` | Queries that raised an exception — each failure also triggers the plugin's reconnect path |
| `ConsecutiveFailures` | `int` | Failures since the last success. Resets to zero on any successful query. Drives the `Degraded` health rule at threshold 3 |
| `LastSuccessTime` | `DateTime?` | UTC timestamp of the most recent successful query, or `null` when no query has succeeded since startup |
| `LastFailureTime` | `DateTime?` | UTC timestamp of the most recent failure |
| `LastQueryError` | `string?` | Exception message from the most recent failure. Prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call failed |
| `ProcessConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **process** silo (historical value queries — `ReadRaw`, `ReadAggregate`, `ReadAtTime`). See [Two SDK connection silos](HistoricalDataAccess.md#two-sdk-connection-silos) |
| `EventConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **event** silo (alarm history queries — `ReadEvents`). Separate from the process connection because the SDK requires distinct query channels |
| `ActiveProcessNode` | `string?` | Cluster node currently serving the process silo, or `null` when no process connection is open |
| `ActiveEventNode` | `string?` | Cluster node currently serving the event silo, or `null` when no event connection is open |
| `NodeCount` | `int` | Total configured historian cluster nodes. 1 for a legacy single-node deployment |
| `HealthyNodeCount` | `int` | Nodes currently eligible for new connections (not in failure cooldown) |
| `Nodes` | `List<HistorianClusterNodeState>` | Per-node cluster state in configuration order. Each entry carries `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime` |
The operator dashboard renders a cluster table inside the Historian panel when `NodeCount > 1`. Legacy single-node deployments render a compact `Node: <hostname>` line and no table. Panel color reflects combined load-time + runtime health: green when everything is fine, yellow when any cluster node is in cooldown or 1-4 consecutive query failures are accumulated, red when the plugin is unloaded / all cluster nodes are failed / 5+ consecutive failures.
### Alarms
`AlarmStatusInfo` -- surfaces alarm-condition tracking health and dispatch counters.
| Field | Type | Description |
|-------|------|-------------|
| `TrackingEnabled` | `bool` | Whether `OpcUa.AlarmTrackingEnabled` is set in configuration |
| `ConditionCount` | `int` | Number of distinct alarm conditions currently tracked |
| `ActiveAlarmCount` | `int` | Number of alarms currently in the `InAlarm=true` state |
| `TransitionCount` | `long` | Total `InAlarm` transitions observed in the dispatch loop since startup |
| `AckEventCount` | `long` | Total alarm acknowledgement transitions observed since startup |
| `AckWriteFailures` | `long` | Total MXAccess AckMsg writes that have failed while processing alarm acknowledges. Any non-zero value latches the service into Degraded (see Rule 2d). |
| `FilterEnabled` | `bool` | Whether `OpcUa.AlarmFilter.ObjectFilters` has any patterns configured |
| `FilterPatternCount` | `int` | Number of compiled filter patterns (after comma-splitting and trimming) |
| `FilterIncludedObjectCount` | `int` | Number of Galaxy objects included by the filter during the most recent address-space build. Zero when the filter is disabled. |
When the filter is active, the operator dashboard's Alarms panel renders an extra line `Filter: N pattern(s), M object(s) included` so operators can verify scope at a glance. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) for the matching rules and resolution algorithm.
### Redundancy
`RedundancyInfo` -- only populated when `Redundancy.Enabled=true` in configuration. Shows mode, role, computed service level, application URI, and the set of peer server URIs. See [Redundancy](Redundancy.md) for the full guide.
### Footer
| Field | Type | Description |
|-------|------|-------------|
| `Timestamp` | `DateTime` | UTC time when the snapshot was generated |
| `Version` | `string` | Service assembly version |
## `/api/health` Payload
The health endpoint returns a `HealthEndpointData` document distinct from the full dashboard snapshot. It is designed for load balancers and external monitoring probes that only need an up/down signal plus component-level detail:
| Field | Type | Description |
|-------|------|-------------|
| `Status` | `string` | `Healthy`, `Degraded`, or `Unhealthy` (drives the HTTP status code) |
| `ServiceLevel` | `byte` | OPC UA-style 0-255 service level. 255 when healthy non-redundant; 0 when MXAccess is down; redundancy-adjusted otherwise |
| `RedundancyEnabled` | `bool` | Whether redundancy is configured |
| `RedundancyRole` | `string?` | `Primary` or `Secondary` when redundancy is enabled; `null` otherwise |
| `RedundancyMode` | `string?` | `Warm` or `Hot` when redundancy is enabled; `null` otherwise |
| `Components.MxAccess` | `string` | `Connected` or `Disconnected` |
| `Components.Database` | `string` | `Connected` or `Disconnected` |
| `Components.OpcUaServer` | `string` | `Running` or `Stopped` |
| `Components.Historian` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` -- matches `HistorianStatusInfo.PluginStatus` |
| `Components.Alarms` | `string` | `Disabled` or `Enabled` -- mirrors `OpcUa.AlarmTrackingEnabled` |
| `Uptime` | `string` | Formatted service uptime (e.g., `3d 5h 20m`) |
| `Timestamp` | `DateTime` | UTC time the snapshot was generated |
Monitoring tools should:
- Alert on `Status=Unhealthy` (HTTP 503) for hard outages.
- Alert on `Status=Degraded` (HTTP 200) for latched or cumulative failures -- a degraded status means the server is still operating but a subsystem needs attention (historian plugin missing, alarm ack writes failing, history read error rate too high, etc.).
## HTML Dashboards
### `/` -- Operator dashboard
Monospace, dark background, color-coded panels. Panels: Connection, Health, Redundancy (when enabled), Subscriptions, Data Change Dispatch, Galaxy Info, **Historian**, **Alarms**, Operations (table), Footer. Each panel border color reflects component state (green, yellow, red, or gray).
The page includes a `<meta http-equiv='refresh'>` tag set to the configured `RefreshIntervalSeconds` (default 10 seconds), so the browser polls automatically without JavaScript.
### `/health` -- Focused health view
Large status badge, computed `ServiceLevel` value, redundancy summary (when enabled), and a row of component cards: MXAccess, Galaxy Database, OPC UA Server, **Historian**, **Alarm Tracking**. Each card turns red when its component is in a failure state and grey when disabled. Best for wallboards and quick at-a-glance monitoring.
## Configuration
The dashboard is configured through the `Dashboard` section in `appsettings.json`:
```json
{
"Dashboard": {
"Enabled": true,
"Port": 8081,
"RefreshIntervalSeconds": 10
}
}
```
Setting `Enabled` to `false` prevents the `StatusWebServer` from starting. The `StatusReportService` is still created so that other components can query health programmatically, but no HTTP listener is opened.
### Dashboard start failures are non-fatal
If the dashboard is enabled but the configured port is already bound (e.g., a previous instance did not clean up, another service is squatting on the port, or the user lacks URL-reservation rights), `StatusWebServer.Start()` logs the listener exception at Error level and returns `false`. `OpcUaService` then logs a Warning, disposes the unstarted instance, sets `DashboardStartFailed = true`, and continues in degraded mode — the OPC UA endpoint still starts. Operators can detect the failure by searching the service log for:
```
[WRN] Status dashboard failed to bind on port {Port}; service continues without dashboard
```
Stability review 2026-04-13 Finding 2.
## Component Wiring
`StatusReportService` is initialized after all other service components are created. `OpcUaService.Start()` calls `SetComponents()` to supply the live references, including the historian configuration so the dashboard can label the plugin target and evaluate Rule 2b:
```csharp
StatusReportInstance.SetComponents(
effectiveMxClient,
Metrics,
GalaxyStatsInstance,
ServerHost,
NodeManagerInstance,
_config.Redundancy,
_config.OpcUa.ApplicationUri,
_config.Historian);
```
This deferred wiring allows the report service to be constructed before the MXAccess client or node manager are fully initialized. If a component is `null`, the report service falls back to default values (e.g., `ConnectionState.Disconnected`, zero counts, `HistorianPluginStatus.Disabled`).
The historian plugin status is sourced from `HistorianPluginLoader.LastOutcome`, which is updated on every load attempt. `OpcUaService` explicitly calls `HistorianPluginLoader.MarkDisabled()` when `Historian.Enabled=false` so the dashboard can distinguish "feature off" from "load failed" without ambiguity.
See [`docs/v2/admin-ui.md`](v2/admin-ui.md) for the current operator surface and [`docs/ServiceHosting.md`](ServiceHosting.md) for the three-process layout.

View File

@@ -1,135 +1,60 @@
# Subscriptions
`LmxNodeManager` bridges OPC UA monitored items to MXAccess runtime subscriptions using reference counting and a decoupled dispatch architecture. This design ensures that MXAccess COM callbacks (which run on the STA thread) never contend with the OPC UA framework lock.
Driver-side data-change subscriptions live behind `ISubscribable` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ISubscribable.cs`). The interface is deliberately mechanism-agnostic: it covers native subscriptions (Galaxy MXAccess advisory, OPC UA monitored items on an upstream server, TwinCAT ADS notifications) and driver-internal polled subscriptions (Modbus, AB CIP, S7, FOCAS). Core sees the same event shape regardless — drivers fire `OnDataChange` and Core dispatches to the matching OPC UA monitored items.
## Ref-Counted MXAccess Subscriptions
Multiple OPC UA clients can subscribe to the same Galaxy tag simultaneously. Rather than opening duplicate MXAccess subscriptions, `LmxNodeManager` maintains a reference count per tag in `_subscriptionRefCounts`.
### SubscribeTag
`SubscribeTag` increments the reference count for a tag reference. On the first subscription (count goes from 0 to 1), it calls `_mxAccessClient.SubscribeAsync` to open the MXAccess runtime subscription:
## ISubscribable surface
```csharp
internal void SubscribeTag(string fullTagReference)
{
lock (_lock)
{
if (_subscriptionRefCounts.TryGetValue(fullTagReference, out var count))
_subscriptionRefCounts[fullTagReference] = count + 1;
else
{
_subscriptionRefCounts[fullTagReference] = 1;
_ = _mxAccessClient.SubscribeAsync(fullTagReference, (_, _) => { });
}
}
}
Task<ISubscriptionHandle> SubscribeAsync(
IReadOnlyList<string> fullReferences,
TimeSpan publishingInterval,
CancellationToken cancellationToken);
Task UnsubscribeAsync(ISubscriptionHandle handle, CancellationToken cancellationToken);
event EventHandler<DataChangeEventArgs>? OnDataChange;
```
### UnsubscribeTag
A single `SubscribeAsync` call may batch many attributes and returns an opaque handle the caller passes back to `UnsubscribeAsync`. The driver may emit an immediate `OnDataChange` for each subscribed reference (the OPC UA initial-data convention) and then a push per change.
`UnsubscribeTag` decrements the reference count. When the count reaches zero, the MXAccess subscription is closed via `UnsubscribeAsync` and the tag is removed from the dictionary:
Every subscribe / unsubscribe call goes through `CapabilityInvoker.ExecuteAsync(DriverCapability.Subscribe, host, …)` so the per-host pipeline applies.
```csharp
if (count <= 1)
{
_subscriptionRefCounts.Remove(fullTagReference);
_ = _mxAccessClient.UnsubscribeAsync(fullTagReference);
}
else
_subscriptionRefCounts[fullTagReference] = count - 1;
```
## Reference counting at Core
Both methods use `lock (_lock)` (a private object, distinct from the OPC UA framework `Lock`) to serialize ref-count updates without blocking node value dispatches.
Multiple OPC UA clients can monitor the same variable simultaneously. Rather than open duplicate driver subscriptions, Core maintains a ref-count per `(driver, fullReference)` pair: the first OPC UA monitored-item for a reference triggers `ISubscribable.SubscribeAsync` with that single reference; each additional monitored-item just increments the count; decrement-to-zero triggers `UnsubscribeAsync`. Transferred subscriptions (client reconnect → resume session) replay against the same ref-count map so active driver subscriptions are preserved across session migration.
## OnMonitoredItemCreated
## Threading
The OPC UA framework calls `OnMonitoredItemCreated` when a client creates a monitored item. The override resolves the node handle to a tag reference and calls `SubscribeTag`, which opens the MXAccess subscription early so runtime values start arriving before the first publish cycle:
The STA thread story is now driver-specific, not a server-wide concern:
```csharp
protected override void OnMonitoredItemCreated(ServerSystemContext context,
NodeHandle handle, MonitoredItem monitoredItem)
{
base.OnMonitoredItemCreated(context, handle, monitoredItem);
var nodeIdStr = handle?.NodeId?.Identifier as string;
if (nodeIdStr != null && _nodeIdToTagReference.TryGetValue(nodeIdStr, out var tagRef))
SubscribeTag(tagRef);
}
```
- **Galaxy** runs its MXAccess COM objects on a dedicated STA thread with a Win32 message pump (`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Sta/StaPump.cs`) inside the standalone `Driver.Galaxy.Host` Windows service. The Proxy driver (`Driver.Galaxy.Proxy`) connects to the Host via named pipe and re-exposes the data on a free-threaded surface to Core. Core never touches COM.
- **Modbus / S7 / AB CIP / AB Legacy / TwinCAT / FOCAS** are free-threaded — they run their polling loops on ordinary `Task`s. Their `OnDataChange` fires on thread-pool threads.
- **OPC UA Client** delegates to the OPC Foundation stack's subscription loop.
`OnDeleteMonitoredItemsComplete` performs the inverse, calling `UnsubscribeTag` for each deleted monitored item.
The common contract: drivers are responsible for marshalling from whatever native thread the backend uses onto thread-pool threads before raising `OnDataChange`. Core's dispatch path acquires the OPC UA framework `Lock` and calls `ClearChangeMasks` on the corresponding `BaseDataVariableState` to notify subscribed clients.
## Data Change Dispatch Queue
## Dispatch
MXAccess delivers data change callbacks on the STA thread via the `OnTagValueChanged` event. These callbacks must not acquire the OPC UA framework `Lock` directly because the lock is also held during `Read`/`Write` operations that call into MXAccess (creating a potential deadlock with the STA thread). The solution is a `ConcurrentDictionary<string, Vtq>` named `_pendingDataChanges` that decouples the two threads.
Core's subscription dispatch path:
### Callback handler
1. `ISubscribable.OnDataChange` fires on a thread-pool thread with a `DataChangeEventArgs(subscriptionHandle, fullReference, DataValueSnapshot)`.
2. Core looks up the variable by `fullReference` in the driver's `DriverNodeManager` variable map.
3. Under the OPC UA framework `Lock`, the variable's `Value` / `StatusCode` / `Timestamp` are updated and `ClearChangeMasks(SystemContext, false)` is called.
4. The OPC Foundation stack then enqueues data-change notifications for every monitored-item attached to that variable, honoring each subscription's sampling + filter configuration.
`OnMxAccessDataChange` runs on the STA thread. It stores the latest value in the concurrent dictionary (coalescing rapid updates for the same tag) and signals the dispatch thread:
Batch coalescing — coalescing multiple pushes for the same reference between publish cycles — is done driver-side when the backend natively supports it (Galaxy keeps the v1 coalescing dictionary); otherwise the SDK's own data-change filter suppresses no-change notifications.
```csharp
private void OnMxAccessDataChange(string address, Vtq vtq)
{
Interlocked.Increment(ref _totalMxChangeEvents);
_pendingDataChanges[address] = vtq;
_dataChangeSignal.Set();
}
```
## Initial values
### Dispatch thread architecture
A freshly-built variable carries `StatusCode = BadWaitingForInitialData` until the driver delivers the first value. Drivers whose backends supply an initial read (Galaxy `AdviseSupervisory`, TwinCAT `AddDeviceNotification`) fire `OnDataChange` immediately after `SubscribeAsync` returns. Polled drivers fire the first push when their first poll cycle completes.
A dedicated background thread (`OpcUaDataChangeDispatch`) runs `DispatchLoop`, which waits on an `AutoResetEvent` with a 100ms timeout. The decoupled design exists for two reasons:
## Transferred subscription restoration
1. **Deadlock avoidance** -- The STA thread must not acquire the OPC UA `Lock`. The dispatch thread is a normal background thread that can safely acquire `Lock`.
2. **Batch coalescing** -- Multiple MXAccess callbacks for the same tag between dispatch cycles are collapsed to the latest value via dictionary key overwrite. Under high load, this reduces the number of `ClearChangeMasks` calls.
When an OPC UA session is resumed (client reconnect with `TransferSubscriptions`), Core walks the transferred monitored-items and ensures every referenced `(driver, fullReference)` has a live driver subscription. References already active (in-process migration) skip re-subscribing; references that lost their driver-side handle during the session gap are re-subscribed via `SubscribeAsync`.
The dispatch loop processes changes in two phases:
## Key source files
**Phase 1 (outside Lock):** Drain keys from `_pendingDataChanges`, convert each `Vtq` to a `DataValue` via `CreatePublishedDataValue`, and collect alarm transition events. MXAccess reads for alarm Priority and DescAttrName values also happen in this phase, since they call back into the STA thread.
**Phase 2 (inside Lock):** Apply all prepared updates to variable nodes and call `ClearChangeMasks` on each to trigger OPC UA data change notifications. Alarm events are reported in this same lock scope.
```csharp
lock (Lock)
{
foreach (var (variable, dataValue) in updates)
{
variable.Value = dataValue.Value;
variable.StatusCode = dataValue.StatusCode;
variable.Timestamp = dataValue.SourceTimestamp;
variable.ClearChangeMasks(SystemContext, false);
}
}
```
### ClearChangeMasks
`ClearChangeMasks(SystemContext, false)` is the mechanism that notifies the OPC UA framework a node's value has changed. The framework uses change masks internally to track which nodes have pending notifications for active monitored items. Calling this method causes the server to enqueue data change notifications for all monitoring clients of that node. The `false` parameter indicates that child nodes should not be recursively cleared.
## Transferred Subscription Restoration
When OPC UA sessions are transferred (e.g., client reconnects and resumes a previous session), the framework calls `OnMonitoredItemsTransferred`. The override collects the tag references for all transferred items and calls `RestoreTransferredSubscriptions`.
`RestoreTransferredSubscriptions` groups the tag references by count and, for each tag that does not already have an active ref-count entry, opens a new MXAccess subscription and sets the initial reference count:
```csharp
internal void RestoreTransferredSubscriptions(IEnumerable<string> fullTagReferences)
{
var transferredCounts = fullTagReferences
.GroupBy(tagRef => tagRef, StringComparer.OrdinalIgnoreCase)
.ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);
foreach (var kvp in transferredCounts)
{
lock (_lock)
{
if (_subscriptionRefCounts.ContainsKey(kvp.Key))
continue;
_subscriptionRefCounts[kvp.Key] = kvp.Value;
}
_ = _mxAccessClient.SubscribeAsync(kvp.Key, (_, _) => { });
}
}
```
Tags that already have in-memory bookkeeping are skipped to avoid double-counting when the transfer happens within the same server process (normal in-process session migration).
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ISubscribable.cs` — capability contract
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs` — pipeline wrapping
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Sta/StaPump.cs` — Galaxy STA thread + message pump
- Per-driver subscribe implementations in each `Driver.*` project

View File

@@ -1,6 +1,19 @@
# Galaxy Repository
# Galaxy Repository — Tag Discovery for the Galaxy Driver
`GalaxyRepositoryService` reads the Galaxy object hierarchy and attribute metadata from the System Platform Galaxy Repository SQL Server database. This data drives the construction of the OPC UA address space.
`GalaxyRepositoryService` reads the Galaxy object hierarchy and attribute metadata from the System Platform Galaxy Repository SQL Server database. It is the Galaxy driver's implementation of **`ITagDiscovery.DiscoverAsync`** — every driver has its own discovery source, and the Galaxy driver's is a direct SQL query against the Galaxy Repository (the `ZB` database). Other drivers use completely different mechanisms:
| Driver | `ITagDiscovery` source |
|--------|------------------------|
| Galaxy | ZB SQL hierarchy + attribute queries (this doc) |
| AB CIP | `@tags` walker against the PLC controller |
| AB Legacy | Data-table scan via PCCC `LogicalRead` on the PLC |
| TwinCAT | Beckhoff `SymbolLoaderFactory` — uploads the full symbol tree from the ADS runtime |
| S7 | Config-DB enumeration (no native symbol upload for S7comm) |
| Modbus | Config-DB enumeration (flat register map, user-authored) |
| FOCAS | CNC queries (`cnc_rdaxisname`, `cnc_rdmacroinfo`, …) + optional Config-DB overlays |
| OPC UA Client | `Session.Browse` against the remote server |
`GalaxyRepositoryService` lives in `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/` — Host-side, .NET Framework 4.8 x86, same process that owns the MXAccess COM objects. The Proxy forwards discovery over IPC the same way it forwards reads and writes.
## Connection Configuration
@@ -19,7 +32,7 @@ The connection uses Windows Authentication because the Galaxy Repository databas
## SQL Queries
All queries are embedded as `const string` fields in `GalaxyRepositoryService`. No dynamic SQL is used.
All queries are embedded as `const string` fields in `GalaxyRepositoryService`. No dynamic SQL is used. Project convention `GR-006` requires `const string` SQL queries; any new query must be added as a named constant rather than built at runtime.
### Hierarchy query
@@ -31,9 +44,9 @@ Returns deployed Galaxy objects with their parent relationships, browse names, a
- Marks objects with `category_id = 13` as areas
- Filters to `is_template = 0` (instances only, not templates)
- Filters to `deployed_package_id <> 0` (deployed objects only)
- Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter).
- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `<Host>.ScanState` probe advised. Also used by `LmxNodeManager.BuildHostedVariablesMap` to identify Platform/Engine ancestors during the hosted-variables walk.
- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different — an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. The node manager walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [MXAccess Bridge — Per-Host Runtime Status Probes](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate).
- Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](../AlarmTracking.md#template-based-alarm-object-filter).
- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `<Host>.ScanState` probe advised. Also used during the hosted-variables walk to identify Platform/Engine ancestors.
- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different — an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. The driver walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [Galaxy driver — Per-Host Runtime Status Probes](Galaxy.md#per-host-runtime-status-probes-hostscanstate).
### Attributes query (standard)
@@ -53,8 +66,8 @@ Returns user-defined dynamic attributes for deployed objects:
When `ExtendedAttributes = true`, a more comprehensive query runs that unions two sources:
1. **Primitive attributes** -- Joins through `primitive_instance` and `attribute_definition` to include system-level attributes from primitive components. Each attribute carries its `primitive_name` so the address space can group them under their parent variable.
2. **Dynamic attributes** -- The same CTE-based query as the standard path, with an empty `primitive_name`.
1. **Primitive attributes** Joins through `primitive_instance` and `attribute_definition` to include system-level attributes from primitive components. Each attribute carries its `primitive_name` so the address space can group them under their parent variable.
2. **Dynamic attributes** The same CTE-based query as the standard path, with an empty `primitive_name`.
The `full_tag_reference` for primitive attributes follows the pattern `tag_name.primitive_name.attribute_name` (e.g., `TestMachine_001.AlarmAttr.InAlarm`).
@@ -66,10 +79,10 @@ A single-column query: `SELECT time_of_last_deploy FROM galaxy`. The `galaxy` ta
The Galaxy maintains two package references for each object:
- `checked_in_package_id` -- The latest saved version, which may include undeployed configuration changes
- `deployed_package_id` -- The version currently running on the target platform
- `checked_in_package_id` — the latest saved version, which may include undeployed configuration changes
- `deployed_package_id` — the version currently running on the target platform
The queries filter on `deployed_package_id <> 0` because the OPC UA server must mirror what is actually running in the Galaxy runtime. Using `checked_in_package_id` would expose attributes and objects that exist in the IDE but have not been deployed, causing mismatches between the OPC UA address space and the MXAccess runtime.
The queries filter on `deployed_package_id <> 0` because the OPC UA address space must mirror what is actually running in the Galaxy runtime. Using `checked_in_package_id` would expose attributes and objects that exist in the IDE but have not been deployed, causing mismatches between the OPC UA address space and the MXAccess runtime.
## Platform Scope Filter
@@ -77,21 +90,16 @@ When `Scope` is set to `LocalPlatform`, the repository applies a post-query C# f
### How it works
1. **Platform lookup** -- A separate `const string` SQL query (`PlatformLookupSql`) reads `platform_gobject_id` and `node_name` from the `platform` table for all deployed platforms. This runs once per hierarchy load.
2. **Platform matching** -- The configured `PlatformName` (or `Environment.MachineName` when null) is matched case-insensitively against the `node_name` column. If no match is found, a warning is logged listing the available platforms, and the address space is empty.
3. **Host chain collection** -- The filter collects the matching platform's `gobject_id`, then iterates the hierarchy to find all `$AppEngine` (category 3) objects whose `HostedByGobjectId` equals the platform. This produces the full set of host gobject_ids under the local platform.
4. **Object inclusion** -- All non-area objects whose `HostedByGobjectId` is in the host set are included, along with the hosts themselves.
5. **Area retention** -- `ParentGobjectId` chains are walked upward from included objects to pull in ancestor areas, keeping the browse tree connected. Areas that contain no local descendants are excluded.
6. **Attribute filtering** -- The set of included `gobject_id` values is cached after `GetHierarchyAsync` and reused by `GetAttributesAsync` to filter attributes to the same scope.
1. **Platform lookup** A separate `const string` SQL query (`PlatformLookupSql`) reads `platform_gobject_id` and `node_name` from the `platform` table for all deployed platforms. This runs once per hierarchy load.
2. **Platform matching** — The configured `PlatformName` (or `Environment.MachineName` when null) is matched case-insensitively against the `node_name` column. If no match is found, a warning is logged listing the available platforms and the address space is empty.
3. **Host chain collection** The filter collects the matching platform's `gobject_id`, then iterates the hierarchy to find all `$AppEngine` (category 3) objects whose `HostedByGobjectId` equals the platform. This produces the full set of host gobject_ids under the local platform.
4. **Object inclusion** — All non-area objects whose `HostedByGobjectId` is in the host set are included, along with the hosts themselves.
5. **Area retention**`ParentGobjectId` chains are walked upward from included objects to pull in ancestor areas, keeping the browse tree connected. Areas that contain no local descendants are excluded.
6. **Attribute filtering** — The set of included `gobject_id` values is cached after `GetHierarchyAsync` and reused by `GetAttributesAsync` to filter attributes to the same scope.
### Design rationale
The filter is applied in C# rather than SQL because the project convention `GR-006` requires `const string` SQL queries with no dynamic SQL. The hierarchy query already returns `HostedByGobjectId` and `CategoryId` on every row, so all information needed for filtering is already in memory after the query runs. The only new SQL is the lightweight platform lookup query.
The filter is applied in C# rather than SQL because project convention `GR-006` requires `const string` SQL queries with no dynamic SQL. The hierarchy query already returns `HostedByGobjectId` and `CategoryId` on every row, so all information needed for filtering is already in memory after the query runs. The only new SQL is the lightweight platform lookup query.
### Configuration
@@ -102,7 +110,7 @@ The filter is applied in C# rather than SQL because the project convention `GR-0
}
```
- Set `Scope` to `"LocalPlatform"` to enable filtering. Default is `"Galaxy"` (load everything, backward compatible).
- Set `Scope` to `"LocalPlatform"` to enable filtering. Default is `"Galaxy"` (load everything).
- Set `PlatformName` to an explicit hostname to target a specific platform, or leave null to use the local machine name.
### Startup log
@@ -119,25 +127,26 @@ GetAttributesAsync returned 4206 attributes (extended=true)
Scope filter retained 2100 of 4206 attributes
```
## Change Detection Polling
## Change Detection Polling and IRediscoverable
`ChangeDetectionService` runs a background polling loop that calls `GetLastDeployTimeAsync` at the configured interval. It compares the returned timestamp against the last known value:
`ChangeDetectionService` runs a background polling loop in the Host process that calls `GetLastDeployTimeAsync` at the configured interval. It compares the returned timestamp against the last known value:
- On the first poll (no previous state), the timestamp is recorded and `OnGalaxyChanged` fires unconditionally
- On subsequent polls, `OnGalaxyChanged` fires only when `time_of_last_deploy` differs from the cached value
When the event fires, the host service queries fresh hierarchy and attribute data from the repository and calls `LmxNodeManager.RebuildAddressSpace` (which delegates to incremental `SyncAddressSpace`).
When the event fires, the Host re-runs the hierarchy and attribute queries and pushes the result back to the Server via an IPC `RediscoveryNeeded` message. That surfaces on `GalaxyProxyDriver` as the **`IRediscoverable.OnRediscoveryNeeded`** event; the Server's `DriverNodeManager` consumes it and calls `SyncAddressSpace` to compute the diff against the live address space.
The polling approach is used because the Galaxy Repository database does not provide change notifications. The `galaxy.time_of_last_deploy` column updates only on completed deployments, so the polling interval controls how quickly the OPC UA address space reflects Galaxy changes.
## TestConnection
`TestConnectionAsync` runs `SELECT 1` against the configured database. This is used at service startup to verify connectivity before attempting the full hierarchy query.
`TestConnectionAsync` runs `SELECT 1` against the configured database. This is used at Host startup to verify connectivity before attempting the full hierarchy query.
## Key source files
- `src/ZB.MOM.WW.OtOpcUa.Host/GalaxyRepository/GalaxyRepositoryService.cs` -- SQL queries and data access
- `src/ZB.MOM.WW.OtOpcUa.Host/GalaxyRepository/PlatformScopeFilter.cs` -- Platform-based hierarchy and attribute filtering
- `src/ZB.MOM.WW.OtOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs` -- Deploy timestamp polling loop
- `src/ZB.MOM.WW.OtOpcUa.Host/Configuration/GalaxyRepositoryConfiguration.cs` -- Connection, polling, and scope settings
- `src/ZB.MOM.WW.OtOpcUa.Host/Domain/PlatformInfo.cs` -- Platform-to-hostname DTO
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/GalaxyRepositoryService.cs` SQL queries and data access
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/PlatformScopeFilter.cs` Platform-based hierarchy and attribute filtering
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/GalaxyRepository/ChangeDetectionService.cs` Deploy timestamp polling loop
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Configuration/GalaxyRepositoryConfiguration.cs` Connection, polling, and scope settings
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Domain/PlatformInfo.cs` Platform-to-hostname DTO
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/Contracts/DiscoveryResponse.cs` — IPC DTO the Host uses to return hierarchy + attribute results across the pipe

211
docs/drivers/Galaxy.md Normal file
View File

@@ -0,0 +1,211 @@
# Galaxy Driver
The Galaxy driver bridges OtOpcUa to AVEVA System Platform (Wonderware) Galaxies through the `ArchestrA.MxAccess` COM API plus the Galaxy Repository SQL database. It is one driver of seven in the OtOpcUa platform (see [drivers/README.md](README.md) for the full list); all other drivers run in-process in the main Server (.NET 10 x64). Galaxy is the exception — it runs as its own Windows service and talks to the Server over a local named pipe.
For the decision record on why Galaxy is out-of-process and how the refactor was staged, see [docs/v2/plan.md §4 Galaxy/MXAccess as Out-of-Process Driver](../v2/plan.md). For the full driver spec (addressing, data-type map, config shape), see [docs/v2/driver-specs.md §1](../v2/driver-specs.md).
## Project Split
Galaxy ships as three projects:
| Project | Target | Role |
|---------|--------|------|
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/` | .NET Standard 2.0 | IPC contracts (MessagePack records + `MessageKind` enum) referenced by both sides |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` | .NET Framework 4.8 **x86** | Separate Windows service hosting the MXAccess COM objects, STA thread + Win32 message pump, Galaxy Repository reader, Historian SDK, runtime-probe manager |
| `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/` | .NET 10 (matches Server) | `GalaxyProxyDriver : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe` — loaded in-process by the Server; every call forwards over the pipe to the Host |
The Shared assembly is the **only** contract between the two runtimes. It carries no COM or SDK references so Proxy (net10) can reference it without dragging x86 code into the Server process.
## Why Out-of-Process
Two reasons drive the split, per `docs/v2/plan.md`:
1. **Bitness constraint.** MXAccess is 32-bit COM only — `ArchestrA.MxAccess.dll` in `Program Files (x86)\ArchestrA\Framework\bin` has no 64-bit variant. The main OtOpcUa Server is .NET 10 x64 (the OPC Foundation stack, SqlClient, and every other non-Galaxy driver target 64-bit). In-process hosting would force the whole Server to x86, which every other driver project would then inherit.
2. **Tier-C stability isolation.** Galaxy is classified Tier C in [docs/v2/driver-stability.md](../v2/driver-stability.md) — the COM runtime, STA thread, Aveva Historian SDK, and SQL queries all have crash/hang modes that can take down the hosting process. Isolating the driver in its own Windows service means a COM deadlock, AccessViolation in an unmanaged Historian DLL, or a runaway SQL query never takes the Server endpoint down. The Proxy-side supervisor restarts the Host with crash-loop circuit-breaker.
The same Tier-C isolation story applies to FOCAS (decision record in `docs/v2/plan.md` §7), which is the second out-of-process driver.
## IPC Transport
`GalaxyProxyDriver``GalaxyIpcClient` → named pipe → `Galaxy.Host` pipe server.
- Pipe name: `otopcua-galaxy-{DriverInstanceId}` (localhost-only, no TCP surface)
- Wire format: MessagePack-CSharp, length-prefixed frames
- ACL: pipe is created with a DACL that grants only the Server's service identity; the Admins group is explicitly denied so a live-smoke test running from an elevated shell fails fast rather than silently bypassing the handshake
- Handshake: Proxy presents a shared secret at `OpenSessionRequest`; Host rejects anything else with `MessageKind.OpenSessionResponse{Success=false}`
- Heartbeat: Proxy sends a periodic ping; missed heartbeats trigger the Proxy-side crash-loop supervisor to restart the Host
Every capability call on `GalaxyProxyDriver` (Read, Write, Subscribe, HistoryRead*, etc.) serializes a `*Request`, awaits the matching `*Response` via a `CallAsync<TReq, TResp>` helper, and rehydrates the result into the `Core.Abstractions` shape the Server expects.
## STA Thread Requirement (Host-side)
MXAccess COM objects — `LMXProxyServer` instantiation, `Register`, `AddItem`, `AdviseSupervisory`, `Write`, and cleanup calls — must all execute on the same Single-Threaded Apartment. Calling a COM object from the wrong thread causes marshalling failures or silent data corruption.
`StaComThread` in the Host provides that thread with the apartment state set before the thread starts:
```csharp
_thread = new Thread(ThreadEntry) { Name = "MxAccess-STA", IsBackground = true };
_thread.SetApartmentState(ApartmentState.STA);
```
Work items queue via `RunAsync(Action)` or `RunAsync<T>(Func<T>)` into a `ConcurrentQueue<Action>` and post `WM_APP` to wake the pump. Each work item is wrapped in a `TaskCompletionSource` so callers can `await` the result from any thread — including the IPC handler thread that receives the inbound pipe request.
## Win32 Message Pump (Host-side)
COM callbacks (`OnDataChange`, `OnWriteComplete`) are delivered through the Windows message loop. `StaComThread` runs a standard Win32 message pump via P/Invoke:
1. `PeekMessage` primes the message queue (required before `PostThreadMessage` works)
2. `GetMessage` blocks until a message arrives
3. `WM_APP` drains the work queue
4. `WM_APP + 1` drains the queue and posts `WM_QUIT` to exit the loop
5. All other messages go through `TranslateMessage` / `DispatchMessage` for COM callback delivery
Without this pump MXAccess callbacks never fire and the driver delivers no live data.
## LMXProxyServer COM Object
`MxProxyAdapter` wraps the real `ArchestrA.MxAccess.LMXProxyServer` COM object behind the `IMxProxy` interface so Host unit tests can substitute a fake proxy without requiring the ArchestrA runtime. Lifecycle:
1. **`Register(clientName)`** — Creates a new `LMXProxyServer` instance, wires up `OnDataChange` and `OnWriteComplete` event handlers, calls `Register` to obtain a connection handle
2. **`Unregister(handle)`** — Unwires event handlers, calls `Unregister`, releases the COM object via `Marshal.ReleaseComObject`
## Register / AddItem / AdviseSupervisory Pattern
Every MXAccess data operation follows a three-step pattern, all executed on the STA thread:
1. **`AddItem(handle, address)`** — Resolves a Galaxy tag reference (e.g., `TestMachine_001.MachineID`) to an integer item handle
2. **`AdviseSupervisory(handle, itemHandle)`** — Subscribes the item for supervisory data-change callbacks
3. The runtime begins delivering `OnDataChange` events
For writes, after `AddItem` + `AdviseSupervisory`, `Write(handle, itemHandle, value, securityClassification)` sends the value; `OnWriteComplete` confirms or rejects. Cleanup reverses: `UnAdviseSupervisory` then `RemoveItem`.
## OnDataChange and OnWriteComplete Callbacks
### OnDataChange
Fired by the COM runtime on the STA thread when a subscribed tag changes. The handler in `MxAccessClient.EventHandlers.cs`:
1. Maps the integer `phItemHandle` back to a tag address via `_handleToAddress`
2. Maps the MXAccess quality code to the internal `Quality` enum
3. Checks `MXSTATUS_PROXY` for error details and adjusts quality
4. Converts the timestamp to UTC
5. Constructs a `Vtq` (Value/Timestamp/Quality) and delivers it to:
- The stored per-tag subscription callback
- Any pending one-shot read completions
- The global `OnTagValueChanged` event (consumed by the Host's subscription dispatcher, which packages changes into `DataChangeEventArgs` and forwards them over the pipe to `GalaxyProxyDriver.OnDataChange`)
### OnWriteComplete
Fired when the runtime acknowledges or rejects a write. The handler resolves the pending `TaskCompletionSource<bool>` for the item handle. If `MXSTATUS_PROXY.success == 0` the write is considered failed and the error detail is logged.
## Reconnection Logic
`MxAccessClient` implements automatic reconnection through two mechanisms.
### Monitor loop
`StartMonitor` launches a background task that polls at `MonitorIntervalSeconds`. On each cycle:
- If the state is `Disconnected` or `Error` and `AutoReconnect` is enabled, it calls `ReconnectAsync`
- If connected and a probe tag is configured, it checks the probe staleness threshold
### Reconnect sequence
`ReconnectAsync` performs a full disconnect-then-connect cycle:
1. Increment the reconnect counter
2. `DisconnectAsync` — tear down all active subscriptions (`UnAdviseSupervisory` + `RemoveItem` for each), detach COM event handlers, call `Unregister`, clear all handle mappings
3. `ConnectAsync` — create a fresh `LMXProxyServer`, register, replay all stored subscriptions, re-subscribe the probe tag
Stored subscriptions (`_storedSubscriptions`) persist across reconnects. `ReplayStoredSubscriptionsAsync` iterates the stored entries and calls `AddItem` + `AdviseSupervisory` for each.
## Probe Tag Health Monitoring
A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves as a connection health indicator. After connecting, the client subscribes to the probe tag and records `_lastProbeValueTime` on every `OnDataChange`. The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`; if the probe has not updated within the window, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
## Per-Host Runtime Status Probes (`<Host>.ScanState`)
Separate from the connection-level probe, the driver advises `<HostName>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the Admin UI dashboard can report "this specific Platform / AppEngine is off scan" and the driver can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MXAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](../Configuration.md#mxaccess) for the two config fields.
### How it works
`GalaxyRuntimeProbeManager` lives in `Driver.Galaxy.Host` alongside the rest of the MXAccess code. It is owned by the Host's subscription dispatcher and runs a three-state machine per host (Unknown / Running / Stopped):
1. **Discovery** — After the Host completes `BuildAddressSpace`, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are driver-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors) means **Stopped**.
3. **On-change-only delivery**`ScanState` is delivered only when the value actually changes. A stably Running host may go hours without a callback. `Tick()` does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown`. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped".
5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Stability review 2026-04-13 Finding 1.
### Subtree quality invalidation on transition
When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. **Stopped → Running** calls `ClearHostVariablesBadQuality` to reset each to `Good` so the next on-change MXAccess update repopulates the value.
The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform lands in both the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
### Read-path short-circuit (`IsTagUnderStoppedHost`)
The Host's Read handler checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]``GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MXAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MXAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
### Deferred dispatch — the STA deadlock
**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the subscription dispatcher lock, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern.
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — outside any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural lock acquisition. No circular wait, no STA involvement.
### Dashboard and health surface
- Admin UI **Galaxy Runtime** panel shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MXAccess transport disconnected)
- `HealthCheckService.CheckHealth` rolls overall driver health to `Degraded` when any host is Stopped
See [Status Dashboard](../StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](../Configuration.md#mxaccess) for the config fields.
## Request Timeout Safety Backstop
Every sync-over-async site on the OPC UA stack thread that calls into Galaxy (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). Inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path are the first line of defense; the outer wrapper is a backstop so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because Galaxy IPC clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.
`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
All capability calls at the Server dispatch layer are additionally wrapped by `CapabilityInvoker` (Core/Resilience/) which runs them through a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. `OTOPCUA0001` analyzer enforces the wrap at build time.
## Why Marshal.ReleaseComObject Is Needed
The .NET Framework runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately drive the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
## Tag Discovery and Historical Data
Tag discovery (the Galaxy Repository SQL reader + `LocalPlatform` scope filter) is covered in [Galaxy-Repository.md](Galaxy-Repository.md). The Galaxy driver is `ITagDiscovery` for the Server's bootstrap path and `IRediscoverable` for the on-change-redeploy path.
Historical data access (raw, processed, at-time, events) runs against the Aveva Historian via the `aahClientManaged` SDK and is exposed through the Galaxy driver's `IHistoryProvider` implementation. See [HistoricalDataAccess.md](../HistoricalDataAccess.md).
## Key source files
Host-side (`.NET 4.8 x86`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/`):
- `Backend/MxAccess/StaComThread.cs` — STA thread and Win32 message pump
- `Backend/MxAccess/MxAccessClient.cs` — Core client (partial)
- `Backend/MxAccess/MxAccessClient.Connection.cs` — Connect / disconnect / reconnect
- `Backend/MxAccess/MxAccessClient.Subscription.cs` — Subscribe / unsubscribe / replay
- `Backend/MxAccess/MxAccessClient.ReadWrite.cs` — Read and write operations
- `Backend/MxAccess/MxAccessClient.EventHandlers.cs``OnDataChange` / `OnWriteComplete` handlers
- `Backend/MxAccess/MxAccessClient.Monitor.cs` — Background health monitor
- `Backend/MxAccess/MxProxyAdapter.cs` — COM object wrapper
- `Backend/MxAccess/GalaxyRuntimeProbeManager.cs` — Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `Backend/Historian/HistorianDataSource.cs``aahClientManaged` SDK wrapper (see [HistoricalDataAccess.md](../HistoricalDataAccess.md))
- `Ipc/GalaxyIpcServer.cs` — Named-pipe server, message dispatch
- `Domain/IMxAccessClient.cs` — Client interface
Shared (`.NET Standard 2.0`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared/`):
- `Contracts/MessageKind.cs` — IPC message kinds (`ReadRequest`, `HistoryReadRequest`, `OpenSessionResponse`, …)
- `Contracts/*.cs` — MessagePack DTOs for every request/response pair
Proxy-side (`.NET 10`, `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/`):
- `GalaxyProxyDriver.cs``IDriver`/`ITagDiscovery`/`IReadable`/`IWritable`/`ISubscribable`/`IAlarmSource`/`IHistoryProvider`/`IRediscoverable`/`IHostConnectivityProbe` implementation; every method forwards via `GalaxyIpcClient`
- `Ipc/GalaxyIpcClient.cs` — Named-pipe client, `CallAsync<TReq, TResp>`, reconnect on broken pipe
- `GalaxyProxySupervisor.cs` — Host-process monitor, crash-loop circuit-breaker, Host relaunch

46
docs/drivers/README.md Normal file
View File

@@ -0,0 +1,46 @@
# Drivers
OtOpcUa is a multi-driver OPC UA server. The Core (`ZB.MOM.WW.OtOpcUa.Core` + `Core.Abstractions` + `Server`) owns the OPC UA stack, address space, session/security/subscription machinery, resilience pipeline, and namespace kinds (Equipment + SystemPlatform). Drivers plug in through **capability interfaces** defined in `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/`:
- `IDriver` — lifecycle (`InitializeAsync`, `ReinitializeAsync`, `ShutdownAsync`, `GetHealth`)
- `IReadable` / `IWritable` — one-shot reads and writes
- `ITagDiscovery` — address-space enumeration
- `ISubscribable` — driver-pushed data-change streams
- `IHostConnectivityProbe` — per-host reachability events
- `IPerCallHostResolver` — multi-host drivers that route each call to a target endpoint at dispatch time
- `IAlarmSource` — driver-emitted OPC UA A&C events
- `IHistoryProvider` — raw / processed / at-time / events HistoryRead (see [HistoricalDataAccess.md](../HistoricalDataAccess.md))
- `IRediscoverable` — driver-initiated address-space rebuild notifications
Each driver opts into only the capabilities it supports. Every async capability call at the Server dispatch layer goes through `CapabilityInvoker` (`Core/Resilience/CapabilityInvoker.cs`), which wraps it in a Polly pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. The `OTOPCUA0001` analyzer enforces the wrap at build time. Drivers themselves never depend on Polly; they just implement the capability interface and let the Core wrap it.
Driver type metadata is registered at startup in `DriverTypeRegistry` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverTypeRegistry.cs`). The registry records each type's allowed namespace kinds (`Equipment` / `SystemPlatform` / `Simulated`), its JSON Schema for `DriverConfig` / `DeviceConfig` / `TagConfig` columns, and its stability tier per [docs/v2/driver-stability.md](../v2/driver-stability.md).
## Ground-truth driver list
| Driver | Project path | Tier | Wire / library | Capabilities | Notable quirk |
|--------|--------------|:----:|----------------|--------------|---------------|
| [Galaxy](Galaxy.md) | `Driver.Galaxy.{Shared, Host, Proxy}` | C | MXAccess COM + `aahClientManaged` + SqlClient | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe | Out-of-process — Host is its own Windows service (.NET 4.8 x86 for the COM bitness constraint); Proxy talks to Host over a named pipe |
| Modbus TCP | `Driver.Modbus` | A | NModbus-derived in-house client | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe | Polled subscriptions via the shared `PollGroupEngine`. DL205 PLCs are covered by `AddressFormat=DL205` (octal V/X/Y/C/T/CT translation) — no separate driver |
| Siemens S7 | `Driver.S7` | A | [S7netplus](https://github.com/S7NetPlus/s7netplus) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe | Single S7netplus `Plc` instance per PLC serialized with `SemaphoreSlim` — the S7 CPU's comm mailbox is scanned at most once per cycle, so parallel reads don't help |
| AB CIP | `Driver.AbCip` | A | libplctag CIP | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | ControlLogix / CompactLogix. Tag discovery uses the `@tags` walker to enumerate controller-scoped + program-scoped symbols; UDT member resolution via the UDT template reader |
| AB Legacy | `Driver.AbLegacy` | A | libplctag PCCC | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | SLC 500 / MicroLogix. File-based addressing (`N7:0`, `F8:0`) — no symbol table, tag list is user-authored in the config DB |
| TwinCAT | `Driver.TwinCAT` | B | Beckhoff `TwinCAT.Ads` (`TcAdsClient`) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | The only native-notification driver outside Galaxy — ADS delivers `ValueChangedCallback` events the driver forwards straight to `ISubscribable.OnDataChange` without polling. Symbol tree uploaded via `SymbolLoaderFactory` |
| FOCAS | `Driver.FOCAS` | C | FANUC FOCAS2 (`Fwlib32.dll` P/Invoke) | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IHostConnectivityProbe, IPerCallHostResolver | Tier C — FOCAS DLL has crash modes that warrant process isolation. CNC-shaped data model (axes, spindle, PMC, macros, alarms) not a flat tag map |
| OPC UA Client | `Driver.OpcUaClient` | B | OPCFoundation `Opc.Ua.Client` | IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IHostConnectivityProbe | Gateway/aggregation driver. Opens a single `Session` against a remote OPC UA server and re-exposes its address space. Owns its own `ApplicationConfiguration` (distinct from `Client.Shared`) because it's always-on with keep-alive + `TransferSubscriptions` across SDK reconnect, not an interactive CLI |
## Per-driver documentation
- **Galaxy** has its own docs in this folder because the out-of-process architecture + MXAccess COM rules + Galaxy Repository SQL + Historian + runtime probe manager don't fit a single table row:
- [Galaxy.md](Galaxy.md) — COM bridge, STA pump, IPC, runtime probes
- [Galaxy-Repository.md](Galaxy-Repository.md) — ZB SQL reader, `LocalPlatform` scope filter, change detection
- **All other drivers** share a single per-driver specification in [docs/v2/driver-specs.md](../v2/driver-specs.md) — addressing, data-type maps, connection settings, and quirks live there. That file is the authoritative per-driver reference; this index points at it rather than duplicating.
## Related cross-driver docs
- [HistoricalDataAccess.md](../HistoricalDataAccess.md) — `IHistoryProvider` dispatch, aggregate mapping, continuation points. The Galaxy driver's Aveva Historian implementation is the first; OPC UA Client forwards to the upstream server; other drivers do not implement the interface and return `BadHistoryOperationUnsupported`.
- [AlarmTracking.md](../AlarmTracking.md) — `IAlarmSource` event model and filtering.
- [Subscriptions.md](../Subscriptions.md) — how the Server multiplexes subscriptions onto `ISubscribable.OnDataChange`.
- [docs/v2/driver-stability.md](../v2/driver-stability.md) — tier system (A / B / C), shared `CapabilityPolicy` defaults per tier × capability, `MemoryTracking` hybrid formula, and process-level recycle rules.
- [docs/v2/plan.md](../v2/plan.md) — authoritative vision, architecture decisions, migration strategy.

View File

@@ -1,8 +1,10 @@
# OPC UA Client Requirements
## Overview
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). The Client surface (shared library + CLI + UI) shipped for v2 is preserved; this refresh restructures the document into numbered, directly-verifiable requirements (CLI-* and UI-* prefixes) layered on top of the existing detailed design content. Requirement coverage added for the `redundancy` command, alarm subscribe/ack round-trip, history-read, and UI tree-browser drag-to-subscribe behaviors. Original design-spec material for `ConnectionSettings`, `IOpcUaClientService`, models, and view-models is retained as reference-level details below the numbered requirements.
Three new .NET 10 cross-platform projects providing a shared OPC UA client library, a CLI tool, and an Avalonia desktop UI. All projects target Windows and macOS.
Parent: [HLR-001](HighLevelReqs.md#hlr-001-opc-ua-server), [HLR-009](HighLevelReqs.md#hlr-009-transport-security-and-authentication), [HLR-013](HighLevelReqs.md#hlr-013-cluster-redundancy)
See also: `docs/Client.CLI.md`, `docs/Client.UI.md`.
## Projects
@@ -10,134 +12,161 @@ Three new .NET 10 cross-platform projects providing a shared OPC UA client libra
|---------|------|---------|
| `ZB.MOM.WW.OtOpcUa.Client.Shared` | Class library | Core OPC UA client, models, interfaces |
| `ZB.MOM.WW.OtOpcUa.Client.CLI` | Console app | Command-line interface using CliFx |
| `ZB.MOM.WW.OtOpcUa.Client.UI` | Avalonia app | Desktop UI with tree browser, subscriptions, alarms |
| `ZB.MOM.WW.OtOpcUa.Client.Shared.Tests` | Test project | Unit tests for shared library |
| `ZB.MOM.WW.OtOpcUa.Client.CLI.Tests` | Test project | Unit tests for CLI commands |
| `ZB.MOM.WW.OtOpcUa.Client.UI.Tests` | Test project | Unit tests for UI view models |
| `ZB.MOM.WW.OtOpcUa.Client.UI` | Avalonia app | Desktop UI |
| `ZB.MOM.WW.OtOpcUa.Client.Shared.Tests` | Test project | Shared-library unit tests |
| `ZB.MOM.WW.OtOpcUa.Client.CLI.Tests` | Test project | CLI command tests |
| `ZB.MOM.WW.OtOpcUa.Client.UI.Tests` | Test project | ViewModel unit tests |
## Shared Requirements (Client.Shared)
### SHR-001: Single Service Interface
The Client.Shared library shall expose a single service interface `IOpcUaClientService` covering connect, disconnect, read, write, browse, subscribe, alarm-subscribe, alarm-ack, history-read-raw, history-read-aggregate, and get-redundancy-info operations.
### SHR-002: ConnectionSettings Model
The library shall expose a `ConnectionSettings` record with the fields: `EndpointUrl` (required), `FailoverUrls[]`, `Username`, `Password`, `SecurityMode` (None/Sign/SignAndEncrypt; default None), `SessionTimeoutSeconds` (default 60), `AutoAcceptCertificates` (default true), `CertificateStorePath`.
### SHR-003: Automatic Failover
The library shall monitor session keep-alive and automatically fail over across `FailoverUrls` when the primary endpoint is unreachable, emitting a `ConnectionStateChanged` event on each transition (Disconnected / Connecting / Connected / Reconnecting).
### SHR-004: Cross-Platform Certificate Store
The library shall auto-generate a client certificate on first use and store it in a cross-platform path (default `{AppData}/OtOpcUaClient/pki/`). Server certificates are auto-accepted when `AutoAcceptCertificates = true`.
### SHR-005: Type-Coercing Write
The library's `WriteValueAsync(NodeId, object)` shall read the node's current value to determine target type and coerce the input value before sending.
### SHR-006: UI-Thread Dispatch Neutrality
The library shall not assume any specific synchronization context. Events (`DataChanged`, `AlarmEvent`, `ConnectionStateChanged`) are raised on the OPC UA stack thread; the consuming CLI / UI is responsible for dispatching to its UI thread.
---
## CLI Requirements (Client.CLI)
### CLI-001: Command Surface
The CLI shall expose the following commands: `connect`, `read`, `write`, `browse`, `subscribe`, `historyread`, `alarms`, `redundancy`.
### CLI-002: Common Options
All CLI commands shall accept the options `-u, --url` (required), `-U, --username`, `-P, --password`, `-S, --security none|sign|encrypt`, `-F, --failover-urls` (comma-separated), `--verbose`.
### CLI-003: Connect Command
The `connect` command shall attempt to establish a session using the supplied options and print `Connected` plus the resolved endpoint's `ServerUriArray` and `ApplicationUri` on success, or a diagnostic error message on failure.
### CLI-004: Read Command
The `read -n <NodeId>` command shall print `NodeId`, `Value`, `StatusCode`, `SourceTimestamp`, `ServerTimestamp` one per line.
### CLI-005: Write Command
The `write -n <NodeId> -v <value>` command shall coerce the value to the node's current type (per SHR-005) and print the resulting `StatusCode`. A `Bad_UserAccessDenied` result is printed verbatim so operators see the authorization outcome.
### CLI-006: Browse Command
The `browse [-n <parent>] [-r] [-d <depth>]` command shall list child nodes under `parent` (or the `Objects` folder if omitted). `-r` enables recursion up to `-d` depth (default 1).
### CLI-007: Subscribe Command
The `subscribe -n <NodeId> -i <intervalMs>` command shall create a monitored item at `intervalMs` publishing interval, print each `DataChanged` event as `<timestamp> <nodeId> <value> <status>` until Ctrl-C, then cleanly unsubscribe.
### CLI-008: Historyread Command
The `historyread -n <NodeId> --start <utc> --end <utc> [--max <n>] [--aggregate <type> --interval <ms>]` command shall print raw values or aggregate buckets. Supported aggregate types: Average, Minimum, Maximum, Count, Start, End.
### CLI-009: Alarms Command
The `alarms [-n <source>] [-i <intervalMs>]` command shall subscribe to alarm events, print each event as `<time> <source> <condition> <severity> <state> <acked> <message>`, accept `ack <conditionId>` commands interactively, and support `refresh` to trigger `RequestConditionRefreshAsync`.
### CLI-010: Redundancy Command
The `redundancy` command shall call `GetRedundancyInfoAsync` and print `Mode`, `ServiceLevel`, `ApplicationUri`, and `ServerUris` (one per line). Suitable for redundancy-failover smoke tests.
### CLI-011: Logging
The CLI shall use Serilog console sink at `Warning` minimum by default; `--verbose` raises to `Debug`.
---
## UI Requirements (Client.UI)
### UI-001: Connection Panel
The UI shall present a top-bar connection panel with fields for Endpoint URL, Username, Password, Security mode, and a Connect / Disconnect button. The resolved `RedundancyInfo` is displayed next to the bar on successful connect.
### UI-002: Tree Browser
The UI shall present a left-pane tree browser backed by `IOpcUaClientService.BrowseAsync`, lazy-loading children on node expansion (one level per `BrowseAsync` call).
### UI-003: Read/Write Tab
The UI shall provide a Read/Write tab that auto-reads the selected tree node's current value, displays `Value` + `StatusCode` + `SourceTimestamp`, and accepts a write value with a Send button.
### UI-004: Subscriptions Tab
The UI shall provide a Subscriptions tab that lists active monitored items (columns: NodeId, Value, Status, Timestamp), supports Add and Remove, and dispatches `DataChanged` events to the Avalonia UI thread via `Dispatcher.UIThread.Post`.
### UI-005: Alarms Tab
The UI shall provide an Alarms tab that supports SubscribeAlarms / UnsubscribeAlarms / RefreshConditions commands, displays live alarm events, and supports `Acknowledge` on selected events. Acknowledgment failure (including `Bad_UserAccessDenied`) is surfaced to the user.
### UI-006: History Tab
The UI shall provide a History tab with inputs for StartTime, EndTime, MaxValues, AggregateType, Interval, a Read command, and a results table with columns (Timestamp, Value, Status).
### UI-007: Connection State Reflects in UI
All tabs shall reflect the connection state — when disconnected, all action commands are disabled; the status bar shows `Disconnected` / `Connecting` / `Connected` / `Reconnecting` tied to the `ConnectionStateChanged` event.
### UI-008: Cross-Platform
The UI shall build and run on Windows (win-x64) and macOS (osx-arm64 / osx-x64). No platform-specific OPC UA stack APIs are used.
---
## Technology Stack
- .NET 10, C#
- OPC UA: OPCFoundation.NetStandard.Opc.Ua.Client
- OPC UA: `OPCFoundation.NetStandard.Opc.Ua.Client`
- Logging: Serilog
- CLI: CliFx
- UI: Avalonia 11.x with CommunityToolkit.Mvvm
- Tests: xUnit 3, Shouldly, Microsoft.Testing.Platform runner
## Client.Shared
## Client.Shared — Design Detail
### ConnectionSettings Model
### IOpcUaClientService Interface (reference)
```
EndpointUrl: string (required)
FailoverUrls: string[] (optional)
Username: string? (optional, first-class property)
Password: string? (optional, first-class property)
SecurityMode: enum (None, Sign, SignAndEncrypt) — default None
SessionTimeoutSeconds: int — default 60
AutoAcceptCertificates: bool — default true
CertificateStorePath: string? — default platform-appropriate location
```
**Lifecycle:** `ConnectAsync(ConnectionSettings)`, `DisconnectAsync()`, `IsConnected`.
### IOpcUaClientService Interface
**Read/Write:** `ReadValueAsync(NodeId)`, `WriteValueAsync(NodeId, object)`.
Single service interface covering all OPC UA operations:
**Browse:** `BrowseAsync(NodeId? parent)``BrowseResult[]` (NodeId, DisplayName, NodeClass, HasChildren); lazy-load compatible.
**Lifecycle:**
- `ConnectAsync(ConnectionSettings)` — connect to server, handle endpoint discovery, security, auth
- `DisconnectAsync()` — close session cleanly
- `IsConnected` property
**Subscribe:** `SubscribeAsync(NodeId, int intervalMs)`, `UnsubscribeAsync(NodeId)`, `event DataChanged(NodeId, DataValue)`.
**Read/Write:**
- `ReadValueAsync(NodeId)` — returns DataValue (value, status, timestamps)
- `WriteValueAsync(NodeId, object value)` — auto-detects target type, returns StatusCode
**Alarms:** `SubscribeAlarmsAsync(NodeId? source, int intervalMs)`, `UnsubscribeAlarmsAsync()`, `AcknowledgeAsync(conditionId, comment)`, `RequestConditionRefreshAsync()`, `event AlarmEvent(AlarmEventArgs)`.
**Browse:**
- `BrowseAsync(NodeId? parent)` — returns list of BrowseResult (NodeId, DisplayName, NodeClass)
- Lazy-load compatible (browse one level at a time)
**History:** `HistoryReadRawAsync(NodeId, start, end, maxValues)`, `HistoryReadAggregateAsync(NodeId, start, end, AggregateType, intervalMs)`.
**Subscribe:**
- `SubscribeAsync(NodeId, int intervalMs)` — create monitored item subscription
- `UnsubscribeAsync(NodeId)` — remove monitored item
- `event DataChanged` — fires on value change with (NodeId, DataValue)
**Alarms:**
- `SubscribeAlarmsAsync(NodeId? source, int intervalMs)` — subscribe to alarm events
- `UnsubscribeAlarmsAsync()` — remove alarm subscription
- `RequestConditionRefreshAsync()` — trigger condition refresh
- `event AlarmEvent` — fires on alarm state change with AlarmEventArgs
**History:**
- `HistoryReadRawAsync(NodeId, DateTime start, DateTime end, int maxValues)` — raw historical values
- `HistoryReadAggregateAsync(NodeId, DateTime start, DateTime end, AggregateType, double intervalMs)` — aggregated values
**Redundancy:**
- `GetRedundancyInfoAsync()` — returns RedundancyInfo (mode, service level, server URIs, app URI)
**Failover:**
- Automatic failover across FailoverUrls with keep-alive monitoring
- `event ConnectionStateChanged` — fires on connect/disconnect/failover
**Redundancy:** `GetRedundancyInfoAsync()``RedundancyInfo` (Mode, ServiceLevel, ServerUris, ApplicationUri).
### Models
- `BrowseResult`: NodeId, DisplayName, NodeClass, HasChildren
- `AlarmEventArgs`: SourceName, ConditionName, Severity, Message, Retain, ActiveState, AckedState, Time
- `RedundancyInfo`: Mode, ServiceLevel, ServerUris, ApplicationUri
- `ConnectionState`: enum (Disconnected, Connecting, Connected, Reconnecting)
- `AggregateType`: enum (Average, Minimum, Maximum, Count, Start, End)
- `BrowseResult` NodeId, DisplayName, NodeClass, HasChildren
- `AlarmEventArgs` SourceName, ConditionName, Severity, Message, Retain, ActiveState, AckedState, Time
- `RedundancyInfo` Mode, ServiceLevel, ServerUris, ApplicationUri
- `ConnectionState` enum (Disconnected, Connecting, Connected, Reconnecting)
- `AggregateType` enum (Average, Minimum, Maximum, Count, Start, End)
### Type Conversion
---
Port the existing `ConvertValue` logic from the CLI tool: reads the current node value to determine the target type, then coerces the input value.
### Certificate Management
- Cross-platform certificate store path (default: `{AppData}/LmxOpcUaClient/pki/`)
- Auto-generate client certificate on first use
- Auto-accept untrusted server certificates (configurable)
### Logging
Serilog with `ILogger` passed via constructor or `Log.ForContext<T>()`. No sinks configured in the library — consumers configure sinks.
## Client.CLI
### Commands
All 8 commands:
| Command | Description |
|---------|-------------|
| `connect` | Test server connectivity |
| `read` | Read a node value |
| `write` | Write a value to a node |
| `browse` | Browse address space (with depth/recursive) |
| `subscribe` | Monitor node for value changes |
| `historyread` | Read historical data (raw + aggregates) |
| `alarms` | Subscribe to alarm events |
| `redundancy` | Query redundancy state |
All commands use the shared `IOpcUaClientService`. Each command:
1. Creates `ConnectionSettings` from CLI options
2. Creates `OpcUaClientService`
3. Calls the appropriate method
4. Formats and prints results
### Common Options (all commands)
- `-u, --url` (required): Endpoint URL
- `-U, --username`: Username
- `-P, --password`: Password
- `-S, --security`: Security mode (none/sign/encrypt)
- `-F, --failover-urls`: Comma-separated failover endpoints
### Logging
Serilog console sink at Warning level by default, with `--verbose` flag for Debug.
## Client.UI
### Window Layout
## Client.UI — View Layout (reference)
Single-window Avalonia application:
@@ -146,82 +175,43 @@ Single-window Avalonia application:
│ [Endpoint URL] [User] [Pass] [Security▼] [Connect] │
│ Redundancy: Mode=Warm ServiceLevel=200 AppUri=... │
├──────────────┬──────────────────────────────────────────┤
│ │ ┌Read/Write─┬─Subscriptions─┬─Alarms─┬─History┐│
│ Address │ │ Node: ns=3;s=Tag.Attr ││
│ Space │ │ Value: 42.5 ││
│ Tree │ │ Status: Good ││
│ Browser │ │ [Write: ____] [Send] │
│ │ │ ││
│ (lazy-load) │ │ ││
│ │ └──────────────────────────────────────┘│
│ │ ┌Read/WriteSubscriptionsAlarmsHistory┐│
│ Address │ │ Node: ns=3;s=Tag.Attr ││
│ Space │ │ Value: 42.5 Status: Good ││
│ Tree │ │ [Write: ____] [Send] ││
│ Browser │ └───────────────────────────────────────┘
├──────────────┴──────────────────────────────────────────┤
│ Status: Connected | Session: abc123 | 3 subscriptions │
└─────────────────────────────────────────────────────────┘
```
### Views and ViewModels (CommunityToolkit.Mvvm)
### ViewModels (CommunityToolkit.Mvvm)
**MainWindowViewModel:**
- Connection settings properties (bound to top bar inputs)
- ConnectCommand / DisconnectCommand (RelayCommand)
- ConnectionState property
- RedundancyInfo property
- SelectedTreeNode property
- StatusMessage property
**BrowseTreeViewModel:**
- Root nodes collection (ObservableCollection)
- Lazy-load children on expand via `BrowseAsync`
- TreeNodeViewModel: NodeId, DisplayName, NodeClass, Children, IsExpanded, HasChildren
**ReadWriteViewModel:**
- SelectedNode (from tree selection)
- CurrentValue, Status, SourceTimestamp
- WriteValue input + WriteCommand
- Auto-read on node selection
**SubscriptionsViewModel:**
- ActiveSubscriptions collection (ObservableCollection)
- AddSubscription / RemoveSubscription commands
- Live value updates dispatched to UI thread
- Columns: NodeId, Value, Status, Timestamp
**AlarmsViewModel:**
- AlarmEvents collection (ObservableCollection)
- SubscribeCommand / UnsubscribeCommand / RefreshCommand
- MonitoredNode property
- Live alarm events dispatched to UI thread
**HistoryViewModel:**
- SelectedNode (from tree selection)
- StartTime, EndTime, MaxValues, AggregateType, Interval
- ReadCommand
- Results collection (ObservableCollection)
- Columns: Timestamp, Value, Status
### UI Thread Dispatch
All events from `IOpcUaClientService` must be dispatched to the Avalonia UI thread via `Dispatcher.UIThread.Post()` before updating ObservableCollections.
- `MainWindowViewModel` — connection fields, connect/disconnect commands, `ConnectionState`, `RedundancyInfo`, `SelectedTreeNode`, `StatusMessage`.
- `BrowseTreeViewModel` — root collection (`ObservableCollection<TreeNodeViewModel>`), lazy-load on expand.
- `ReadWriteViewModel` — auto-read on selection, `WriteValue` + `WriteCommand`.
- `SubscriptionsViewModel``ActiveSubscriptions`, `AddSubscription`, `RemoveSubscription`, live `DataChanged` dispatch to UI thread.
- `AlarmsViewModel``AlarmEvents`, Subscribe / Unsubscribe / Refresh / Acknowledge commands.
- `HistoryViewModel``StartTime`, `EndTime`, `MaxValues`, `AggregateType`, `Interval`, `ReadCommand`, `Results`.
## Test Projects
### Client.Shared.Tests
- ConnectionSettings validation
- Type conversion (ConvertValue)
- BrowseResult model construction
- AlarmEventArgs model construction
- `ConnectionSettings` validation
- Type conversion
- `BrowseResult` / `AlarmEventArgs` / `RedundancyInfo` model construction
- FailoverUrl parsing
### Client.CLI.Tests
- Command option parsing (via CliFx test infrastructure)
- Output formatting
- Output formatting for each command
### Client.UI.Tests
- ViewModel property change notifications
- Command can-execute logic
- Tree node lazy-load behavior (with mocked IOpcUaClientService)
- ViewModel property-change notifications
- Command `CanExecute` logic
- Tree lazy-load behavior (with mocked `IOpcUaClientService`)
### Test Framework
- xUnit 3 with Microsoft.Testing.Platform runner
- Shouldly for assertions
- No live OPC UA server required — mock IOpcUaClientService for unit tests
- Shouldly
- No live OPC UA server required — mock `IOpcUaClientService` for unit tests

View File

@@ -1,106 +1,113 @@
# Galaxy Repository — Component Requirements
# Galaxy Driver — Galaxy Repository Requirements
Parent: [HLR-002](HighLevelReqs.md#hlr-002-galaxy-hierarchy-as-opc-ua-address-space), [HLR-005](HighLevelReqs.md#hlr-005-dynamic-address-space-rebuild)
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). Scope clarified: this document is **Galaxy-driver-specific**. Galaxy is one of seven drivers in the OtOpcUa platform; the requirements below describe the SQL-side of the Galaxy driver (hierarchy/attribute/change-detection queries against the ZB database) that backs the Galaxy driver's `ITagDiscovery.DiscoverAsync` and `IRediscoverable` implementations. All Galaxy-specific SQL runs inside `OtOpcUa.Galaxy.Host` (.NET 4.8 x86 Windows service); the in-server `Driver.Galaxy.Proxy` calls it over a named pipe. For platform-wide tag discovery requirements see `OpcUaServerReqs.md` OPC-002. For deeper spec see `docs/GalaxyRepository.md` and `docs/v2/driver-specs.md`.
Parent: [HLR-002](HighLevelReqs.md#hlr-002-multi-driver-plug-in-model), [HLR-003](HighLevelReqs.md#hlr-003-address-space-composition-per-namespace), [HLR-006](HighLevelReqs.md#hlr-006-change-detection-and-rediscovery)
Driver scope: Galaxy only. Namespace kind: `SystemPlatform`.
## GR-001: Hierarchy Extraction
The system shall query the Galaxy Repository database to extract all deployed objects with their parent-child containment relationships, contained names, and tag names.
The Galaxy driver's `ITagDiscovery.DiscoverAsync` implementation shall query the ZB Galaxy Repository database to extract all deployed objects with their parent-child containment relationships, contained names, and tag names.
### Acceptance Criteria
- Executes `queries/hierarchy.sql` against the ZB database.
- Executes `queries/hierarchy.sql` against the ZB database from within `OtOpcUa.Galaxy.Host`.
- Returns a list of objects with: `gobject_id`, `tag_name`, `contained_name`, `browse_name`, `parent_gobject_id`, `is_area`.
- Objects with `parent_gobject_id = 0` are children of the root ZB node.
- Objects with `parent_gobject_id = 0` become children of the root ZB node inside the `SystemPlatform` namespace.
- Only deployed, non-template objects matching the category filter (areas, engines, user-defined objects, etc.) are returned.
- Query completes within 10 seconds on a typical Galaxy (hundreds of objects). Log a Warning if it takes longer.
- Query completes within 10 seconds on a typical Galaxy (hundreds of objects). Log Warning if it takes longer.
### Details
- Results are ordered by `parent_gobject_id, tag_name` for deterministic tree building.
- If the query returns zero rows, log a Warning (Galaxy may have no deployed objects, or the DB connection may be misconfigured).
- Orphan detection: if a row references a `parent_gobject_id` that does not exist in the result set and is not 0, log a Warning and skip that node.
- Empty result → Warning logged (Galaxy may have no deployed objects, or the DB connection may be misconfigured).
- Orphan detection: a row referencing a non-existent `parent_gobject_id` (and not 0) is skipped with a Warning.
- Streamed to the core via `IAddressSpaceBuilder.AddFolder` / `AddObject` calls over the Galaxy named pipe; no in-memory full-tree buffering on the Host side.
---
## GR-002: Attribute Extraction
The system shall query user-defined (dynamic) attributes for deployed objects, including data type, array flag, and array dimensions.
The Galaxy driver shall query user-defined (dynamic) attributes for deployed objects, including data type, array flag, and array dimensions.
### Acceptance Criteria
- Executes `queries/attributes.sql` using the template chain CTE to resolve inherited attributes.
- Returns: `gobject_id`, `tag_name`, `attribute_name`, `full_tag_reference`, `mx_data_type`, `is_array`, `array_dimension`, `security_classification`.
- Attributes starting with `_` are filtered out by the query.
- `array_dimension` is correctly extracted from the `mx_value` hex bytes (positions 13-16, little-endian uint16).
- `array_dimension` is extracted from the `mx_value` hex bytes (positions 13-16, little-endian uint16).
### Details
- CTE recursion depth is limited to 10 levels (per the query). This is sufficient for Galaxy template hierarchies.
- If `mx_data_type` is null or not in the known set (1-8, 13-16), default to String.
- If `gobject_id` from an attribute row does not match any hierarchy object, skip that attribute (object may not be deployed).
- CTE recursion depth is limited to 10 levels.
- `mx_data_type` not in the known set (1-8, 13-16) defaults to String.
- `gobject_id` that doesn't match a hierarchy object is skipped (object may not be deployed).
- Each emitted attribute is reported via `DriverAttributeInfo` to the core through `IAddressSpaceBuilder.AddVariable`.
---
## GR-003: Change Detection
## GR-003: Change Detection and IRediscoverable
The system shall poll `galaxy.time_of_last_deploy` at a configurable interval to detect when a new deployment has occurred.
The Galaxy driver shall implement `IRediscoverable` by polling `galaxy.time_of_last_deploy` on a configurable interval to detect when a new deployment has occurred.
### Acceptance Criteria
- Polls `SELECT time_of_last_deploy FROM galaxy` at a configurable interval (`GalaxyRepository:ChangeDetectionIntervalSeconds`, default 30 seconds).
- Polls `SELECT time_of_last_deploy FROM galaxy` at a configurable interval (`Galaxy:ChangeDetectionIntervalSeconds`, default 30 seconds).
- Compares the returned timestamp to the last known value stored in memory.
- If different, triggers a rebuild (re-run hierarchy + attributes queries, notify OPC UA server).
- First poll after startup always triggers an initial build.
- If the query fails (SQL timeout, connection error), log Warning and retry at next interval. Do not trigger a rebuild on failure.
- If different, raises the `IRediscoverable.RediscoveryNeeded` signal so the core re-runs `ITagDiscovery.DiscoverAsync` and surgically rebuilds the Galaxy namespace subtree (per OPC-017).
- First poll after startup always triggers an initial discovery.
- Query failure → Warning logged; no rediscovery triggered; retry at next interval.
### Details
- Polling runs on a background timer thread, not blocking the STA thread.
- `time_of_last_deploy` is a datetime column. Compare using exact equality (not range).
- Polling runs on a background `Task` inside `OtOpcUa.Galaxy.Host`, not on the STA message-pump thread.
- `time_of_last_deploy` is a `datetime` column; compared using exact equality (not a range).
- Signal delivery to the Proxy happens via a server-push message on the Galaxy named pipe.
---
## GR-004: Rebuild on Change
## GR-004: Rediscovery Data Flow
When a deployment change is detected, the system shall re-query hierarchy and attributes and provide the updated structure to the OPC UA server for address space rebuild.
On a deployment change, the Galaxy driver shall re-query hierarchy + attributes and stream the updated structure to the core for surgical namespace rebuild.
### Acceptance Criteria
- On change detection, re-query both hierarchy and attributes.
- Provide the new data set to the OPC UA server component for address space replacement.
- Log at Information level: "Galaxy deployment change detected. Rebuilding address space. ({ObjectCount} objects, {AttributeCount} attributes)".
- Log total rebuild time at Information level.
- If the re-query fails, log Error and keep the existing address space (do not clear it).
- On change signal, re-run `GR-001` (hierarchy) and `GR-002` (attributes) queries.
- Stream the new tree to the core via `IAddressSpaceBuilder` over the named pipe.
- Log at Information level: `"Galaxy deployment change detected. Rebuilding. ({ObjectCount} objects, {AttributeCount} attributes)"`.
- Log total rediscovery duration at Information level.
- On re-query failure: Error logged; existing Galaxy subtree is retained.
### Details
- Rebuild is not atomic from the DB perspective — hierarchy and attributes are two separate queries. This is acceptable; deployment is an infrequent operation.
- Raise an event/callback that the OPC UA server subscribes to: `OnGalaxyChanged(hierarchyData, attributeData)`.
- Rediscovery is not atomic from the DB perspective — hierarchy and attributes are two separate queries. Acceptable; Galaxy deployment is an infrequent operation.
- The core owns the diff/surgical apply per OPC-017; the Galaxy driver only streams the new authoritative tree.
---
## GR-005: Connection Configuration
Database connection parameters shall be configurable via appsettings.json (connection string using Windows Authentication by default).
Galaxy DB connection parameters shall be configurable via environment variables passed from the `OtOpcUa.Galaxy.Host` supervisor at spawn time.
### Acceptance Criteria
- Connection string in `appsettings.json` under `GalaxyRepository:ConnectionString`.
- Default: `Server=localhost;Database=ZB;Integrated Security=true` (Windows Auth).
- ADO.NET `SqlConnection` used for queries (.NET Framework 4.8 built-in).
- Connection string via `OTOPCUA_GALAXY_ZB_CONN` environment variable.
- Default: `Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;` (Windows Auth).
- ADO.NET `SqlConnection` used for queries (.NET Framework 4.8).
- Connection is opened per-query (not kept open). Connection pooling handles efficiency.
- If the initial connection test at startup fails, log Error with the connection string and continue attempting (change detection polls will keep retrying).
- If the initial connection test at startup fails, log Error with the connection string sanitized and continue attempting (change-detection polls keep retrying).
### Details
- Command timeout: configurable via `GalaxyRepository:CommandTimeoutSeconds`, default 30 seconds.
- No ORM. Raw ADO.NET with `SqlCommand` and `SqlDataReader`. SQL text is embedded as constants (not dynamically constructed).
- Command timeout: `Galaxy:CommandTimeoutSeconds` in Config DB driver JSON (default 30 seconds).
- No ORM. Raw ADO.NET with `SqlCommand` and `SqlDataReader`. SQL text embedded as constants.
---
## GR-006: Query Safety
All SQL queries shall be static read-only SELECT statements. No writes to the Galaxy Repository database.
All Galaxy SQL queries shall be static read-only SELECT statements. No writes to the Galaxy Repository database.
### Acceptance Criteria
@@ -112,10 +119,23 @@ All SQL queries shall be static read-only SELECT statements. No writes to the Ga
## GR-007: Startup Validation
On startup, the Galaxy Repository component shall validate database connectivity.
On startup, the Galaxy driver's DB component inside `OtOpcUa.Galaxy.Host` shall validate database connectivity.
### Acceptance Criteria
- Execute a simple test query (`SELECT 1`) against the configured database.
- If the database is unreachable, log an Error but do not prevent service startup.
- The service runs in degraded mode (empty address space) until the database becomes available and the next change detection poll succeeds.
- Execute a simple test query (`SELECT 1`) against the configured Galaxy DB.
- If the database is unreachable, log Error but do not prevent Host startup.
- The Galaxy driver runs in degraded mode (empty SystemPlatform namespace) until the database becomes available and the next change-detection poll succeeds.
- In degraded mode the Galaxy driver instance reports `DriverHealth.Unavailable`, causing its Polly circuit state to be open until the first successful discovery.
---
## GR-008: Capability Wrapping
All calls into the Galaxy DB component from the Proxy side shall route through `CapabilityInvoker.InvokeAsync(DriverCapability.Discover, …)`.
### Acceptance Criteria
- `Driver.Galaxy.Proxy.DiscoverAsync` is a thin capability-invoker call that sends a MessagePack request over the named pipe to the Host's DB component.
- Roslyn analyzer **OTOPCUA0001** validates there are no direct discovery calls bypassing the invoker.
- Polly pipeline for `DriverCapability.Discover` on the Galaxy driver instance carries Timeout + Retry + CircuitBreaker.

View File

@@ -1,47 +1,94 @@
# High-Level Requirements
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). The original 2025 text described a single-process Galaxy/MXAccess server called LmxOpcUa. Today the project is the **OtOpcUa** multi-driver OPC UA platform deployed as three cooperating processes (Server, Admin, Galaxy.Host). The Galaxy integration is one of seven shipped drivers. HLR-001 through HLR-008 have been rewritten driver-agnostically; HLR-009 has been retired (the embedded Status Dashboard is superseded by the Admin UI). HLR-010 through HLR-017 are new and cover plug-in drivers, resilience, Config DB / draft-publish, cluster redundancy, fleet-wide identifier uniqueness, Admin UI, audit logging, metrics, and the Roslyn capability-wrapping analyzer.
## HLR-001: OPC UA Server
The system shall expose an OPC UA server endpoint that OPC UA clients can connect to for browsing, reading, and writing Galaxy tag data.
The system shall expose an OPC UA server endpoint that OPC UA clients can connect to for browsing, reading, writing, subscribing, acknowledging alarms, and reading historical values. Data is sourced from one or more **driver instances** that plug into the common core; OPC UA clients see a single unified address space per endpoint regardless of how many drivers are active behind it.
## HLR-002: Galaxy Hierarchy as OPC UA Address Space
## HLR-002: Multi-Driver Plug-In Model
The system shall build an OPC UA address space that mirrors the System Platform Galaxy object hierarchy, using contained names for browse structure and tag names for runtime data access.
The system shall support pluggable driver modules that bind to specific data sources. v2.0 ships seven drivers: Galaxy (AVEVA System Platform via MXAccess), Modbus TCP (including DL205 via `AddressFormat=DL205`), Allen-Bradley CIP (ControlLogix/CompactLogix), Allen-Bradley Legacy (SLC/MicroLogix via PCCC), Siemens S7, Beckhoff TwinCAT (ADS), FANUC FOCAS, and OPC UA Client (aggregation/gateway). Drivers implement only the capability interfaces (`IDriver`, `ITagDiscovery`, `IReadable`, `IWritable`, `ISubscribable`, `IAlarmSource`, `IHistoryProvider`, `IHostConnectivityProbe`, `IPerCallHostResolver`, `IRediscoverable`) defined in `ZB.MOM.WW.OtOpcUa.Core.Abstractions` that apply to their protocol. Multiple instances of the same driver type are supported; each instance binds to its own OPC UA namespace index.
## HLR-003: MXAccess Runtime Data Access
## HLR-003: Address Space Composition per Namespace
The system shall use the MXAccess toolkit to subscribe to, read, and write Galaxy tag attribute values at runtime on behalf of connected OPC UA clients.
The system shall build the OPC UA address space by composing per-driver subtrees into a single endpoint. Each driver instance owns one namespace and registers its nodes via the core-provided `IAddressSpaceBuilder` streaming API. The Galaxy driver continues to mirror the deployed ArchestrA object hierarchy (contained-name browse paths) in a namespace of kind `SystemPlatform`. Native-protocol drivers populate a namespace of kind `Equipment` whose browse structure conforms to the canonical 5-level Unified Namespace (`Enterprise / Site / Area / Line / Equipment / Signal`).
## HLR-004: Data Type Mapping
The system shall map Galaxy attribute data types (mx_data_type) to appropriate OPC UA built-in types, including support for array attributes.
Each driver shall map its native data types to OPC UA built-in types via `DriverDataType` conversions, including support for arrays (ValueRank=1 with ArrayDimensions). Type mapping is driver-specific — `docs/DataTypeMapping.md` covers Galaxy/MXAccess; each other driver's spec in `docs/v2/driver-specs.md` covers its own mapping. Unknown/unmapped driver types shall default to String per the driver's spec.
## HLR-005: Dynamic Address Space Rebuild
## HLR-005: Live Data Access
The system shall detect Galaxy deployment changes (via `galaxy.time_of_last_deploy`) and rebuild the OPC UA address space to reflect the current deployed state.
For every data-path operation (read, write, subscribe notification, alarm event, history read, tag rediscovery, host connectivity probe), the system shall route the call through the capability interface owned by the target driver instance. Reads and subscriptions shall deliver a `DataValueSnapshot` carrying value, OPC UA `StatusCode`, and source timestamp regardless of the underlying protocol. Every async capability invocation at dispatch shall pass through `Core.Resilience.CapabilityInvoker`.
## HLR-006: Windows Service Hosting
## HLR-006: Change Detection and Rediscovery
The system shall run as a Windows service (via TopShelf) with support for install, uninstall, and interactive console modes.
Drivers whose backend has a native change signal (e.g. Galaxy's `time_of_last_deploy`, OPC UA Client receiving `ServerStatusChange`) shall implement the optional `IRediscoverable` interface so the core can rebuild only the affected subtree. Drivers whose tag set is static relative to a published config generation are not required to implement `IRediscoverable`; their address-space structure changes only via a new published Config DB generation (see HLR-012).
## HLR-007: Logging
## HLR-007: Service Hosting
The system shall log operational events to rolling daily log files using Serilog.
The system shall be deployed as three cooperating Windows services:
## HLR-008: Connection Resilience
- **OtOpcUa.Server** — .NET 10 x64, `Microsoft.Extensions.Hosting` + `AddWindowsService`, hosts all non-Galaxy drivers in-process and the OPC UA endpoint.
- **OtOpcUa.Admin** — .NET 10 x64 Blazor Server web app, hosts the admin UI, SignalR hubs for live updates, `/metrics` Prometheus endpoint, and audit log writers.
- **OtOpcUa.Galaxy.Host** — .NET Framework 4.8 x86 (TopShelf), hosts MXAccess COM + Galaxy Repository SQL + Historian plugin. Talks to `Driver.Galaxy.Proxy` inside `OtOpcUa.Server` via a named pipe (MessagePack over length-prefixed frames, per-process shared secret, SID-restricted ACL).
The system shall automatically reconnect to MXAccess after connection loss, replaying active subscriptions upon reconnect.
## HLR-008: Logging
## HLR-009: Status Dashboard
The system shall log operational events to rolling daily file sinks using Serilog on every process. Plain-text is on by default; structured JSON (CompactJsonFormatter) is opt-in via `Serilog:WriteJson = true` so SIEMs (Splunk, Datadog) can ingest without a regex parser.
The system shall host an embedded HTTP status dashboard (similar to the LmxProxy dashboard) providing at-a-glance operational visibility including connection state, health, subscription statistics, and operation metrics.
## HLR-009: Transport Security and Authentication
The system shall support configurable OPC UA transport-security profiles (`None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`, `Aes128_Sha256_RsaOaep-Sign`, `Aes128_Sha256_RsaOaep-SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign`, `Aes256_Sha256_RsaPss-SignAndEncrypt`) resolved at startup by `SecurityProfileResolver`. UserName-token authentication shall be validated against LDAP (production: Active Directory; dev: GLAuth). The server certificate is always created even for `None`-only deployments because UserName token encryption depends on it.
## HLR-010: Per-Driver-Instance Resilience
Every async capability call at dispatch shall pass through `Core.Resilience.CapabilityInvoker`, which runs a Polly v8 pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`. Retry and circuit-breaker strategies are per capability per decision #143: Read / Discover / Probe / Subscribe / AlarmSubscribe / HistoryRead retry automatically; Write and AlarmAcknowledge do **not** retry unless the tag or capability is explicitly marked with `WriteIdempotentAttribute`. A driver-instance circuit-breaker trip sets Bad quality on that instance's nodes only; other drivers are unaffected (decision #144 — per-host Polly isolation).
## HLR-011: Config DB and Draft/Publish
Cluster topology, driver instances, namespaces, UNS hierarchy, equipment, tags, node ACLs, poll groups, and role grants shall live in a central MSSQL Config DB, not in `appsettings.json`. Changes accumulate in a draft generation that is validated and then atomically published. Each published generation gets a monotonically increasing `GenerationNumber` scoped per cluster. Nodes poll the DB for new published generations and diff-apply surgically against an atomic snapshot. `appsettings.json` is reduced to bootstrap-only fields (Config DB connection, NodeId, ClusterId, LDAP, security profile, redundancy role, logging, local cache path).
## HLR-012: Local Cache Fallback
Each node shall maintain a sealed LiteDB local cache of the most recent successfully applied generation. If the central Config DB is unreachable at startup, the node shall boot from its cached generation and log a warning. Cache reads are the Polly `Fallback` leg of the Config DB pipeline.
## HLR-013: Cluster Redundancy
The system shall support non-transparent OPC UA redundancy via 2-node clusters sharing a Config DB generation. `RedundancyCoordinator` + `ServiceLevelCalculator` compute a dynamic OPC UA `ServiceLevel` reflecting role (Primary/Secondary), publish state (current generation applied vs mid-apply), health (driver circuit-breaker state), and apply-lease state. Clients select an endpoint by `ServerUriArray` + `ServiceLevel` per the OPC UA spec; there is no VIP or load balancer. Single-node deployments use the same model with `NodeCount = 1`.
## HLR-014: Fleet-Wide Identifier Uniqueness
Equipment identifiers that integrate with external systems (`ZTag` for ERP, `SAPID` for SAP PM) shall be unique fleet-wide (across all clusters), not just within a cluster. The Admin UI enforces this at draft-publish time via the `ExternalIdReservation` table, which reserves external IDs across clusters so two clusters cannot publish the same ZTag or SAPID. `EquipmentUuid` is immutable and globally unique (UUIDv4). `EquipmentId` and `MachineCode` are unique within a cluster.
## HLR-015: Admin UI Operator Surface
The system shall provide a Blazor Server Admin UI (`OtOpcUa.Admin`) as the sole write path into the Config DB. Capabilities include: cluster + node management, driver-instance CRUD with schemaless JSON editors, UNS drag-and-drop hierarchy editor, CSV-driven equipment import with fleet-wide external-id reservation, draft/publish with a 6-section diff viewer (Drivers / Namespaces / UNS / Equipment / Tags / ACLs), node-ACL editor producing a permission trie, LDAP role grants, redundancy tab, live cluster-generation state via SignalR, audit log viewer. Users authenticate via cookie-auth over LDAP bind; three admin roles (`ConfigViewer`, `ConfigEditor`, `FleetAdmin`) gate UI operations.
## HLR-016: Audit Logging
Every publish event and every ACL / role-grant change shall produce an immutable audit log row in the Config DB via `AuditLogService` with the acting principal, timestamp, action, before/after generation numbers, and affected entity ids. Audit rows are never mutated or deleted.
## HLR-017: Prometheus Metrics
The Admin service shall expose a `/metrics` endpoint using OpenTelemetry → Prometheus. Core / Server shall emit driver health (per `DriverInstanceId`), Polly circuit-breaker states (per `DriverInstanceId` + `HostName` + `DriverCapability`), capability-call duration histograms, subscription counts, session counts, memory-tracking gauges (Phase 6.1), publish durations, and Config-DB apply-status gauges.
## HLR-018: Roslyn Analyzer OTOPCUA0001
All direct call sites to capability-interface methods (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync`, `IAlarmSource.SubscribeAlarmsAsync` / `AcknowledgeAsync`, `IHistoryProvider.*`, `IHostConnectivityProbe.*`) made outside `Core.Resilience.CapabilityInvoker` shall produce Roslyn diagnostic **OTOPCUA0001** at build time. The analyzer is shipped in `ZB.MOM.WW.OtOpcUa.Analyzers` and referenced by every project that could host a capability call, guaranteeing that resilience cannot be accidentally bypassed.
## Retired HLRs
- **HLR-009 (Status Dashboard)** — retired. Superseded by the Admin UI (HLR-015). See `docs/v2/admin-ui.md`.
## Component-Level Requirements
Detailed requirements are broken out into the following documents:
- [OPC UA Server Requirements](OpcUaServerReqs.md)
- [MXAccess Client Requirements](MxAccessClientReqs.md)
- [Galaxy Repository Requirements](GalaxyRepositoryReqs.md)
- [Service Host Requirements](ServiceHostReqs.md)
- [Status Dashboard Requirements](StatusDashboardReqs.md)
- [Galaxy Driver — Repository Requirements](GalaxyRepositoryReqs.md) (Galaxy driver only)
- [Galaxy Driver — MXAccess Client Requirements](MxAccessClientReqs.md) (Galaxy driver only)
- [Service Host Requirements](ServiceHostReqs.md) (all three processes)
- [Client Requirements](ClientRequirements.md) (Client CLI + Client UI)
- [Status Dashboard Requirements](StatusDashboardReqs.md) (retired — pointer only)

View File

@@ -1,6 +1,10 @@
# MXAccess Client — Component Requirements
# Galaxy Driver — MXAccess Client Requirements
Parent: [HLR-003](HighLevelReqs.md#hlr-003-mxaccess-runtime-data-access), [HLR-008](HighLevelReqs.md#hlr-008-connection-resilience)
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). Scope narrowed: this document covers the MXAccess surface **inside `OtOpcUa.Galaxy.Host`** (.NET Framework 4.8 x86 Windows service). The in-server `Driver.Galaxy.Proxy` implements the `IReadable` / `IWritable` / `ISubscribable` / `IAlarmSource` / `IHistoryProvider` capability interfaces and routes every wire call through the named pipe to this Host process. The STA thread + reconnect playback + subscription refcount requirements from v1 are preserved; what changed is where they live (Host service, not the Server process). MXA-010 (proxy-side wrapping) and MXA-011 (pipe ACL / shared secret) are new.
Parent: [HLR-002](HighLevelReqs.md#hlr-002-multi-driver-plug-in-model), [HLR-005](HighLevelReqs.md#hlr-005-live-data-access), [HLR-007](HighLevelReqs.md#hlr-007-service-hosting)
Driver scope: Galaxy only. Process scope: `OtOpcUa.Galaxy.Host` (Host side) and `Driver.Galaxy.Proxy` (server-side forwarder).
## MXA-001: STA Thread with Message Pump
@@ -8,165 +12,194 @@ All MXAccess COM objects shall be created and called on a dedicated STA thread r
### Acceptance Criteria
- A dedicated thread is created with `ApartmentState.STA` before any MXAccess COM objects are instantiated.
- The thread runs a Win32 message pump using `GetMessage`/`TranslateMessage`/`DispatchMessage` loop.
- A dedicated thread is created with `ApartmentState.STA` before any MXAccess COM object is instantiated; implementation lives in `StaPump` inside `OtOpcUa.Galaxy.Host`.
- The thread runs a Win32 message pump using `GetMessage` / `TranslateMessage` / `DispatchMessage`.
- Work items are marshalled to the STA thread via `PostThreadMessage(WM_APP)` and a concurrent queue.
- The STA thread processes work items between message pump iterations.
- All COM object creation (`LMXProxyServer` constructor), method calls, and event callbacks happen on this thread.
- All COM object creation (`LMXProxyServer`), method calls, and event callbacks happen on this thread.
- Thread name `Galaxy.Sta` (for diagnostics).
### Details
- Thread name: `MxAccess-STA` (for diagnostics).
- If the STA thread dies unexpectedly, log Fatal and trigger service shutdown. Do not attempt to create a replacement thread (COM objects on the dead thread are unrecoverable).
- `RunAsync(Action)` method returns a `Task` that completes when the action executes on the STA thread. Callers can `await` it.
- If the STA thread dies unexpectedly, log Fatal and trigger Host service shutdown. The supervisor restarts the Host under its driver-stability policy (`docs/v2/driver-stability.md`). COM objects on the dead thread are unrecoverable; no in-process recovery is attempted.
- `RunAsync(Action)` returns a `Task` that completes when the action executes on the STA thread. Callers can `await` it.
---
## MXA-002: Connection Lifecycle
The client shall support Register/Unregister lifecycle with the LMXProxyServer COM object, tracking the connection handle.
The Host shall support Register/Unregister lifecycle with the `LMXProxyServer` COM object, tracking the connection handle.
### Acceptance Criteria
- `Register(clientName)` is called on the STA thread and returns a positive connection handle on success.
- If Register returns handle <= 0, throw with descriptive error.
- Handle ≤ 0 → descriptive error thrown; Host reports `DriverHealth.Unavailable` via the pipe so the Proxy reports Bad quality to the core.
- `Unregister(handle)` is called during disconnect after all subscriptions are removed.
- Client name: configurable via `MxAccess:ClientName`, default `LmxOpcUa`. Must be unique per MXAccess registration.
- Client name comes from `OTOPCUA_GALAXY_CLIENT_NAME` environment variable; default `OtOpcUa-Galaxy.Host`. Must be unique per MXAccess registration (a cluster's Primary and Secondary each get their own client-name suffix via node override).
- Connection state transitions: Disconnected → Connecting → Connected → Disconnecting → Disconnected (and Error from any state).
### Details
- `ConnectedSince` timestamp (UTC) is recorded after successful Register.
- `ReconnectCount` is tracked for diagnostics and dashboard display.
- State change events are raised for dashboard and health check consumption.
- `ConnectedSince` (UTC) recorded after successful Register.
- `ReconnectCount` tracked for diagnostics and `/metrics`.
- State changes are emitted over the pipe as `DriverHealth` updates.
---
## MXA-003: Tag Subscription
The client shall support subscribing to tags via AddItem + AdviseSupervisory, receiving value updates through OnDataChange callbacks.
The Host shall support subscribing to tags via AddItem + AdviseSupervisory, receiving value updates through OnDataChange callbacks.
### Acceptance Criteria
- Subscribe sequence: `AddItem(handle, address)` returns item handle, then `AdviseSupervisory(handle, itemHandle)` starts the subscription.
- `OnDataChange` callback delivers value, quality (integer), timestamp, and MXSTATUS_PROXY array.
- `OnDataChange` callback delivers value, quality, timestamp, and MXSTATUS_PROXY array.
- Item address format: `tag_name.AttributeName` for scalars, `tag_name.AttributeName[]` for whole arrays.
- If AddItem fails (e.g., tag does not exist), log Warning and return failure to caller.
- Bidirectional maps of `address ↔ itemHandle` are maintained for callback resolution.
- AddItem failure → Warning logged, failure propagated over the pipe to the Proxy.
- Bidirectional maps of `address ↔ itemHandle` maintained for callback resolution.
- Multi-client refcounting: two Proxy-side subscribe calls for the same address produce one MXAccess subscription; refcount decrement on the last unsubscribe triggers `UnAdvise` / `RemoveItem`.
### Details
- Use `AdviseSupervisory` (not `Advise`) because this is a background service with no interactive user session. AdviseSupervisory allows secured/verified writes without user authentication.
- Stored subscriptions dictionary maps address to callback for reconnect replay.
- On reconnect, all entries in stored subscriptions are re-subscribed (AddItem + AdviseSupervisory with new handles).
- `AdviseSupervisory` (not `Advise`) is used because this is a background service without an interactive user session.
- Stored subscriptions dictionary maps address callback for reconnect replay.
- On reconnect, every entry in stored subscriptions is re-subscribed (AddItem + AdviseSupervisory with new handles).
---
## MXA-004: Tag Read/Write
The client shall support synchronous-style read and write operations, marshalled to the STA thread, with configurable timeouts.
The Host shall support synchronous-style read and write operations, marshalled to the STA thread, with configurable timeouts.
### Acceptance Criteria
- Read: implemented as subscribe-get-first-value-unsubscribe pattern (AddItem → AdviseSupervisory → wait for OnDataChange → UnAdvise → RemoveItem).
- Read pattern: prefer cached subscription value; fall back to subscribe-get-first-value-unsubscribe (AddItem → AdviseSupervisory → wait for OnDataChange → UnAdvise → RemoveItem).
- Write: AddItem → AdviseSupervisory → `Write()` → await `OnWriteComplete` callback → cleanup.
- Read timeout: configurable via `MxAccess:ReadTimeoutSeconds`, default 5 seconds.
- Write timeout: configurable via `MxAccess:WriteTimeoutSeconds`, default 5 seconds. On timeout, log Warning and return timeout error.
- Concurrent operation limit: configurable semaphore via `MxAccess:MaxConcurrentOperations`, default 10.
- Read timeout: `Galaxy:ReadTimeoutSeconds` in driver config (default 5 seconds) — enforced on the Host side in addition to the Proxy-side Polly `Timeout` leg.
- Write timeout: `Galaxy:WriteTimeoutSeconds` (default 5 seconds) — enforced similarly.
- Concurrent operation limit: configurable semaphore (`Galaxy:MaxConcurrentOperations`, default 10).
- All operations marshalled to the STA thread.
### Details
- Write uses security classification -1 (no security). Galaxy runtime handles security enforcement.
- `OnWriteComplete` callback: check MXSTATUS_PROXY `success` field. If 0, extract detail code and propagate error.
- COM exceptions (`COMException` with HRESULT) are caught and translated to meaningful error messages.
- Write uses security classification `-1` (no security). Galaxy runtime enforces security; OtOpcUa authorization is enforced server-side before the call ever reaches the pipe (per OPC-014 `AuthorizationGate`).
- `OnWriteComplete`: check `MXSTATUS_PROXY.success`. If 0, extract detail code and propagate as an error over the pipe.
- COM exceptions translated to meaningful error messages.
---
## MXA-005: Auto-Reconnect
The client shall monitor connection health and automatically reconnect on failure, replaying all stored subscriptions after reconnect.
The Host shall monitor connection health and automatically reconnect on failure, replaying all stored subscriptions after reconnect.
### Acceptance Criteria
- Monitor loop runs on a background thread, checking connection health at configurable interval (`MxAccess:MonitorIntervalSeconds`, default 5 seconds).
- If disconnected, attempt reconnect. On success, replay all stored subscriptions.
- On reconnect failure, log Warning and retry at next interval (no exponential backoff — reconnect as quickly as possible on a plant-floor service).
- Monitor loop runs on a background thread at `Galaxy:MonitorIntervalSeconds` (default 5 seconds).
- On disconnect, attempt reconnect. On success, replay all stored subscriptions.
- On reconnect failure, log Warning and retry at next interval (no exponential backoff inside the Host; the Proxy-side Polly pipeline handles cross-process backoff against pipe failures).
- Reconnect count is incremented on each successful reconnect.
- Monitor loop is cancellable (for clean shutdown).
- Monitor loop is cancellable for clean Host shutdown.
### Details
- Reconnect cleans up old COM objects before creating new ones.
- After reconnect, probe subscription is re-established first, then stored subscriptions.
- No max retry limit — keep trying indefinitely until service is stopped.
- After reconnect, probe subscription (MXA-006) is re-established first, then stored subscriptions.
- No max retry limit — keep trying indefinitely until the Host service is stopped.
---
## MXA-006: Probe-Based Health Monitoring
The client shall optionally subscribe to a configurable probe tag and use OnDataChange callback staleness to detect silent connection failures.
The Host shall optionally subscribe to a configurable probe tag and use OnDataChange callback staleness to detect silent connection failures.
### Acceptance Criteria
- Subscribe to a configurable probe tag (a known-good Galaxy attribute that changes periodically).
- Probe tag address configured via `Galaxy:ProbeTag`. If unset, probe monitoring is disabled.
- Track `_lastProbeValueTime` (UTC) updated on each OnDataChange for the probe tag.
- If `DateTime.UtcNow - _lastProbeValueTime > staleThreshold`, force disconnect and reconnect.
- Probe tag address: configurable via `MxAccess:ProbeTag`. If not configured, probe monitoring is disabled.
- Stale threshold: configurable via `MxAccess:ProbeStaleThresholdSeconds`, default 60 seconds.
- Stale threshold: `Galaxy:ProbeStaleThresholdSeconds` (default 60 seconds).
- Implements `IHostConnectivityProbe` on the Proxy side so the core's `CapabilityInvoker` records probe outcomes with `DriverCapability.Probe` telemetry.
### Details
- The probe tag should be an attribute that the Galaxy runtime updates regularly (e.g., a platform heartbeat or area-level timestamp). The specific tag is site-dependent.
- After forced reconnect, reset `_lastProbeValueTime` to `DateTime.UtcNow` to give the new connection a full threshold window.
- The probe tag should be an attribute the Galaxy runtime updates regularly (platform heartbeat, area timestamp). Specific tag is site-dependent.
- After forced reconnect, reset `_lastProbeValueTime` to `DateTime.UtcNow`.
---
## MXA-007: COM Cleanup
On disconnect or disposal, the client shall unwire event handlers, unadvise/remove all items, unregister, and release COM objects via Marshal.ReleaseComObject.
On disconnect or disposal, the Host shall unwire event handlers, unadvise/remove all items, unregister, and release COM objects via `Marshal.ReleaseComObject`.
### Acceptance Criteria
- Cleanup order: UnAdvise all active subscriptions → RemoveItem all items → unwire OnDataChange and OnWriteComplete event handlers → Unregister → `Marshal.ReleaseComObject`.
- Cleanup order: UnAdvise all active subscriptions → RemoveItem all items → unwire OnDataChange and OnWriteComplete handlers → Unregister → `Marshal.ReleaseComObject`.
- On dispose: run disconnect if still connected, then dispose STA thread.
- Each cleanup step is wrapped in try/catch (cleanup must not throw).
- After cleanup: handle maps are cleared, pending write TCS entries are abandoned, COM reference is set to null.
- Each cleanup step wrapped in try/catch (cleanup must not throw).
- After cleanup: handle maps cleared, pending write TCS entries abandoned, COM reference set to null.
### Details
- `_storedSubscriptions` is NOT cleared on disconnect (preserved for reconnect replay). Only cleared on Dispose.
- Event handlers must be unwired BEFORE Unregister, or callbacks may fire on a dead object.
- `Marshal.ReleaseComObject` in a finally block, always, even if earlier steps fail.
- Stored subscriptions are NOT cleared on disconnect (preserved for reconnect replay). Only cleared on Dispose.
- Event handlers unwired BEFORE Unregister (else callbacks may fire on a dead object).
- `Marshal.ReleaseComObject` in a `finally` block, always.
---
## MXA-008: Operation Metrics
The MXAccess client shall record timing and success/failure for Read, Write, and Subscribe operations.
The MXAccess Host shall record timing and success/failure for Read, Write, and Subscribe operations.
### Acceptance Criteria
- Each operation records: duration (ms), success/failure.
- Metrics are available for the status dashboard: count, success rate, avg/min/max/P95 latency.
- Uses a rolling 1000-entry buffer for percentile calculation.
- Metrics are exposed via a queryable interface consumed by the status report service.
### Details
- Uses an `ITimingScope` pattern: `using (var scope = metrics.BeginOperation("read")) { ... }` for automatic timing and success tracking.
- Metrics are periodically logged at Debug level for diagnostics.
- Each operation records duration (ms) + success/failure.
- Metrics exposed over the pipe to the Proxy, which re-publishes them via OpenTelemetry → Prometheus under `DriverInstanceId = "galaxy-*"`, `HostName = "galaxy.host"`.
- Rolling 1000-entry buffer for percentile calculation.
- Uses an `ITimingScope` pattern: `using (var scope = metrics.BeginOperation("read")) { ... }`.
---
## MXA-009: Error Code Translation
The client shall translate known MXAccess error codes from MXSTATUS_PROXY.detail into human-readable messages for logging and OPC UA status propagation.
The Host shall translate known MXAccess error codes from `MXSTATUS_PROXY.detail` into human-readable messages for logging and OPC UA status propagation.
### Acceptance Criteria
- Error 1008 → "User lacks security permission"
- Error 1012 → "Secured write required (one signature)"
- Error 1013 → "Verified write required (two signatures)"
- Unknown error codes are logged with their numeric value.
- Translated messages are included in OPC UA StatusCode descriptions and log entries.
- Unknown error codes logged with their numeric value.
- Translated messages flow back through the pipe and surface in OPC UA `StatusCode` descriptions and Server logs.
- Errors 1008 / 1012 / 1013 on write operations map to `Bad_UserAccessDenied` at the OPC UA surface.
---
## MXA-010: Proxy-Side Capability Wrapping
`Driver.Galaxy.Proxy` shall implement the capability interfaces as thin forwarders that serialize every call through the named pipe and route every call through `CapabilityInvoker`.
### Acceptance Criteria
- `Driver.Galaxy.Proxy` implements `IDriver` + `IReadable` + `IWritable` + `ISubscribable` + `ITagDiscovery` + `IRediscoverable` + `IAlarmSource` + `IHistoryProvider` + `IHostConnectivityProbe`.
- Each implementation uses `CapabilityInvoker.InvokeAsync(DriverCapability.<...>, …)` — direct pipe calls bypassing the invoker are caught by Roslyn **OTOPCUA0001**.
- Each method serializes a MessagePack request frame, sends over the pipe, awaits the response frame, deserializes, returns.
- Pipe disconnect mid-call → `CapabilityInvoker`'s circuit breaker counts the failure; sustained disconnect opens the circuit and Galaxy nodes surface Bad quality until the pipe reconnects.
- Proxy tolerates Host service restarts — it automatically reconnects and replays subscription setup (parallel to MXA-005 but across the IPC boundary).
---
## MXA-011: Pipe Security
The named pipe between Proxy and Host shall be restricted to the Server's runtime principal via SID-based ACL and authenticated with a per-process shared secret.
### Acceptance Criteria
- Pipe name from `OTOPCUA_GALAXY_PIPE` environment variable; default `OtOpcUaGalaxy`.
- Allowed SID passed as `OTOPCUA_ALLOWED_SID` — only the declared principal (typically the Server service account) can open the pipe; `Administrators` is explicitly NOT granted (per the `project_galaxy_host_installed` memory note).
- Shared secret passed via `OTOPCUA_GALAXY_SECRET` at spawn time; the Proxy must present the matching secret on the opening handshake.
- Secret is process-scoped (regenerated per Host restart) and never persisted to disk or Config DB.
- Pipe ACL denials are logged as Warning with the rejected principal SID.
### Details
- Environment variables are passed by the supervisor launching the Host (`docs/v2/driver-stability.md`).
- Dev-box secret is stored at `.local/galaxy-host-secret.txt` for NSSM-wrapped development runs (memory note: `project_galaxy_host_installed`).

View File

@@ -1,234 +1,266 @@
# OPC UA Server — Component Requirements
Parent: [HLR-001](HighLevelReqs.md#hlr-001-opc-ua-server), [HLR-002](HighLevelReqs.md#hlr-002-galaxy-hierarchy-as-opc-ua-address-space), [HLR-004](HighLevelReqs.md#hlr-004-data-type-mapping)
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). OPC-001…OPC-013 have been rewritten driver-agnostically — they now describe how the core OPC UA server composes multiple driver subtrees, enforces authorization, and invokes capabilities through the Polly-wrapped dispatch path. OPC-014 through OPC-022 are new and cover capability dispatch, per-host Polly isolation, idempotence-aware write retry, `AuthorizationGate`, `ServiceLevel` reporting, the alarm surface, history surface, server-certificate management, and the transport-security profile matrix. Galaxy-specific behavior has been moved out to `GalaxyRepositoryReqs.md` and `MxAccessClientReqs.md`.
Parent: [HLR-001](HighLevelReqs.md#hlr-001-opc-ua-server), [HLR-003](HighLevelReqs.md#hlr-003-address-space-composition-per-namespace), [HLR-009](HighLevelReqs.md#hlr-009-transport-security-and-authentication), [HLR-010](HighLevelReqs.md#hlr-010-per-driver-instance-resilience), [HLR-013](HighLevelReqs.md#hlr-013-cluster-redundancy)
## OPC-001: Server Endpoint
The OPC UA server shall listen on a configurable TCP port (default 4840) using the OPC Foundation .NET Standard stack.
The OPC UA server shall listen on a configurable TCP endpoint using the OPC Foundation .NET Standard stack and expose a single endpoint URL per cluster node.
### Acceptance Criteria
- Server starts and accepts TCP connections on the configured port.
- Port is read from `appsettings.json` under `OpcUa:Port`; defaults to 4840 if absent.
- Endpoint URL format: `opc.tcp://<hostname>:<port>/LmxOpcUa`.
- If the port is in use at startup, log an Error and fail to start (do not silently pick another port).
- Security policy: None (no certificate validation). This is an internal plant-floor service.
- Endpoint URL comes from `ClusterNode.EndpointUrl` in the Config DB (default form `opc.tcp://<hostname>:<port>/OtOpcUa`).
- `ApplicationName` and `ApplicationUri` come from `ClusterNode` fields; `ApplicationUri` is unique per node so redundancy `ServerUriArray` entries are distinguishable.
- Port defaults to 4840. If the port is in use at startup the server shall log Error and fail to start (no silent port reassignment).
- Uses `OPCFoundation.NetStandard.Opc.Ua.Server` NuGet.
- Endpoint URL logged at Information level on startup.
### Details
- Configurable items: port (default 4840), endpoint path (default `/LmxOpcUa`), server application name (default `LmxOpcUa`).
- Server shall use the `OPCFoundation.NetStandard.Opc.Ua.Server` NuGet package.
- On startup, log the endpoint URL at Information level.
- Node-local `appsettings.json` only carries the `Config DB connection + NodeId + ClusterId` bootstrap — actual endpoint topology comes from the Config DB per HLR-011.
---
## OPC-002: Address Space Structure
## OPC-002: Address Space Composition
The server shall create folder nodes for areas and object nodes for automation objects, organized in the same parent-child hierarchy as the Galaxy.
The server shall compose an address space by mounting each active driver instance's subtree under a dedicated OPC UA namespace.
### Acceptance Criteria
- The root folder node has BrowseName `ZB` (hardcoded Galaxy name).
- Objects where `is_area = 1` are created as FolderType nodes (organizational).
- Objects where `is_area = 0` are created as BaseObjectType nodes.
- Parent-child relationships use Organizes references (for areas) and HasComponent references (for contained objects).
- A client browsing Root → Objects → ZB → DEV → TestArea → TestMachine_001 → DelmiaReceiver sees the same structure as `gr/layout.md`.
### Details
- NodeIds use a string-based identifier scheme: `ns=1;s=<tag_name>` for object nodes, `ns=1;s=<tag_name>.<attribute_name>` for variable nodes.
- Infrastructure objects (AppEngines, Platforms) are included in the tree but may have no variable children.
- When `contained_name` is null or empty, fall back to `tag_name` as the BrowseName.
- Each `DriverInstance` in the current published generation registers one `IDriver` implementation in the core.
- Each driver's `ITagDiscovery.DiscoverAsync` result is streamed into the core via `IAddressSpaceBuilder``AddFolder` / `AddVariable` calls; the driver does not buffer the whole tree.
- Each driver instance gets its own namespace index; `NamespaceUri` comes from the `Namespace` row in the Config DB.
- Each cluster has at most one namespace per `Kind` (`Equipment`, `SystemPlatform`, future `Simulated`); enforced by UNIQUE on `(ClusterId, Kind)` in the DB.
- Galaxy driver subtree preserves the contained-name browse structure from the deployed Galaxy (moved to `GalaxyRepositoryReqs.md`).
- Equipment-kind drivers populate the canonical 5-level UNS structure (`Enterprise/Site/Area/Line/Equipment/Signal`).
---
## OPC-003: Variable Nodes for Attributes
## OPC-003: Variable Nodes and Access Levels
Each user-defined attribute on a deployed object shall be represented as an OPC UA variable node under its parent object node.
Each tag produced by a driver's `ITagDiscovery` shall become an OPC UA variable node.
### Acceptance Criteria
- Each row from `attributes.sql` creates one variable node under the matching object node (matched by `gobject_id`).
- Variable node BrowseName and DisplayName are set to `attribute_name`.
- Variable node stores `full_tag_reference` as its runtime MXAccess address.
- Variable node AccessLevel is set based on the attribute's `security_classification` per the mapping in `gr/data_type_mapping.md`.
- FreeAccess (0), Operate (1), Tune (4), Configure (5) → AccessLevel = CurrentRead | CurrentWrite (3).
- SecuredWrite (2), VerifiedWrite (3), ViewOnly (6) → AccessLevel = CurrentRead (1).
- Objects with no user-defined attributes still appear as object nodes with zero children.
### Details
- Security classification determines the OPC UA AccessLevel and UserAccessLevel attributes on each variable node. The OPC UA stack enforces read-only access for nodes with CurrentRead-only access level.
- Attributes whose names start with `_` are already filtered by the SQL query.
- Variable node `BrowseName` and `DisplayName` come from `DriverAttributeInfo`.
- `DataType` is resolved from `DriverDataType` per each driver's spec in `docs/v2/driver-specs.md`.
- `AccessLevel` and `UserAccessLevel` are derived from the tag's `SecurityClassification` and the session's effective permissions walked through the node-ACL permission trie (see OPC-017 `AuthorizationGate`).
- Scalar attributes produce `ValueRank = Scalar`; array attributes produce `ValueRank = OneDimension` with `ArrayDimensions` set from the driver's attribute info.
---
## OPC-004: Browse Name Translation
## OPC-004: Namespace Index Allocation
Browse names shall use contained names (human-readable, scoped to parent). The server shall internally translate browse paths to tag_name references for MXAccess operations.
The server shall register one OPC UA namespace per active driver instance.
### Acceptance Criteria
- A variable node browsed as `ZB/DEV/TestArea/TestMachine_001/DelmiaReceiver/DownloadPath` correctly translates to MXAccess reference `DelmiaReceiver_001.DownloadPath`.
- Translation uses the `tag_name` stored on the parent object node, not the browse path.
- No runtime path parsing — the mapping is baked into each node at build time.
### Details
- Each variable node stores its `full_tag_reference` (e.g., `DelmiaReceiver_001.DownloadPath`) at address-space build time. Read/write operations use this stored reference directly.
- Namespace index 0 remains the standard OPC UA namespace.
- Each driver instance's `Namespace.Uri` becomes a registered namespace; its index is assigned deterministically at startup from the published generation's driver ordering.
- All variable NodeIds use the driver's namespace index; NodeId identifiers are string-shaped and stable across restarts of the same generation.
- Namespace index reshuffles are a publish-time concern; clients reconciling server-relative NodeIds must re-resolve namespace URIs after a new generation is applied.
---
## OPC-005: Data Type Mapping
## OPC-005: Read Operations
Variable nodes shall use OPC UA data types mapped from Galaxy mx_data_type values per the mapping in `gr/data_type_mapping.md`.
The server shall fulfill OPC UA `Read` requests by invoking `IReadable.ReadAsync` on the target driver instance, dispatched through `CapabilityInvoker`.
### Acceptance Criteria
- Every `mx_data_type` value in the mapping table produces the correct OPC UA DataType NodeId on the variable node.
- Unknown/unmapped `mx_data_type` values default to String (i=12).
- ElapsedTime (type 7) maps to Double representing seconds.
### Details
- Full mapping table in `gr/data_type_mapping.md`.
- DateTime conversion: Galaxy may store local time; convert to UTC for OPC UA.
- LocalizedText (type 15): use empty locale string with the text value.
- Every read call at dispatch passes through `Core.Resilience.CapabilityInvoker.InvokeAsync(DriverCapability.Read, …)`.
- Returned `DataValueSnapshot` is converted to an OPC UA `DataValue` with `StatusCode`, source timestamp, and server timestamp.
- If the owning driver instance's Polly circuit is open, the read returns Bad quality immediately without hitting the wire.
- Reads on a node the session has no `Read` bit for in the permission trie return `Bad_UserAccessDenied` before the capability is invoked (OPC-017).
- Read timeout is the Polly timeout leg on the `Read` capability; its duration is per-`(DriverInstanceId, HostName)` and comes from the Config DB.
---
## OPC-006: Array Support
## OPC-006: Write Operations
Attributes marked as arrays shall have ValueRank=1 and ArrayDimensions set to the attribute's array_dimension value.
The server shall fulfill OPC UA `Write` requests by invoking `IWritable.WriteAsync` through `CapabilityInvoker` with **idempotence-aware** retry policy.
### Acceptance Criteria
- `is_array = 1` produces ValueRank = 1 (OneDimension) and ArrayDimensions = `[array_dimension]`.
- `is_array = 0` produces ValueRank = -1 (Scalar) and no ArrayDimensions.
- MXAccess reference for array attributes uses `tag_name.attribute[]` (whole array) format.
### Details
- Individual array element access (`tag_name.attribute[n]`) is not required for initial implementation. Whole-array read/write only.
- If `array_dimension` is null or 0 when `is_array = 1`, log a Warning and default to ArrayDimensions = [0] (variable-length).
- Writes dispatch through `CapabilityInvoker.InvokeAsync(DriverCapability.Write, …)`.
- Writes **do not auto-retry** unless the tag's `TagConfig.WriteIdempotent = true`, or the driver's capability is marked with `[WriteIdempotent]` (decision #143).
- Writes on a node the session lacks the required permission bit for (`WriteOperate`, `WriteTune`, or `WriteConfigure` derived from the tag's `SecurityClassification`) return `Bad_UserAccessDenied` before the capability runs.
- A write into an open circuit returns a driver-shaped error (`Bad_NoCommunication` / `Bad_ServerNotConnected`) without hitting the wire.
- The server shall coerce the written OPC UA value to the driver's expected native type using the node's `DriverDataType` before calling `WriteAsync`.
- Writes to a NodeId not currently in the address space return `Bad_NodeIdUnknown`.
---
## OPC-007: Read Operations
## OPC-007: Subscriptions and Monitored Items
The server shall fulfill OPC UA Read requests by reading the corresponding tag value from MXAccess using the tag_name.AttributeName reference.
The server shall map OPC UA `CreateMonitoredItems` / `DeleteMonitoredItems` to `ISubscribable.SubscribeAsync` / `UnsubscribeAsync` on the owning driver instance.
### Acceptance Criteria
- OPC UA Read request for a variable node results in a read via MXAccess using the node's stored `full_tag_reference`.
- Returned value is converted from the COM variant to the OPC UA data type specified on the node.
- OPC UA StatusCode reflects MXAccess quality: Good maps to Good, Bad/Uncertain map appropriately.
- If MXAccess is not connected, return StatusCode = Bad_NotConnected.
- Read timeout: configurable, default 5 seconds. On timeout, return Bad_Timeout.
### Details
- Prefer cached subscription-delivered values over on-demand reads to reduce COM round-trips.
- If no subscription is active for the tag, perform an on-demand read (AddItem, AdviseSupervisory, wait for first OnDataChange, then UnAdvise/RemoveItem).
- Concurrency: semaphore-limited to configurable max (default 10) concurrent MXAccess operations.
- Subscription setup dispatches through `CapabilityInvoker.InvokeAsync(DriverCapability.Subscribe, …)`.
- Two OPC UA monitored items against the same tag produce exactly one driver-side subscription (ref-counted); last unsubscribe releases the driver-side resource.
- `OnDataChange` callbacks from the driver arrive as `DataValueSnapshot` and are forwarded to all OPC UA monitored items on that tag.
- Driver-side quality maps to OPC UA `StatusCode` per the driver's spec.
- When the owning driver's circuit opens, subscribed items publish Bad quality; when it resets, resumption publishes the cached or freshly-sampled value.
- Across generation applies that preserve a tag's NodeId, existing OPC UA monitored items are preserved (no re-subscribe required on the client).
---
## OPC-008: Write Operations
## OPC-008: Alarm Surface
The server shall fulfill OPC UA Write requests by writing to the corresponding tag via MXAccess.
The server shall expose the OPC UA alarm and condition model backed by each driver's `IAlarmSource` (where implemented).
### Acceptance Criteria
- OPC UA Write request results in an MXAccess `Write()` call with completion confirmed via `OnWriteComplete()` callback.
- Write timeout: configurable, default 5 seconds. On timeout, log Warning and return Bad_Timeout.
- MXSTATUS_PROXY with `success = 0` causes the OPC UA write to return Bad_InternalError with the detail message.
- MXAccess errors 1008 (no permission), 1012 (secured write), 1013 (verified write) return Bad_UserAccessDenied.
- Write to a non-existent tag returns Bad_NodeIdUnknown.
- The server shall attempt to convert the written value to the expected Galaxy data type before passing to Write().
### Details
- Write uses security classification -1 (no security). Galaxy runtime handles security enforcement.
- Write sequence: uses existing subscription handle if available, otherwise AddItem + AdviseSupervisory + Write + await OnWriteComplete + cleanup.
- Concurrent write limit: same semaphore as reads (configurable, default 10).
- Drivers implementing `IAlarmSource` (today: Galaxy, FOCAS, OPC UA Client) produce alarm events that the core maps onto OPC UA `ConditionType` / `AlarmConditionType` instances in the driver's namespace.
- `AlarmSubscribe` dispatches through `CapabilityInvoker.InvokeAsync(DriverCapability.AlarmSubscribe, …)` and retries on transient failure.
- `AlarmAcknowledge` from the OPC UA client dispatches through `CapabilityInvoker.InvokeAsync(DriverCapability.AlarmAcknowledge, …)` and **does not retry** (decision #143 — ack is a write-shaped operation).
- Alarm-ack requires the `AlarmAck` permission bit for the tag / equipment node; otherwise `Bad_UserAccessDenied`.
- Drivers that do not implement `IAlarmSource` contribute no alarm nodes; the core does not synthesize placeholder conditions.
---
## OPC-009: Subscriptions
## OPC-009: Historical Access
The server shall support OPC UA subscriptions by mapping them to MXAccess advisory subscriptions and forwarding data change notifications.
The server shall surface OPC UA Historical Access (HA) via each driver's `IHistoryProvider` (where implemented).
### Acceptance Criteria
- OPC UA CreateMonitoredItems results in MXAccess `AdviseSupervisory()` subscriptions for the requested tags.
- Data changes from `OnDataChange` callback are forwarded as OPC UA notifications to all subscribed clients.
- Shared subscriptions: if two OPC UA clients subscribe to the same tag, only one MXAccess subscription exists (ref-counted).
- Last subscriber unsubscribing triggers UnAdvise/RemoveItem on the MXAccess side.
- After MXAccess reconnect, all active MXAccess subscriptions are re-established automatically.
### Details
- Publishing interval from the OPC UA subscription request is honored on the OPC UA side; MXAccess delivers changes as fast as it receives them.
- OPC UA quality mapping from MXAccess quality integers: 192+ = Good, 64-191 = Uncertain, 0-63 = Bad.
- OnDataChange with MXSTATUS_PROXY failure: deliver notification with Bad quality to subscribed clients.
- `HistoryRead` for `Raw`, `Processed`, `AtTime`, and `Events` dispatches through `CapabilityInvoker.InvokeAsync(DriverCapability.HistoryRead, …)`.
- Drivers implementing `IHistoryProvider` today: Galaxy (Wonderware Historian), OPC UA Client (proxy to remote historian).
- Drivers not implementing `IHistoryProvider` return `Bad_HistoryOperationUnsupported` for history requests on their nodes.
- History reads require the `Read` permission bit on the target node.
---
## OPC-010: Address Space Rebuild
## OPC-010: Transport Security Profiles
When a Galaxy deployment change is detected, the server shall rebuild the address space without dropping existing OPC UA client connections where possible.
The server shall offer OPC UA transport-security profiles resolved at startup by `SecurityProfileResolver`.
### Acceptance Criteria
- When Galaxy Repository detects a deployment change, the OPC UA address space is updated.
- Only changed gobject subtrees are torn down and rebuilt; unchanged nodes, subscriptions, and alarm tracking remain intact.
- Existing OPC UA client sessions are preserved — clients stay connected.
- Subscriptions for tags on unchanged objects continue to work without interruption.
- Subscriptions for tags that no longer exist receive a Bad_NodeIdUnknown status notification.
- Sync is logged at Information level with the number of changed gobjects.
### Details
- Uses incremental subtree sync: compares previous hierarchy+attributes with new, identifies changed gobject IDs, expands to include child subtrees, tears down only affected subtrees, and rebuilds them.
- First build (no cached state) performs a full build.
- If no changes are detected, the sync is a no-op (logged and skipped).
- Alarm tracking and MXAccess subscriptions for unchanged objects are not disrupted.
- Falls back to full rebuild behavior if the entire hierarchy changes.
- Supported profiles: `None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`, `Aes128_Sha256_RsaOaep-Sign`, `Aes128_Sha256_RsaOaep-SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign`, `Aes256_Sha256_RsaPss-SignAndEncrypt`.
- Active profile list comes from `OpcUa.SecurityProfile` in `appsettings.json` (bootstrap config) or Config DB (per-cluster override).
- Server certificate is created at first startup even when only `None` is enabled, because UserName-token encryption depends on an ApplicationInstanceCertificate.
- Certificate store root path is configurable (default `%ProgramData%/OtOpcUa/pki/`).
- `AutoAcceptUntrustedClientCertificates` is a config flag; production deployments set it to `false` and operators add trusted client certs via the Admin UI Cert Trust screen.
---
## OPC-011: Server Diagnostics Node
## OPC-011: UserName Authentication
The server shall expose a ServerStatus node under the standard OPC UA Server object with ServerState, CurrentTime, and StartTime. This is required by the OPC UA specification for compliant servers.
The server shall validate `UserNameIdentityToken` credentials against LDAP (production: Active Directory; dev: GLAuth).
### Acceptance Criteria
- ServerState reports Running during normal operation.
- CurrentTime returns the server's current UTC time.
- StartTime returns the UTC time when the service started.
- If `Ldap.Enabled = false`, all UserName tokens are rejected (`BadUserAccessDenied`).
- When enabled, the server performs an LDAP bind using the supplied credentials via `LdapUserAuthenticator`.
- On successful bind, group memberships resolved from LDAP are mapped through `LdapOptions.GroupToRole` to produce the session's permission bits (`ReadOnly`, `WriteOperate`, `WriteTune`, `WriteConfigure`, `AlarmAck`).
- `LdapAuthenticationProvider` implements both `IUserAuthenticationProvider` and `IRoleProvider`.
- UserName tokens are always carried on an encrypted secure channel (either Sign-and-Encrypt transport, or encrypted token using the server certificate even on a `None` channel).
---
## OPC-012: Namespace Configuration
## OPC-012: Capability Dispatch via CapabilityInvoker
The server shall register a namespace URI at namespace index 1. All application-specific NodeIds shall use this namespace.
Every async capability-interface call the server makes shall route through `Core.Resilience.CapabilityInvoker`.
### Acceptance Criteria
- Namespace URI: `urn:ZB:LmxOpcUa` (Galaxy name is configurable).
- All object and variable NodeIds created from Galaxy data use namespace index 1.
- Standard OPC UA nodes remain in namespace 0.
- `CapabilityInvoker.InvokeAsync` resolves a Polly resilience pipeline keyed on `(DriverInstanceId, HostName, DriverCapability)`.
- Read / Discover / Probe / Subscribe / AlarmSubscribe / HistoryRead pipelines carry Timeout + Retry + CircuitBreaker strategies.
- Write / AlarmAcknowledge pipelines carry Timeout + CircuitBreaker only; Retry is enabled only when the tag or capability carries `[WriteIdempotent]` (decision #143).
- Roslyn diagnostic **OTOPCUA0001** fires on any direct call to a capability-interface method from outside `CapabilityInvoker` (enforced via `ZB.MOM.WW.OtOpcUa.Analyzers`).
---
## OPC-013: Session Management
## OPC-013: Per-Host Polly Isolation
Polly pipelines shall be keyed per `(DriverInstanceId, HostName, DriverCapability)` so that a failing device in one driver does not trip the circuit for another device on the same driver or any other driver (decision #144).
### Acceptance Criteria
- A driver serving `N` devices has `N × capabilityCount` distinct pipelines.
- Circuit-breaker state transitions are telemetry-published per pipeline and appear on the Admin UI + `/metrics`.
- A host-scope fault (e.g. shared PLC gateway) naturally trips all devices behind that host but leaves other hosts untouched.
---
## OPC-014: Authorization Gate and Permission Trie
`Security.AuthorizationGate` shall enforce node-level permissions on every browse, read, write, subscribe, alarm-ack, and history call before dispatch.
### Acceptance Criteria
- Permission bits for the session are assembled at login from LDAP group → role → permission mapping plus Config-DB `NodeAcl` rows that modify permission inheritance along the browse tree.
- The permission trie walks from the addressed node toward the root, inheriting permissions unless a `NodeAcl` overrides; first match wins.
- Missing `Read` bit → `Bad_UserAccessDenied` on Read / Subscribe / HistoryRead.
- Missing `Write*` bit (matching the tag's `SecurityClassification`) → `Bad_UserAccessDenied` on Write.
- Missing `AlarmAck` bit → `Bad_UserAccessDenied` on acknowledge.
- Authorization decisions are made at the server layer only — drivers never enforce authorization and only expose `SecurityClassification` metadata.
---
## OPC-015: ServiceLevel Reporting
The server shall expose a dynamic `ServiceLevel` value computed by `RedundancyCoordinator` + `ServiceLevelCalculator`.
### Acceptance Criteria
- `ServiceLevel` reflects: redundancy role (Primary higher than Secondary), publish state (current generation applied > mid-apply > failed-apply), driver health (any driver instance in open circuit lowers the value), apply-lease state.
- `ServiceLevel` is exposed as a Variable under the standard `Server` object and is readable by any authenticated client.
- Clients that observe Primary's `ServiceLevel` drop below Secondary's should failover per the OPC UA spec.
- Single-node deployments (`NodeCount = 1`) always publish their node as Primary.
---
## OPC-016: Session Management
The server shall support multiple concurrent OPC UA client sessions.
### Acceptance Criteria
- Maximum concurrent sessions: configurable, default 100.
- Session timeout: configurable, default 30 minutes of inactivity.
- Expired sessions are cleaned up and their subscriptions removed.
- Session count is reported to the status dashboard.
- Maximum concurrent sessions and session timeout come from Config DB cluster settings (default 100 sessions, 30-minute idle timeout).
- Expired sessions are cleaned up and their subscriptions and monitored items removed.
- Active session count is reported as a Prometheus gauge on the Admin `/metrics` endpoint.
---
## OPC-017: Address Space Rebuild on Generation Apply
When a new Config DB generation is applied, the server shall surgically update only the affected driver subtrees.
### Acceptance Criteria
- Apply compares the previous generation to the incoming generation and produces per-driver add / modify / remove sets.
- Existing OPC UA sessions, subscriptions, and monitored items are preserved across apply whenever the target NodeId survives the generation change.
- Tags that no longer exist post-apply emit `Bad_NodeIdUnknown` on their subscribed monitored items.
- During apply, the node's `ServiceLevel` is lowered (per `ServiceLevelCalculator`) so redundancy partners temporarily take precedence.
- Galaxy subtree rebuilds triggered by `IRediscoverable` (Galaxy deployment change) are scoped to the Galaxy driver's namespace and follow the same preservation rule (OPC-006 from the v1 file, now subsumed).
---
## OPC-018: Server Diagnostics Nodes
The server shall expose standard OPC UA `Server` object nodes required by the spec.
### Acceptance Criteria
- `ServerStatus` / `ServerState` / `CurrentTime` / `StartTime` populated and compliant with the OPC UA 1.05 spec.
- `ServerCapabilities` declares historical access capabilities for namespaces that have an `IHistoryProvider`-backed driver.
- `ServerRedundancy.RedundancySupport` reflects the cluster's redundancy mode (`None` / `Warm` / `Hot`).
- `ServerRedundancy.ServerUriArray` lists both cluster members' `ApplicationUri` values.
---
## OPC-019: Observability Hooks
The server shall emit OpenTelemetry metrics consumed by the Admin `/metrics` Prometheus endpoint.
### Acceptance Criteria
- Counters: capability calls per `DriverInstanceId` + `DriverCapability`, OPC UA requests per method, alarm events emitted, history reads, generation apply attempts.
- Histograms: capability-call duration per `DriverInstanceId` + `DriverCapability`, OPC UA request duration per method.
- Gauges: circuit-breaker state per pipeline, active OPC UA sessions, active monitored items, subscription queue depth, `ServiceLevel` value, memory-tracking watermarks (Phase 6.1).
- Metric cardinality is bounded — `DriverInstanceId` and `HostName` are the only high-cardinality labels, both controlled by the Config DB.

View File

@@ -1,117 +1,265 @@
# Service Host — Component Requirements
Parent: [HLR-006](HighLevelReqs.md#hlr-006-windows-service-hosting), [HLR-007](HighLevelReqs.md#hlr-007-logging)
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). v1 was a single Windows service; v2 ships **three cooperating Windows services** and the service-host requirements are rewritten per-process. SVC-001…SVC-006 from v1 are preserved in spirit (TopShelf, Serilog, config loading, graceful shutdown, startup sequence, unhandled-exception handling) but are now scoped to the process they apply to. SRV-* prefixes the Server process, ADM-* the Admin process, GHX-* the Galaxy Host process. A shared-requirements section at the top covers cross-process concerns (Serilog, logging rotation, bootstrap config scope).
## SVC-001: TopShelf Hosting
Parent: [HLR-007](HighLevelReqs.md#hlr-007-service-hosting), [HLR-008](HighLevelReqs.md#hlr-008-logging), [HLR-011](HighLevelReqs.md#hlr-011-config-db-and-draft-publish)
The application shall use TopShelf for Windows service lifecycle (install, uninstall, start, stop) and interactive console mode for development.
## Shared Requirements (all three processes)
### Acceptance Criteria
### SVC-SHARED-001: Serilog Logging
- TopShelf HostFactory configures the service with name `LmxOpcUa`, display name `LMX OPC UA Server`.
- Service installs via command line: `ZB.MOM.WW.OtOpcUa.Host.exe install`.
- Service uninstalls via: `ZB.MOM.WW.OtOpcUa.Host.exe uninstall`.
- Service runs as LocalSystem account (needed for MXAccess COM access and Windows Auth to SQL Server).
- Interactive console mode (exe with no args) works for development/debugging.
- `StartAutomatically` is set for Windows service registration.
Every process shall use Serilog with a rolling daily file sink at Information level minimum, plus a console sink, plus opt-in CompactJsonFormatter file sink.
### Details
#### Acceptance Criteria
- Platform target: x86 (32-bit) — required for MXAccess COM interop.
- Service description: "OPC UA server exposing System Platform Galaxy tags via MXAccess."
- Console sink active on every process (for interactive / debug mode).
- Rolling daily file sink:
- Server: `logs/otopcua-YYYYMMDD.log`
- Admin: `logs/otopcua-admin-YYYYMMDD.log`
- Galaxy Host: `%ProgramData%\OtOpcUa\galaxy-host-YYYYMMDD.log`
- Retention count and min level configurable via `Serilog:*` in each process's `appsettings.json`.
- JSON sink opt-in via `Serilog:WriteJson = true` (emits `*.json.log` alongside the plain-text file) for SIEM ingestion.
- `Log.CloseAndFlush()` invoked in a `finally` block on shutdown.
- Structured logging (Serilog message templates) — no `string.Format`.
---
## SVC-002: Serilog Logging
### SVC-SHARED-002: Bootstrap Configuration Scope
The application shall configure Serilog with a rolling daily file sink and console sink, with log files retained for a configurable number of days (default 31).
`appsettings.json` is bootstrap-only per HLR-011. Operational configuration (clusters, drivers, namespaces, tags, ACLs, poll groups) lives in the Config DB.
### Acceptance Criteria
#### Acceptance Criteria
- Console sink active (for interactive/debug mode).
- Rolling daily file sink writing to `logs/lmxopcua-YYYYMMDD.log`.
- Retained file count: configurable, default 31 days.
- Minimum log level: configurable, default Information.
- Log file path: configurable, default `logs/lmxopcua-.log`.
- Serilog is initialized before any other component (first thing in Main).
- `Log.CloseAndFlush()` called in finally block on exit.
### Details
- Structured logging with Serilog message templates (not string.Format).
- Log output includes timestamp, level, source context, message, and exception.
- Fatal exceptions are caught at the top level and logged before exit.
- `appsettings.json` may contain only: Config DB connection string, `Node:NodeId`, `Node:ClusterId`, `Node:LocalCachePath`, `OpcUa:*` security bootstrap fields, `Ldap:*` bootstrap fields, `Serilog:*`, `Redundancy:*` role id.
- Any attempt to configure driver instances, tags, or equipment through `appsettings.json` shall be rejected at startup with a descriptive error.
- Invalid or missing required bootstrap fields are detected at startup with a clear error (`"Node:NodeId not configured"` style).
---
## SVC-003: Configuration
## OtOpcUa.Server — Service Host Requirements (SRV-*)
The application shall load configuration from appsettings.json with support for environment-specific overrides (appsettings.*.json) and environment variables.
### SRV-001: Microsoft.Extensions.Hosting + AddWindowsService
### Acceptance Criteria
The Server shall use `Host.CreateApplicationBuilder(args)` with `AddWindowsService(o => o.ServiceName = "OtOpcUa")` to run as a Windows service.
- `appsettings.json` is the primary configuration file.
- Environment-specific overrides via `appsettings.{environment}.json`.
- Configuration sections: `OpcUa`, `MxAccess`, `GalaxyRepository`, `Dashboard`.
- Missing optional configuration keys use documented defaults (service does not crash).
- Invalid configuration (e.g., port = -1) is detected at startup with a clear error message.
#### Acceptance Criteria
### Details
- Config is loaded once at startup. No hot-reload (service restart required for config changes). This is appropriate for an industrial service.
- All configurable values and their defaults are documented in `appsettings.json`.
- Service name `OtOpcUa`.
- Installs via standard `sc.exe` tooling or the build-provided installer.
- Runs as a configured service account (typically a domain service account with Config DB read access; Windows Auth to SQL Server).
- Console mode (running `ZB.MOM.WW.OtOpcUa.Server.exe` with no Windows service context) works for development and debugging.
- Platform target: .NET 10 x64 (default per decision in `plan.md` §3).
---
## SVC-004: Graceful Shutdown
### SRV-002: Startup Sequence
On service stop, the application shall gracefully shut down all components and flush logs before exiting.
The Server shall start components in a defined order, with failure handling at each step.
### Acceptance Criteria
- TopShelf WhenStopped triggers orderly shutdown.
- Shutdown sequence: (1) stop change detection polling, (2) stop OPC UA server (stop accepting new sessions, complete pending operations), (3) disconnect MXAccess (cleanup all COM objects), (4) stop status dashboard HTTP listener, (5) flush Serilog.
- Shutdown completes within 30 seconds (Windows SCM timeout).
- All IDisposable components are disposed in reverse-creation order.
### Details
- `CancellationTokenSource` signals all background loops (monitor, change detection, HTTP listener) to stop.
- Log "Service shutdown complete" at Information level as the final log entry before flush.
---
## SVC-005: Startup Sequence
The service shall start components in a defined order, with failure handling at each step.
### Acceptance Criteria
#### Acceptance Criteria
- Startup sequence:
1. Load configuration
2. Initialize Serilog
3. Start STA thread
4. Connect to MXAccess
5. Query Galaxy Repository for initial build
6. Build OPC UA address space
7. Start OPC UA server listener
8. Start change detection polling
9. Start status dashboard HTTP listener
- Failure in steps 1-4 prevents startup (service fails to start).
- Failure in steps 5-9 logs Error but allows the service to run in degraded mode.
### Details
- Degraded mode means the service is running but may have an empty address space (waiting for Galaxy DB) or no dashboard (port conflict). MXAccess connection is the minimum required for the service to be useful.
1. Load `appsettings.json` bootstrap configuration + initialize Serilog.
2. Validate bootstrap fields (NodeId, ClusterId, Config DB connection).
3. Initialize `OpcUaApplicationHost` (server-certificate resolution via `SecurityProfileResolver`).
4. Connect to Config DB; request current published generation for `ClusterId`.
5. If unreachable, fall back to `LiteDbConfigCache` (latest applied generation).
6. Apply generation: register driver instances, build namespaces, wire capability pipelines.
7. Start `OpcUaServerService` hosted service (opens endpoint listener).
8. Start `HostStatusPublisher` (pushes `ClusterNodeGenerationState` to Config DB for Admin UI SignalR consumers).
9. Start `RedundancyCoordinator` + `ServiceLevelCalculator`.
- Failure in steps 1-3 prevents startup.
- Failure in steps 4-6 logs Error and enters degraded mode (empty namespaces, `DriverHealth.Unavailable` on every driver, `ServiceLevel = 0`).
- Failure in steps 7-9 logs Error and shuts down (endpoint is non-optional).
---
## SVC-006: Unhandled Exception Handling
### SRV-003: Graceful Shutdown
The service shall handle unexpected crashes gracefully.
On service stop, the Server shall gracefully shut down all driver instances, the OPC UA listener, and flush logs before exiting.
### Acceptance Criteria
#### Acceptance Criteria
- Register `AppDomain.CurrentDomain.UnhandledException` handler that logs Fatal before the process terminates.
- TopShelf service recovery is configured: restart on failure with 60-second delay.
- Fatal-level log entry includes the full exception details.
- `IHostApplicationLifetime.ApplicationStopping` triggers orderly shutdown.
- Shutdown sequence: stop `HostStatusPublisher` → stop driver instances (disconnect each via `IDriver.DisposeAsync`, which for Galaxy tears down the named pipe) → stop OPC UA server (stop accepting new sessions, complete pending reads/writes) → flush Serilog.
- Shutdown completes within 30 seconds (Windows SCM timeout).
- All `IDisposable` / `IAsyncDisposable` components disposed in reverse-creation order.
- Final log entry: `"OtOpcUa.Server shutdown complete"` at Information level.
---
### SRV-004: Unhandled Exception Handling
The Server shall handle unexpected crashes gracefully.
#### Acceptance Criteria
- Registers `AppDomain.CurrentDomain.UnhandledException` handler that logs Fatal before the process terminates.
- Windows service recovery configured: restart on failure with 60-second delay.
- Fatal log entry includes full exception details.
---
### SRV-005: Drivers Hosted In-Process
All drivers except Galaxy run in-process within the Server.
#### Acceptance Criteria
- Modbus TCP, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client drivers are resolved from the DI container and managed by `DriverHost`.
- Galaxy driver in-process component is `Driver.Galaxy.Proxy`, which forwards to `OtOpcUa.Galaxy.Host` over the named pipe (see GHX-*).
- Each driver instance's lifecycle (connect, discover, subscribe, dispose) is orchestrated by `DriverHost`.
---
### SRV-006: Redundancy-Node Bootstrap
The Server shall bootstrap its redundancy identity from `appsettings.json` and the Config DB.
#### Acceptance Criteria
- `Node:NodeId` + `Node:ClusterId` identify this node uniquely; the `Redundancy` coordinator looks up `ClusterNode.RedundancyRole` (Primary / Secondary) from the Config DB.
- Two nodes of the same cluster connect to the same Config DB and the same ClusterId but have different NodeIds and different `ApplicationUri` values.
- Missing or ambiguous `(ClusterId, NodeId)` causes startup failure.
---
## OtOpcUa.Admin — Service Host Requirements (ADM-*)
### ADM-001: ASP.NET Core Blazor Server
The Admin app shall use `WebApplication.CreateBuilder` with Razor Components (`AddRazorComponents().AddInteractiveServerComponents()`), SignalR, and cookie authentication.
#### Acceptance Criteria
- Blazor Server (not WebAssembly) per `plan.md` §Tech Stack.
- Hosts SignalR hubs for live cluster state (used by `ClusterNodeGenerationState` views, crash-loop alerts, etc.).
- Runs as a Windows service via `AddWindowsService` OR as a standard ASP.NET Core process behind IIS / reverse proxy (site decides).
- Platform target: .NET 10 x64.
---
### ADM-002: Authentication and Authorization
Admin users authenticate via LDAP bind with cookie auth; three admin roles gate operations.
#### Acceptance Criteria
- Cookie auth scheme: `OtOpcUa.Admin`, 8-hour expiry, path `/login` for challenge.
- LDAP bind via `LdapAuthService`; user group memberships map to admin roles (`ConfigViewer`, `ConfigEditor`, `FleetAdmin`).
- Authorization policies:
- `CanEdit` requires `ConfigEditor` or `FleetAdmin`.
- `CanPublish` requires `FleetAdmin`.
- View-only access requires `ConfigViewer` (or higher).
- Unauthenticated requests to any Admin page redirect to `/login`.
- Per-cluster role grants layer on top: a `ConfigEditor` with no grant for cluster X can view it but not edit.
---
### ADM-003: Config DB as Sole Write Path
The Admin service shall be the only process with write access to the Config DB.
#### Acceptance Criteria
- EF Core `OtOpcUaConfigDbContext` configured with the SQL login / connection string that has read+write permission on config tables.
- Server nodes connect with a read-only principal (`grant SELECT` only).
- Admin writes produce draft-generation rows; publish writes are atomic and transactional.
- Every write is audited via `AuditLogService` per ADM-006.
---
### ADM-004: Prometheus /metrics Endpoint
The Admin service shall expose an OpenTelemetry → Prometheus metrics endpoint at `/metrics`.
#### Acceptance Criteria
- `OpenTelemetry.Metrics` registered with Prometheus exporter.
- `/metrics` scrapeable without authentication (standard Prometheus pattern) OR gated behind an infrastructure allow-list (site-configurable).
- Exports metrics from Server nodes of managed clusters (aggregated via Config DB heartbeat telemetry) plus Admin-local metrics (login attempts, publish duration, active sessions).
---
### ADM-005: Graceful Shutdown
On shutdown, the Admin service shall disconnect SignalR clients cleanly, finish in-flight DB writes, and flush Serilog.
#### Acceptance Criteria
- `IHostApplicationLifetime.ApplicationStopping` closes SignalR hub connections gracefully.
- In-flight publish transactions are allowed to complete up to 30 seconds.
- Final log entry: `"OtOpcUa.Admin shutdown complete"`.
---
### ADM-006: Audit Logging
Every publish and every ACL / role-grant change shall produce an immutable audit row via `AuditLogService`.
#### Acceptance Criteria
- Audit rows include: timestamp (UTC), acting principal (LDAP DN + display name), action, entity kind + id, before/after generation number where applicable, session id, source IP.
- Audit rows are never mutated or deleted by application code.
- Audit table schema enforces immutability via DB permissions (no UPDATE / DELETE granted to the Admin app's principal).
---
## OtOpcUa.Galaxy.Host — Service Host Requirements (GHX-*)
### GHX-001: TopShelf Windows Service Hosting
The Galaxy Host shall use TopShelf for Windows service lifecycle (install, uninstall, start, stop) and interactive console mode.
#### Acceptance Criteria
- Service name `OtOpcUaGalaxyHost`, display name `OtOpcUa Galaxy Host`.
- Installs via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe install`.
- Uninstalls via `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe uninstall`.
- Runs as a configured user account (typically the same account as the Server, or a dedicated Galaxy service account with ArchestrA platform access).
- Interactive console mode (no args) for development / debugging.
- Platform target: **.NET Framework 4.8 x86** — required for MXAccess COM 32-bit interop.
- Development deployments may use NSSM in place of TopShelf (memory: `project_galaxy_host_installed`).
### Details
- Service description: "OtOpcUa Galaxy Host — MXAccess + Galaxy Repository backend for the Galaxy driver, named-pipe IPC to OtOpcUa.Server."
---
### GHX-002: Named-Pipe IPC Bootstrap
The Host shall open a named pipe on startup whose name, ACL, and shared secret come from environment variables supplied by the supervisor at spawn time.
#### Acceptance Criteria
- `OTOPCUA_GALAXY_PIPE` → pipe name (default `OtOpcUaGalaxy`).
- `OTOPCUA_ALLOWED_SID` → SID of the principal allowed to connect; any other principal is denied at the ACL layer.
- `OTOPCUA_GALAXY_SECRET` → per-process shared secret; `Driver.Galaxy.Proxy` must present it on handshake.
- `OTOPCUA_GALAXY_BACKEND``stub` / `db` / `mxaccess` (default `mxaccess`) — selects which backend implementation is loaded.
- Missing `OTOPCUA_ALLOWED_SID` or `OTOPCUA_GALAXY_SECRET` at startup throws with a descriptive error.
---
### GHX-003: Backend Lifecycle
The Host shall instantiate the STA pump + MXAccess backend + Galaxy Repository + optional Historian plugin in a defined order and tear them down cleanly on shutdown.
#### Acceptance Criteria
- Startup (mxaccess backend): initialize Serilog → resolve env vars → create `PipeServer` → start `StaPump` → create `MxAccessClient` on STA thread → initialize `GalaxyRepository` → optionally initialize Historian plugin → begin pipe request handling.
- Shutdown: stop pipe → dispose MxAccessClient (MXA-007 COM cleanup) → dispose STA pump → flush Serilog.
- Shutdown must complete within 30 seconds (Windows SCM timeout).
- `Console.CancelKeyPress` triggers the same sequence in console mode.
---
### GHX-004: Unhandled Exception Handling
The Host shall log Fatal on crash and let the supervisor restart it.
#### Acceptance Criteria
- `AppDomain.CurrentDomain.UnhandledException` handler logs Fatal with full exception details before termination.
- The supervisor's driver-stability policy (`docs/v2/driver-stability.md`) governs restart behavior — backoff, crash-loop detection, and alerting live there, not in the Host.
- Server-side: `Driver.Galaxy.Proxy` detects pipe disconnect, opens its capability circuit, reports Bad quality on Galaxy nodes; reconnects automatically when the Host is back.

View File

@@ -1,157 +1,29 @@
# Status Dashboard — Component Requirements
# Status Dashboard — Retired
Parent: [HLR-009](HighLevelReqs.md#hlr-009-status-dashboard)
> **Revision** — Retired 2026-04-19 (task #205). The embedded HTTP Status Dashboard hosted inside the v1 LmxOpcUa service (`Dashboard:Port 8081`) has been **superseded by the Admin UI** introduced in OtOpcUa v2. The requirements formerly numbered DASH-001 through DASH-009 no longer apply.
Reference: LmxProxy Status Dashboard (see `dashboard.JPG` in project root).
## What replaces it
## DASH-001: Embedded HTTP Endpoint
Operator surface is now the **OtOpcUa Admin** Blazor Server web app:
The service shall host a lightweight HTTP listener on a configurable port serving a self-contained HTML status dashboard page (no external dependencies).
- Canonical design doc: `docs/v2/admin-ui.md`
- High-level operator surface requirement: [HLR-015](HighLevelReqs.md#hlr-015-admin-ui-operator-surface)
- Service-host requirements for the Admin process: [ServiceHostReqs.md → ADM-*](ServiceHostReqs.md#otopcua-admin---service-host-requirements-adm-)
- Cross-cluster metrics endpoint: `/metrics` on the Admin app — see [HLR-017](HighLevelReqs.md#hlr-017-prometheus-metrics).
- Audit log: see [HLR-016](HighLevelReqs.md#hlr-016-audit-logging) and `AuditLogService`.
### Acceptance Criteria
## Mapping from retired DASH-* requirements to today's surface
- Uses `System.Net.HttpListener` on a configurable port (`Dashboard:Port`, default 8081).
- Routes:
- `GET /` → HTML dashboard
- `GET /api/status` → JSON status report
- `GET /api/health` → 200 OK if healthy, 503 if unhealthy
- Only GET requests accepted; other methods return 405.
- Unknown paths return 404.
- All responses include `Cache-Control: no-cache, no-store, must-revalidate` headers.
- Dashboard can be disabled via config (`Dashboard:Enabled`, default true).
| Retired requirement | Replacement |
|---------------------|-------------|
| DASH-001 Embedded HTTP listener | Admin UI (Blazor Server) hosted in the `OtOpcUa.Admin` process. |
| DASH-002 Connection panel | Admin UI cluster-node view (live via SignalR) shows per-driver connection state. |
| DASH-003 Health panel | Admin UI renders `DriverHealth` + Polly circuit state per driver instance; cluster-level rollup on the cluster dashboard. |
| DASH-004 Subscriptions panel | Prometheus gauges (session count, monitored-item count, driver-subscription count) exposed via `/metrics`. |
| DASH-005 Operations table | Capability-call duration histograms + counts exposed via `/metrics`; Admin UI renders latency summaries per `DriverInstanceId`. |
| DASH-006 Footer (last-updated + version) | Admin UI footer; version stamped from the assembly version of the Admin app. |
| DASH-007 Auto-refresh | Admin UI uses SignalR push for live updates — no meta-refresh. |
| DASH-008 JSON status API | Prometheus `/metrics` endpoint is the programmatic surface. |
| DASH-009 Galaxy info panel | Admin UI Galaxy-driver-instance detail view (driver config, last discovery time, Galaxy DB connection state, MXAccess pipe health). |
### Details
- HTTP prefix: `http://+:{port}/` to bind to all interfaces.
- If HttpListener fails to start (port conflict, missing URL reservation), log Error and continue service startup without the dashboard.
- HTML page is self-contained: inline CSS, no external resources (no CDN, no JavaScript frameworks).
---
## DASH-002: Connection Panel
The dashboard shall display a Connection panel showing MXAccess connection state.
### Acceptance Criteria
- Shows: **Connected** (True/False), **State** (Connected/Disconnected/Reconnecting/Error), **Connected Since** (UTC timestamp).
- Green left border when Connected, red when Disconnected/Error, yellow when Reconnecting.
- "Connected Since" shows "N/A" when not connected.
- Data sourced from MXAccess client's connection state properties.
### Details
- Timestamp format: `yyyy-MM-dd HH:mm:ss UTC`.
- Panel title: "Connection".
---
## DASH-003: Health Panel
The dashboard shall display a Health panel showing overall service health.
### Acceptance Criteria
- Three states: **Healthy** (green text), **Degraded** (yellow text), **Unhealthy** (red text).
- Includes a health message string explaining the status.
- Health rules:
- Not connected to MXAccess → Unhealthy
- Success rate < 50% with > 100 total operations → Degraded
- Connected with acceptable success rate → Healthy
### Details
- Health message examples: "LmxOpcUa is healthy", "MXAccess client is not connected", "Average success rate is below 50%".
- Green left border for Healthy, yellow for Degraded, red for Unhealthy.
---
## DASH-004: Subscriptions Panel
The dashboard shall display a Subscriptions panel showing subscription statistics.
### Acceptance Criteria
- Shows: **Clients** (connected OPC UA client count), **Tags** (total variable nodes in address space), **Active** (active MXAccess subscriptions), **Delivered** (cumulative data change notifications delivered).
- Values update on each dashboard refresh.
- Zero values shown as "0", not blank.
### Details
- "Tags" is the count of variable nodes, not object/folder nodes.
- "Active" is the count of distinct MXAccess item subscriptions (after ref-counting — the number of actual AdviseSupervisory calls, not the number of OPC UA monitored items).
- "Delivered" is a running counter since service start (not reset on reconnect).
---
## DASH-005: Operations Table
The dashboard shall display an operations metrics table showing performance statistics.
### Acceptance Criteria
- Table with columns: **Operation**, **Count**, **Success Rate**, **Avg (ms)**, **Min (ms)**, **Max (ms)**, **P95 (ms)**.
- Rows: Read, Write, Subscribe, Browse.
- Empty cells show em-dash ("—") when no data available (count = 0).
- Success rate displayed as percentage (e.g., "99.8%").
- Latency values rounded to 1 decimal place.
### Details
- Metrics sourced from the PerformanceMetrics component (1000-entry rolling buffer for percentile calculation).
- "Browse" row tracks OPC UA browse operations.
- "Subscribe" row tracks OPC UA CreateMonitoredItems operations.
---
## DASH-006: Footer
The dashboard shall display a footer with last-updated time and service identification.
### Acceptance Criteria
- Format: "Last updated: {timestamp} UTC | Service: ZB.MOM.WW.OtOpcUa.Host v{version}".
- Timestamp is the server-side UTC time when the HTML was generated.
- Version is read from the assembly version (`Assembly.GetExecutingAssembly().GetName().Version`).
---
## DASH-007: Auto-Refresh
The dashboard page shall auto-refresh to show current status without manual reload.
### Acceptance Criteria
- HTML page includes `<meta http-equiv="refresh" content="10">` for 10-second auto-refresh.
- No JavaScript required for refresh (pure HTML meta-refresh).
- Refresh interval: configurable via `Dashboard:RefreshIntervalSeconds`, default 10 seconds.
---
## DASH-008: JSON Status API
The `/api/status` endpoint shall return a JSON object with all dashboard data for programmatic consumption.
### Acceptance Criteria
- Response Content-Type: `application/json`.
- JSON structure includes: connection state, health status, subscription statistics, and operation metrics.
- Same data as the HTML dashboard, structured for machine consumption.
- Suitable for integration with external monitoring tools.
---
## DASH-009: Galaxy Info Panel
The dashboard shall display a Galaxy Info panel showing Galaxy Repository state.
### Acceptance Criteria
- Shows: **Galaxy Name** (e.g., ZB), **DB Status** (Connected/Disconnected), **Last Deploy** (timestamp from `galaxy.time_of_last_deploy`), **Objects** (count), **Attributes** (count), **Last Rebuild** (timestamp of last address space rebuild).
- Provides visibility into the Galaxy Repository component's state independently of MXAccess connection status.
### Details
- "DB Status" reflects whether the most recent change detection poll succeeded.
- "Last Deploy" shows the raw `time_of_last_deploy` value from the Galaxy database.
- "Objects" and "Attributes" show counts from the most recent successful hierarchy/attribute query.
A formal requirements-level doc for the Admin UI (AdminUiReqs.md) is not yet written — the design doc at `docs/v2/admin-ui.md` serves as the authoritative reference until formal cert-compliance requirements are needed.

View File

@@ -1,15 +1,28 @@
# Transport Security
# Security
## Overview
OtOpcUa has four independent security concerns. This document covers all four:
The LmxOpcUa server supports configurable transport security profiles that control how data is protected on the wire between OPC UA clients and the server.
1. **Transport security** — OPC UA secure channel (signing, encryption, X.509 trust).
2. **OPC UA authentication** — Anonymous / UserName / X.509 session identities; UserName tokens authenticated by LDAP bind.
3. **Data-plane authorization** — who can browse, read, subscribe, write, acknowledge alarms on which nodes. Evaluated by `PermissionTrie` against the Config DB `NodeAcl` tree.
4. **Control-plane authorization** — who can view or edit fleet configuration in the Admin UI. Gated by the `AdminRole` (`ConfigViewer` / `ConfigEditor` / `FleetAdmin`) claim from `LdapGroupRoleMapping`.
Transport security and OPC UA authentication are per-node concerns configured in the Server's bootstrap `appsettings.json`. Data-plane ACLs and Admin role grants live in the Config DB.
---
## Transport Security
### Overview
The OtOpcUa Server supports configurable OPC UA transport security profiles that control how data is protected on the wire between OPC UA clients and the server.
There are two distinct layers of security in OPC UA:
- **Transport security** -- secures the communication channel itself using TLS-style certificate exchange, message signing, and encryption. This is what the `Security` configuration section controls.
- **UserName token encryption** -- protects user credentials (username/password) sent during session activation. The OPC UA stack encrypts UserName tokens using the server's application certificate regardless of the transport security mode. This means UserName authentication works on `None` endpoints too — the credentials themselves are always encrypted. However, a secure transport profile adds protection against message-level tampering and eavesdropping of data payloads.
- **Transport security** -- secures the communication channel itself using TLS-style certificate exchange, message signing, and encryption. This is what the `OpcUaServer:SecurityProfile` setting controls.
- **UserName token encryption** -- protects user credentials (username/password) sent during session activation. The OPC UA stack encrypts UserName tokens using the server's application certificate regardless of the transport security mode. UserName authentication therefore works on `None` endpoints too — the credentials themselves are always encrypted. A secure transport profile adds protection against message-level tampering and eavesdropping of data payloads.
## Supported Security Profiles
### Supported security profiles
The server supports seven transport security profiles:
@@ -23,334 +36,88 @@ The server supports seven transport security profiles:
| `Aes256_Sha256_RsaPss-Sign` | Aes256_Sha256_RsaPss | Sign | Strongest profile with AES-256 and RSA-PSS signatures. |
| `Aes256_Sha256_RsaPss-SignAndEncrypt` | Aes256_Sha256_RsaPss | SignAndEncrypt | Strongest profile. Recommended for high-security deployments. |
Multiple profiles can be enabled simultaneously. The server exposes a separate endpoint for each configured profile, and clients select the one they prefer during connection.
The server exposes a separate endpoint for each configured profile, and clients select the one they prefer during connection.
If no valid profiles are configured (or all names are unrecognized), the server falls back to `None` with a warning in the log.
### Configuration
## Configuration
Transport security is configured in the `Security` section of `appsettings.json`:
Transport security is configured in the `OpcUaServer` section of the Server process's bootstrap `appsettings.json`:
```json
{
"Security": {
"Profiles": ["None"],
"AutoAcceptClientCertificates": true,
"RejectSHA1Certificates": true,
"MinimumCertificateKeySize": 2048,
"PkiRootPath": null,
"CertificateSubject": null
"OpcUaServer": {
"EndpointUrl": "opc.tcp://0.0.0.0:4840/OtOpcUa",
"ApplicationName": "OtOpcUa Server",
"ApplicationUri": "urn:node-a:OtOpcUa",
"PkiStoreRoot": "C:/ProgramData/OtOpcUa/pki",
"AutoAcceptUntrustedClientCertificates": false,
"SecurityProfile": "Basic256Sha256-SignAndEncrypt"
}
}
```
### Properties
The server certificate is auto-generated on first start if none exists in `PkiStoreRoot/own/`. Always generated even for `None`-only deployments because UserName token encryption depends on it.
| Property | Type | Default | Description |
|--------------------------------|------------|--------------------------------------------------|-------------|
| `Profiles` | `string[]` | `["None"]` | List of security profile names to expose as server endpoints. Valid values: `None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`, `Aes128_Sha256_RsaOaep-Sign`, `Aes128_Sha256_RsaOaep-SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign`, `Aes256_Sha256_RsaPss-SignAndEncrypt`. Profile names are case-insensitive. Duplicates are ignored. |
| `AutoAcceptClientCertificates` | `bool` | `true` | When `true`, the server automatically trusts client certificates that are not already in the trusted store. Set to `false` in production for explicit trust management. |
| `RejectSHA1Certificates` | `bool` | `true` | When `true`, client certificates signed with SHA-1 are rejected. SHA-1 is considered cryptographically weak. |
| `MinimumCertificateKeySize` | `int` | `2048` | Minimum RSA key size (in bits) required for client certificates. Certificates with shorter keys are rejected. |
| `PkiRootPath` | `string?` | `null` (defaults to `%LOCALAPPDATA%\OPC Foundation\pki`) | Override for the PKI root directory where certificates are stored. When `null`, uses the OPC Foundation default location. |
| `CertificateSubject` | `string?` | `null` (defaults to `CN={ServerName}, O=ZB MOM, DC=localhost`) | Override for the server certificate subject name. When `null`, the subject is derived from the configured `ServerName`. |
### Example: Development (no security)
```json
{
"Security": {
"Profiles": ["None"],
"AutoAcceptClientCertificates": true
}
}
```
### Example: Production (encrypted only)
```json
{
"Security": {
"Profiles": ["Basic256Sha256-SignAndEncrypt"],
"AutoAcceptClientCertificates": false,
"RejectSHA1Certificates": true,
"MinimumCertificateKeySize": 2048
}
}
```
### Example: Mixed (sign and encrypt endpoints, no plaintext)
```json
{
"Security": {
"Profiles": ["Basic256Sha256-Sign", "Basic256Sha256-SignAndEncrypt"],
"AutoAcceptClientCertificates": false
}
}
```
## PKI Directory Layout
The server stores certificates in a directory-based PKI store. The default root is:
### PKI directory layout
```
%LOCALAPPDATA%\OPC Foundation\pki\
```
This can be overridden with the `PkiRootPath` setting. The directory structure is:
```
pki/
{PkiStoreRoot}/
own/ Server's own application certificate and private key
issuer/ CA certificates that issued trusted client certificates
trusted/ Explicitly trusted client (peer) certificates
rejected/ Certificates that were presented but not trusted
```
### Certificate Trust Flow
### Certificate trust flow
When a client connects using a secure profile (`Sign` or `SignAndEncrypt`), the following trust evaluation occurs:
1. The client presents its application certificate during the secure channel handshake.
2. The server checks whether the certificate exists in the `trusted/` store.
3. If found, the connection proceeds (subject to key size and SHA-1 checks).
4. If not found and `AutoAcceptClientCertificates` is `true`, the certificate is automatically copied to `trusted/` and the connection proceeds.
5. If not found and `AutoAcceptClientCertificates` is `false`, the certificate is copied to `rejected/` and the connection is refused.
6. Regardless of trust status, the certificate must meet the `MinimumCertificateKeySize` requirement and pass the SHA-1 check (if `RejectSHA1Certificates` is `true`).
3. If found, the connection proceeds.
4. If not found and `AutoAcceptUntrustedClientCertificates` is `true`, the certificate is automatically copied to `trusted/` and the connection proceeds.
5. If not found and `AutoAcceptUntrustedClientCertificates` is `false`, the certificate is copied to `rejected/` and the connection is refused.
On first startup with a secure profile, the server automatically generates a self-signed application certificate in the `own/` directory if one does not already exist.
The Admin UI `Certificates.razor` page uses `CertTrustService` (singleton reading `CertTrustOptions` for the Server's `PkiStoreRoot`) to promote rejected client certs to trusted without operators having to file-copy manually.
## Production Hardening
### Production hardening
The default settings prioritize ease of development. Before deploying to production, apply the following changes:
### 1. Disable automatic certificate acceptance
Set `AutoAcceptClientCertificates` to `false` so that only explicitly trusted client certificates are accepted:
```json
{
"Security": {
"AutoAcceptClientCertificates": false
}
}
```
After changing this setting, you must manually copy each client's application certificate (the `.der` file) into the `trusted/` directory.
### 2. Remove the None profile
Remove `None` from the `Profiles` list to prevent unencrypted connections:
```json
{
"Security": {
"Profiles": ["Aes256_Sha256_RsaPss-SignAndEncrypt"]
}
}
```
### 3. Configure LDAP authentication
Enable LDAP authentication to validate credentials against the GLAuth server. LDAP group membership controls what each user can do (read, write, alarm acknowledgment). See [Configuration Guide](Configuration.md) for the full LDAP property reference.
```json
{
"Authentication": {
"AllowAnonymous": false,
"AnonymousCanWrite": false,
"Ldap": {
"Enabled": true,
"Host": "localhost",
"Port": 3893,
"BaseDN": "dc=lmxopcua,dc=local",
"ServiceAccountDn": "cn=serviceaccount,dc=lmxopcua,dc=local",
"ServiceAccountPassword": "serviceaccount123"
}
}
}
```
While UserName tokens are always encrypted by the OPC UA stack (using the server certificate), enabling a secure transport profile adds protection against message-level tampering and data eavesdropping.
### 4. Review the rejected certificate store
Periodically inspect the `rejected/` directory. Certificates that appear here were presented by clients but were not trusted. If you recognize a legitimate client certificate, move it to the `trusted/` directory to grant access.
## X.509 Certificate Authentication
The server supports X.509 certificate-based user authentication in addition to Anonymous and UserName tokens. When any non-None security profile is configured, the server advertises `UserTokenType.Certificate` in its endpoint descriptions.
Clients can authenticate by presenting an X.509 certificate. The server extracts the Common Name (CN) from the certificate subject and assigns the `AuthenticatedUser` and `ReadOnly` roles. The authentication is logged with the certificate's CN, subject, and thumbprint.
X.509 authentication is available automatically when transport security is enabled -- no additional configuration is required.
## Audit Logging
The server generates audit log entries for security-relevant operations. All audit entries use the `AUDIT:` prefix and are written to the Serilog rolling file sink for compliance review.
Audited events:
- **Authentication success**: Logs username, assigned roles, and session ID
- **Authentication failure**: Logs username and session ID
- **X.509 authentication**: Logs certificate CN, subject, and thumbprint
- **Certificate validation**: Logs certificate subject, thumbprint, and expiry for all validation events (accepted or rejected)
- **Write access denial**: Logged by the role-based access control system when a user lacks the required role
Example audit log entries:
```
AUDIT: Authentication SUCCESS for user admin with roles [ReadOnly, WriteOperate, AlarmAck] session abc123
AUDIT: Authentication FAILED for user baduser from session def456
X509 certificate authenticated: CN=ClientApp, Subject=CN=ClientApp,O=Acme, Thumbprint=AB12CD34
```
## CLI Examples
The Client CLI supports the `-S` (or `--security`) flag to select the transport security mode when connecting. Valid values are `none`, `sign`, `encrypt`, and `signandencrypt`.
### Connect with no security
```bash
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- connect -u opc.tcp://localhost:4840/LmxOpcUa -S none
```
### Connect with signing
```bash
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- connect -u opc.tcp://localhost:4840/LmxOpcUa -S sign
```
### Connect with signing and encryption
```bash
dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- connect -u opc.tcp://localhost:4840/LmxOpcUa -S encrypt
```
### Browse with encryption and authentication
```bash
dotnet run -- browse -u opc.tcp://localhost:4840/LmxOpcUa -S encrypt -U operator -P secure-password -r -d 3
```
### Read a node with signing
```bash
dotnet run -- read -u opc.tcp://localhost:4840/LmxOpcUa -S sign -n "ns=2;s=TestMachine_001/Speed"
```
The CLI tool auto-generates its own client certificate on first use (stored under `%LOCALAPPDATA%\OpcUaCli\pki\own\`). When connecting to a server with `AutoAcceptClientCertificates` set to `false`, you must copy the CLI tool's certificate into the server's `trusted/` directory before the connection will succeed.
## Troubleshooting
### Certificate trust failure
**Symptom:** The client receives a `BadSecurityChecksFailed` or `BadCertificateUntrusted` error when connecting.
**Cause:** The server does not trust the client's certificate (or vice versa), and `AutoAcceptClientCertificates` is `false`.
**Resolution:**
1. Check the server's `rejected/` directory for the client's certificate file.
2. Copy the `.der` file from `rejected/` to `trusted/`.
3. Retry the connection.
4. If the server's own certificate is not trusted by the client, copy the server's certificate from `pki/own/certs/` to the client's trusted store.
### Endpoint mismatch
**Symptom:** The client receives a `BadSecurityModeRejected` or `BadSecurityPolicyRejected` error, or reports "No endpoint found with security mode...".
**Cause:** The client is requesting a security mode that the server does not expose. For example, the client requests `SignAndEncrypt` but the server only has `None` configured.
**Resolution:**
1. Verify the server's configured `Profiles` in `appsettings.json`.
2. Ensure the profile matching the client's requested mode is listed (e.g., add `Basic256Sha256-SignAndEncrypt` for encrypted connections).
3. Restart the server after changing the configuration.
4. Use the CLI tool to verify available endpoints:
```bash
dotnet run -- connect -u opc.tcp://localhost:4840/LmxOpcUa -S none
```
The output displays the security mode and policy of the connected endpoint.
### Server certificate not generated
**Symptom:** The server logs a warning about application certificate check failure on startup.
**Cause:** The `pki/own/` directory may not be writable, or the certificate generation failed.
**Resolution:**
1. Ensure the service account has write access to the PKI root directory.
2. Check that the `PkiRootPath` (if overridden) points to a valid, writable location.
3. Delete any corrupt certificate files in `pki/own/` and restart the server to trigger regeneration.
### SHA-1 certificate rejection
**Symptom:** A client with a valid certificate is rejected, and the server logs mention SHA-1.
**Cause:** The client's certificate was signed with SHA-1, and `RejectSHA1Certificates` is `true` (the default).
**Resolution:**
- Regenerate the client certificate using SHA-256 or stronger (recommended).
- Alternatively, set `RejectSHA1Certificates` to `false` in the server configuration (not recommended for production).
- Set `AutoAcceptUntrustedClientCertificates = false`.
- Drop `None` from the profile set.
- Use the Admin UI to promote trusted client certs rather than the auto-accept fallback.
- Periodically audit the `rejected/` directory; an unexpected entry is often a misconfigured client or a probe attempt.
---
## LDAP Authentication
## OPC UA Authentication
The server supports LDAP-based user authentication via GLAuth (or any standard LDAP server). When enabled, OPC UA `UserName` token credentials are validated by LDAP bind. LDAP group membership is resolved once during authentication and mapped to custom OPC UA role `NodeId`s in the `urn:zbmom:lmxopcua:roles` namespace. These role NodeIds are stored on the session's `RoleBasedIdentity.GrantedRoleIds` and checked directly during write and alarm-ack operations.
The Server accepts three OPC UA identity-token types:
### Architecture
| Token | Handler | Notes |
|---|---|---|
| Anonymous | `IUserAuthenticator.AuthenticateAsync(username: "", password: "")` | Refused in strict mode unless explicit anonymous grants exist; allowed in lax mode for backward compatibility. |
| UserName/Password | `LdapUserAuthenticator` (`src/ZB.MOM.WW.OtOpcUa.Server/Security/LdapUserAuthenticator.cs`) | LDAP bind + group lookup; resolved `LdapGroups` flow into the session's identity bearer (`ILdapGroupsBearer`). |
| X.509 Certificate | Stack-level acceptance + role mapping via CN | X.509 identity carries `AuthenticatedUser` + read roles; finer-grain authorization happens through the data-plane ACLs. |
```
OPC UA Client → UserName Token → LmxOpcUa Server → LDAP Bind (validate credentials)
→ LDAP Search (resolve group membership)
→ Map groups to OPC UA role NodeIds
→ Store on RoleBasedIdentity.GrantedRoleIds
→ Permission checks via GrantedRoleIds.Contains()
### LDAP bind flow (`LdapUserAuthenticator`)
`Program.cs` in the Server registers the authenticator based on `OpcUaServer:Ldap`:
```csharp
builder.Services.AddSingleton<IUserAuthenticator>(sp => ldapOptions.Enabled
? new LdapUserAuthenticator(ldapOptions, sp.GetRequiredService<ILogger<LdapUserAuthenticator>>())
: new DenyAllUserAuthenticator());
```
### LDAP Groups and OPC UA Permissions
`LdapUserAuthenticator`:
All authenticated LDAP users can browse and read nodes regardless of group membership. Groups grant additional permissions:
1. Refuses to bind over plain-LDAP unless `AllowInsecureLdap = true` (dev/test only).
2. Connects to `Server:Port`, optionally upgrades to TLS (`UseTls = true`, port 636 for AD).
3. Binds as the service account; searches `SearchBase` for `UserNameAttribute = username`.
4. Rebinds as the resolved user DN with the supplied password (the actual credential check).
5. Reads `GroupAttribute` (default `memberOf`) and strips the leading `CN=` so operators configure friendly group names in `GroupToRole`.
6. Returns a `UserAuthResult` carrying the validated username + the set of LDAP groups. The set flows through to the session identity via `ILdapGroupsBearer.LdapGroups`.
| LDAP Group | Permission |
|---|---|
| ReadOnly | No additional permissions (read-only access) |
| WriteOperate | Write FreeAccess and Operate attributes |
| WriteTune | Write Tune attributes |
| WriteConfigure | Write Configure attributes |
| AlarmAck | Acknowledge alarms |
Users can belong to multiple groups. The `admin` user in the default GLAuth configuration belongs to all groups.
### Effective Permission Matrix
The effective permission for a write operation depends on two factors: the user's session role (from LDAP group membership or anonymous access) and the Galaxy attribute's security classification. The security classification controls the node's `AccessLevel` — attributes classified as `SecuredWrite`, `VerifiedWrite`, or `ViewOnly` are exposed as read-only nodes regardless of the user's role. For writable classifications, the required write role depends on the classification.
| | FreeAccess | Operate | SecuredWrite | VerifiedWrite | Tune | Configure | ViewOnly |
|---|---|---|---|---|---|---|---|
| **Anonymous (`AnonymousCanWrite=true`)** | Write | Write | Read | Read | Write | Write | Read |
| **Anonymous (`AnonymousCanWrite=false`)** | Read | Read | Read | Read | Read | Read | Read |
| **ReadOnly** | Read | Read | Read | Read | Read | Read | Read |
| **WriteOperate** | Write | Write | Read | Read | Read | Read | Read |
| **WriteTune** | Read | Read | Read | Read | Write | Read | Read |
| **WriteConfigure** | Read | Read | Read | Read | Read | Write | Read |
| **AlarmAck** (only) | Read | Read | Read | Read | Read | Read | Read |
| **Admin** (all groups) | Write | Write | Read | Read | Write | Write | Read |
All roles can browse and read all nodes. The "Read" entries above mean the node is either read-only by classification or the user lacks the required write role. "Write" means the write is permitted by both the node's classification and the user's role.
Alarm acknowledgment is an independent permission controlled by the `AlarmAck` role and is not affected by security classification.
### GLAuth Setup
The project uses [GLAuth](https://github.com/glauth/glauth) v2.4.0 as the LDAP server, installed at `C:\publish\glauth\`. See `C:\publish\glauth\auth.md` for the complete user/group reference and service management commands.
### Configuration
Enable LDAP in `appsettings.json` under `Authentication.Ldap`. See [Configuration Guide](Configuration.md) for the full property reference.
### Active Directory configuration
Production deployments typically point at Active Directory instead of GLAuth. Only four properties differ from the dev defaults: `Server`, `Port`, `UserNameAttribute`, and `ServiceAccountDn`. The same `GroupToRole` mechanism works — map your AD security groups to OPC UA roles.
Configuration example (Active Directory production):
```json
{
@@ -362,32 +129,169 @@ Production deployments typically point at Active Directory instead of GLAuth. On
"UseTls": true,
"AllowInsecureLdap": false,
"SearchBase": "DC=corp,DC=example,DC=com",
"ServiceAccountDn": "CN=OpcUaSvc,OU=Service Accounts,DC=corp,DC=example,DC=com",
"ServiceAccountDn": "CN=OtOpcUaSvc,OU=Service Accounts,DC=corp,DC=example,DC=com",
"ServiceAccountPassword": "<from your secret store>",
"DisplayNameAttribute": "displayName",
"GroupAttribute": "memberOf",
"UserNameAttribute": "sAMAccountName",
"GroupToRole": {
"OPCUA-Operators": "WriteOperate",
"OPCUA-Engineers": "WriteConfigure",
"OPCUA-AlarmAck": "AlarmAck",
"OPCUA-Tuners": "WriteTune"
"OPCUA-Tuners": "WriteTune",
"OPCUA-AlarmAck": "AlarmAck"
}
}
}
}
```
Notes:
`UserNameAttribute: "sAMAccountName"` is the critical AD override — the default `uid` is not populated on AD user entries. Use `userPrincipalName` instead if operators log in with `user@corp.example.com` form. Nested group membership is not expanded — assign users directly to the role-mapped groups, or pre-flatten in AD.
- `UserNameAttribute: "sAMAccountName"` is the critical AD override — the default `uid` is not populated on AD user entries, so the user-DN lookup returns no results without it. Use `userPrincipalName` instead if operators log in with `user@corp.example.com` form.
- `Port: 636` + `UseTls: true` is required under AD's LDAP-signing enforcement. AD increasingly rejects plain-LDAP bind; set `AllowInsecureLdap: false` to refuse fallback.
- `ServiceAccountDn` should name a dedicated read-only service principal — not a privileged admin. The account needs read access to user and group entries in the search base.
- `memberOf` values come back as full DNs like `CN=OPCUA-Operators,OU=OPC UA Security Groups,OU=Groups,DC=corp,DC=example,DC=com`. The authenticator strips the leading `CN=` RDN value so operators configure `GroupToRole` with readable group common-names.
- Nested group membership is **not** expanded — assign users directly to the role-mapped groups, or pre-flatten membership in AD. `LDAP_MATCHING_RULE_IN_CHAIN` / `tokenGroups` expansion is an authenticator enhancement, not a config change.
The same options bind the Admin's `LdapAuthService` (cookie auth / login form) so operators authenticate with a single credential across both processes.
### Security Considerations
---
- LDAP credentials are transmitted in plaintext over the OPC UA channel unless transport security is enabled. Use `Basic256Sha256-SignAndEncrypt` for production deployments.
- The GLAuth LDAP server itself listens on plain LDAP (port 3893). Enable LDAPS in `glauth.cfg` for environments where LDAP traffic crosses network boundaries.
- The service account password is stored in `appsettings.json`. Protect this file with appropriate filesystem permissions.
## Data-Plane Authorization
Data-plane authorization is the check run on every OPC UA operation against an OtOpcUa endpoint: *can this authenticated user Browse / Read / Subscribe / Write / HistoryRead / AckAlarm / Call on this specific node?*
Per decision #129 the model is **additive-only — no explicit Deny**. Grants at each hierarchy level union; absence of a grant is the default-deny.
### Hierarchy
ACLs are evaluated against the UNS path:
```
ClusterId → Namespace → UnsArea → UnsLine → Equipment → Tag
```
Each level can carry `NodeAcl` rows (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/NodeAcl.cs`) that grant a permission bundle to a set of `LdapGroups`.
### Permission flags
```csharp
[Flags]
public enum NodePermissions : uint
{
Browse = 1 << 0,
Read = 1 << 1,
Subscribe = 1 << 2,
HistoryRead = 1 << 3,
WriteOperate = 1 << 4,
WriteTune = 1 << 5,
WriteConfigure = 1 << 6,
AlarmRead = 1 << 7,
AlarmAcknowledge = 1 << 8,
AlarmConfirm = 1 << 9,
AlarmShelve = 1 << 10,
MethodCall = 1 << 11,
ReadOnly = Browse | Read | Subscribe | HistoryRead | AlarmRead,
Operator = ReadOnly | WriteOperate | AlarmAcknowledge | AlarmConfirm,
Engineer = Operator | WriteTune | AlarmShelve,
Admin = Engineer | WriteConfigure | MethodCall,
}
```
The three Write tiers map to Galaxy's v1 `SecurityClassification``FreeAccess`/`Operate``WriteOperate`, `Tune``WriteTune`, `Configure``WriteConfigure`. `SecuredWrite` / `VerifiedWrite` / `ViewOnly` classifications remain read-only from OPC UA regardless of grant.
### Evaluator — `PermissionTrie`
`src/ZB.MOM.WW.OtOpcUa.Core/Authorization/`:
| Class | Role |
|---|---|
| `PermissionTrie` | Cluster-scoped trie; each node carries `(GroupId → NodePermissions)` grants. |
| `PermissionTrieBuilder` | Builds a trie from the current `NodeAcl` rows in one pass. |
| `PermissionTrieCache` | Per-cluster memoised trie; invalidated via `AclChangeNotifier` when the Admin publishes a draft that touches ACLs. |
| `TriePermissionEvaluator` | Implements `IPermissionEvaluator.Authorize(session, operation, scope)` — walks from the root to the leaf for the supplied `NodeScope`, unions grants along the path, compares required permission to the union. |
`NodeScope` carries `(ClusterId, NamespaceId, AreaId, LineId, EquipmentId, TagId)`; any suffix may be null — a tag-level ACL is more specific than an area-level ACL but both contribute via union.
### Dispatch gate — `AuthorizationGate`
`src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs` bridges the OPC UA stack's `ISystemContext.UserIdentity` to the evaluator. `DriverNodeManager` holds exactly one reference to it and calls `IsAllowed(identity, OpcUaOperation.*, NodeScope)` on every Read, Write, HistoryRead, Browse, Subscribe, AckAlarm, Call path. A false return short-circuits the dispatch with `BadUserAccessDenied`.
Key properties:
- **Driver-agnostic.** No driver-level code participates in authorization decisions. Drivers report `SecurityClassification` as metadata on tag discovery; everything else flows through `AuthorizationGate`.
- **Fail-open-during-transition.** `StrictMode = false` (default during ACL rollouts) lets sessions without resolved LDAP groups proceed; flip `Authorization:StrictMode = true` in production once ACLs are populated.
- **Evaluator stays pure.** `TriePermissionEvaluator` has no OPC UA stack dependency — it's tested directly from xUnit.
### Probe-this-permission (Admin UI)
`PermissionProbeService` (`src/ZB.MOM.WW.OtOpcUa.Admin/Services/PermissionProbeService.cs`) lets an operator ask "if a user with groups X, Y, Z asked to do operation O on node N, would it succeed?" The answer is rendered in the AclsTab "Probe" dialog — same evaluator, same trie, so the Admin UI answer and the live Server answer cannot disagree.
### Full model
See [`docs/v2/acl-design.md`](v2/acl-design.md) for the complete design: trie invalidation, flag semantics, per-path override rules, and the reasoning behind additive-only (no Deny).
---
## Control-Plane Authorization
Control-plane authorization governs **the Admin UI** — who can view fleet config, edit drafts, publish generations, manage cluster nodes + credentials.
Per decision #150 control-plane roles are **deliberately independent of data-plane ACLs**. An operator who can read every OPC UA tag in production may not be allowed to edit cluster config; conversely a ConfigEditor may not have any data-plane grants at all.
### Roles
`src/ZB.MOM.WW.OtOpcUa.Admin/Services/AdminRoles.cs`:
| Role | Capabilities |
|---|---|
| `ConfigViewer` | Read-only access to drafts, generations, audit log, fleet status. |
| `ConfigEditor` | ConfigViewer plus draft editing (UNS, equipment, tags, ACLs, driver instances, reservations, CSV imports). Cannot publish. |
| `FleetAdmin` | ConfigEditor plus publish, cluster/node CRUD, credential management, role-grant management. |
Policies registered in Admin `Program.cs`:
```csharp
builder.Services.AddAuthorizationBuilder()
.AddPolicy("CanEdit", p => p.RequireRole(AdminRoles.ConfigEditor, AdminRoles.FleetAdmin))
.AddPolicy("CanPublish", p => p.RequireRole(AdminRoles.FleetAdmin));
```
Razor pages and API endpoints gate with `[Authorize(Policy = "CanEdit")]` / `"CanPublish"`; nav-menu sections hide via `<AuthorizeView>`.
### Role grant source
Admin reads `LdapGroupRoleMapping` rows from the Config DB (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/LdapGroupRoleMapping.cs`) — the same pattern as the data-plane `NodeAcl` but scoped to Admin roles + (optionally) cluster scope for multi-site fleets. The `RoleGrants.razor` page lets FleetAdmins edit these mappings without leaving the UI.
---
## OTOPCUA0001 Analyzer — Compile-Time Guard
Per-capability resilience (retry, timeout, circuit-breaker, bulkhead) is applied by `CapabilityInvoker` in `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/`. A driver-capability call made **outside** the invoker bypasses resilience entirely — which in production looks like inconsistent timeouts, un-wrapped retries, and unbounded blocking.
`OTOPCUA0001` (Roslyn analyzer at `src/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs`) fires as a compile-time **warning** when an `async`/`Task`-returning method on one of the seven guarded capability interfaces (`IReadable`, `IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`, `IAlarmSource`, `IHistoryProvider`) is invoked **outside** a lambda passed to `CapabilityInvoker.ExecuteAsync` / `ExecuteWriteAsync` / `AlarmSurfaceInvoker.*`. The analyzer walks up the syntax tree from the call site, finds any enclosing invoker invocation, and verifies the call lives transitively inside that invocation's anonymous-function argument — a sibling pattern (do the call, then invoke `ExecuteAsync` on something unrelated nearby) does not satisfy the rule.
Five xUnit-v3 + Shouldly tests at `tests/ZB.MOM.WW.OtOpcUa.Analyzers.Tests` cover the common fail/pass shapes + the sibling-pattern regression guard.
The rule is intentionally scoped to async surfaces — pure in-memory accessors like `IHostConnectivityProbe.GetHostStatuses()` return synchronously and do not require the invoker wrap.
---
## Audit Logging
- **Server**: Serilog `AUDIT:` prefix on every authentication success/failure, certificate validation result, write access denial. Written alongside the regular rolling file sink.
- **Admin**: `AuditLogService` writes `ConfigAuditLog` rows to the Config DB for every publish, rollback, cluster-node CRUD, credential rotation. Visible in the Audit page for operators with `ConfigViewer` or above.
---
## Troubleshooting
### Certificate trust failure
Check `{PkiStoreRoot}/rejected/` for the client's cert. Promote via Admin UI Certificates page, or copy the `.der` file manually to `trusted/`.
### LDAP users can connect but fail authorization
Verify (a) `OpcUaServer:Ldap:GroupAttribute` returns groups in the form `CN=MyGroup,…` (OtOpcUa strips the `CN=` for matching), (b) a `NodeAcl` grant exists at any level of the node's UNS path that unions to the required permission, (c) `Authorization:StrictMode` is correctly set for the deployment stage.
### LDAP bind rejected as "insecure"
Set `UseTls = true` + `Port = 636`, or temporarily flip `AllowInsecureLdap = true` in dev. Production Active Directory increasingly refuses plain-LDAP bind under LDAP-signing enforcement.
### `AuthorizationGate` denies every call after a publish
`AclChangeNotifier` invalidates the `PermissionTrieCache` on publish; a stuck cache is usually a missed notification. Restart the Server as a quick mitigation and file a bug — the design is to stay fresh without restarts.

View File

@@ -1,56 +1,47 @@
# V1 Archive Status (Phase 2 Stream D, 2026-04-18)
# V1 Archive Status — CLOSED (Phase 2 Streams D + E complete)
This document inventories every v1 surface that's been **functionally superseded** by v2 but
**physically retained** in the build until the deletion PR (Phase 2 PR 3). Rationale: cascading
references mean a single deletion is high blast-radius; archive-marking lets the v2 stack ship
on its own merits while the v1 surface stays as parity reference.
> **Status as of 2026-04-18: the v1 archive has been fully removed from the tree.**
> This document is retained as historical record of the Phase 2 Stream D / E closure.
## Archived projects
## Final state
| Path | Status | Replaced by | Build behavior |
|---|---|---|---|
| `src/ZB.MOM.WW.OtOpcUa.Host/` | Archive (executable in build) | `OtOpcUa.Server` + `Driver.Galaxy.Host` + `Driver.Galaxy.Proxy` | Builds; not deployed by v2 install scripts |
| `src/ZB.MOM.WW.OtOpcUa.Historian.Aveva/` | Archive (plugin in build) | TODO: port into `Driver.Galaxy.Host/Backend/Historian/` (Task B.1.h follow-up) | Builds; loaded only by archived Host |
| `tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive/` | Archive | `Driver.Galaxy.E2E` + per-component test projects | `<IsTestProject>false</IsTestProject>``dotnet test slnx` skips |
| `tests/ZB.MOM.WW.OtOpcUa.IntegrationTests/` | Archive | `Driver.Galaxy.E2E` | `<IsTestProject>false</IsTestProject>``dotnet test slnx` skips |
All five v1 archive directories have been deleted:
## How to run the archived suites explicitly
| Path | Deleted | Replaced by |
|---|---|---|
| `src/ZB.MOM.WW.OtOpcUa.Host/` | ✅ | `OtOpcUa.Server` + `Driver.Galaxy.Host` + `Driver.Galaxy.Proxy` |
| `src/ZB.MOM.WW.OtOpcUa.Historian.Aveva/` | ✅ | `Driver.Galaxy.Host/Backend/Historian/` (ported in Phase 3 PRs 51-55) |
| `tests/ZB.MOM.WW.OtOpcUa.Historian.Aveva.Tests/` | ✅ | `Driver.Galaxy.Host.Tests/Historian/` |
| `tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive/` | ✅ | Per-component `*.Tests` projects + `Driver.Galaxy.E2E` |
| `tests/ZB.MOM.WW.OtOpcUa.IntegrationTests/` | ✅ | `Driver.Galaxy.E2E` + `Driver.Modbus.IntegrationTests` |
```powershell
# v1 unit tests (494):
dotnet test tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive
## Closure timeline
# v1 integration tests (6):
dotnet test tests/ZB.MOM.WW.OtOpcUa.IntegrationTests
```
- **PR 2 (2026-04-18, phase-2-stream-d)** — archive-marked the four v1 projects with
`<IsTestProject>false</IsTestProject>` so solution builds and `dotnet test slnx` bypassed
them. Capture: `docs/v2/implementation/exit-gate-phase-2-final.md`.
- **Phase 3 PR 18 (2026-04-18)** — deleted the archived project source trees. Leftover
`bin/` and `obj/` residue remained on disk from pre-deletion builds.
- **Phase 2 PR 61 (2026-04-18, this closure PR)** — scrubbed the empty residue directories
and confirmed `dotnet build ZB.MOM.WW.OtOpcUa.slnx` clean with 0 errors.
Both still pass on this dev box — they're the parity reference for Phase 2 PR 3's deletion
decision.
## Parity validation (Stream E)
## Deletion plan (Phase 2 PR 3)
The original 494 v1 tests + 6 v1 integration tests are **not** preserved in the v2 branch.
Their parity-bar role is now filled by:
Pre-conditions:
- [ ] `Driver.Galaxy.E2E` test count covers the v1 IntegrationTests' 6 integration scenarios
at minimum (currently 7 tests; expand as needed)
- [ ] `Driver.Galaxy.Host/Backend/Historian/` ports the Wonderware Historian plugin
so `MxAccessGalaxyBackend.HistoryReadAsync` returns real data (Task B.1.h)
- [ ] Operator review on a separate PR — destructive change
Steps:
1. `git rm -r src/ZB.MOM.WW.OtOpcUa.Host/`
2. `git rm -r src/ZB.MOM.WW.OtOpcUa.Historian.Aveva/`
(or move it under Driver.Galaxy.Host first if the lift is part of the same PR)
3. `git rm -r tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive/`
4. `git rm -r tests/ZB.MOM.WW.OtOpcUa.IntegrationTests/`
5. Edit `ZB.MOM.WW.OtOpcUa.slnx` — remove the four project lines
6. `dotnet build ZB.MOM.WW.OtOpcUa.slnx` → confirm clean
7. `dotnet test ZB.MOM.WW.OtOpcUa.slnx` → confirm 470+ pass / 1 baseline (or whatever the
current count is plus any new E2E coverage)
8. Commit: "Phase 2 Stream D — delete v1 archive (Host + Historian.Aveva + v1Tests + IntegrationTests)"
9. PR 3 against `v2`, link this doc + exit-gate-phase-2-final.md
10. One reviewer signoff
- `Driver.Galaxy.E2E` — cross-FX subprocess parity (spawns the net48 x86 Galaxy.Host.exe
+ connects via real named pipe, exercises every `IDriver` capability through the
supervisor). Stability-findings regression tests (4 × 2026-04-13 findings) live here.
- Per-component `*.Tests` projects — cover the code that moved out of the monolith into
discrete v2 projects. Running `dotnet test ZB.MOM.WW.OtOpcUa.slnx` executes all of them
as one solution-level gate.
- `Driver.Modbus.IntegrationTests` — adds Modbus TCP driver coverage that didn't exist in
v1 (DL205, S7-1500, Mitsubishi MELSEC via pymodbus sim profiles — PRs 30, 56-60).
- Live-stack smoke tests (`Driver.Galaxy.E2E/LiveStack/`) — optional, gated on presence
of the `OtOpcUaGalaxyHost` service + Galaxy repository on the dev box (PRs 33, 36, 37).
## Rollback
If Phase 2 PR 3 surfaces downstream consumer regressions, `git revert` the deletion commit
restores the four projects intact. The v2 stack continues to ship from the v2 branch.
`git revert` of the deletion commits restores the projects intact. The v2 stack continues
to ship from the `v2` branch regardless.

View File

@@ -0,0 +1,151 @@
# Phase 6.1 — Resilience & Observability Runtime
> **Status**: **SHIPPED** 2026-04-19 — Streams A/B/C/D + E data layer merged to `v2` across PRs #78-82. Final exit-gate PR #83 turns the compliance script into real checks (all pass) and records this status update. One deferred piece: Stream E.2/E.3 SignalR hub + Blazor `/hosts` column refresh lands in a visual-compliance follow-up PR on the Phase 6.4 Admin UI branch.
>
> Baseline: 906 solution tests → post-Phase-6.1: 1042 passing (+136 net). One pre-existing Client.CLI Subscribe flake unchanged.
>
> **Branch**: `v2/phase-6-1-resilience-observability`
> **Estimated duration**: 3 weeks
> **Predecessor**: Phase 5 (drivers) — partial; S7 + OPC UA Client shipped, AB/TwinCAT/FOCAS paused
> **Successor**: Phase 6.2 (Authorization runtime)
## Phase Objective
Land the cross-cutting runtime protections + operability features that `plan.md` + `driver-stability.md` specify by decision but that no driver-phase actually wires. End-state: every driver goes through the same Polly resilience layer, health endpoints render the live driver fleet, structured logs carry per-request correlation IDs, and the config substrate survives a central DB outage via a LiteDB local cache.
Closes these gaps flagged in the 2026-04-19 audit:
1. Polly v8 resilience pipelines wired to every `IDriver` capability (no-op per-driver today; Galaxy has a hand-rolled `CircuitBreaker` only).
2. Tier A/B/C enforcement at runtime — `driver-stability.md` §24 and decisions #6373 define memory watchdog, bounded queues, scheduled recycle, wedge detection; `MemoryWatchdog` exists only inside `Driver.Galaxy.Host`.
3. Health endpoints (`/healthz`, `/readyz`) on `OtOpcUa.Server`.
4. Structured Serilog with per-request correlation IDs (driver instance, OPC UA session, IPC call).
5. LiteDB local cache + Polly retry + fallback on central-DB outage (decision #36).
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `Core` → new `Core.Resilience` sub-namespace | Shared Polly pipeline builder (`DriverResiliencePipelines`). **Pipeline key = `(DriverInstanceId, HostName)`** so one dead PLC behind a multi-device driver doesn't open the breaker for healthy siblings (decision #35 per-device isolation). **Per-capability policy** — Read / HistoryRead / Discover / Probe / Alarm get retries; **Write does NOT** unless `[WriteIdempotent]` on the tag definition (decisions #44-45). |
| Every capability-interface consumer in the server | Wrap `IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync/UnsubscribeAsync`, `IHostConnectivityProbe` probe loop, `IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync`, `IHistoryProvider.ReadRawAsync/ReadProcessedAsync/ReadAtTimeAsync/ReadEventsAsync`. Composition: timeout → (retry when capability supports) → circuit breaker → bulkhead. |
| `Core.Abstractions` → new `WriteIdempotentAttribute` | Marker on `ModbusTagDefinition` / `S7TagDefinition` / `OpcUaClientDriver` tag rows; opts that tag into auto-retry on Write. Absence = no retry, per spec. |
| `Core` → new `Core.Stability` sub-namespace — **split** | Two separate subsystems: (a) **`MemoryTracking`** runs all tiers; captures baseline (median of first 5 min `GetMemoryFootprint` samples) + applies the hybrid rule `soft = max(multiplier × baseline, baseline + floor)`; soft breach logs + surfaces to Admin; never kills. (b) **`MemoryRecycle`** (Tier C only — requires out-of-process topology) handles hard-breach recycle via the Proxy-side supervisor. Tier A/B overrun escalates to Tier C promotion ticket, not auto-kill. |
| `ScheduledRecycleScheduler` | Tier C only per decisions #73-74. Weekly/time-of-day recycle via Proxy supervisor. Tier A/B opt-in recycle lands in a future phase together with a Tier-C-escalation workflow. |
| `WedgeDetector` | **Demand-aware**: flips a driver to Faulted only when `(hasPendingWork AND noProgressIn > threshold)`. `hasPendingWork` derives from non-zero Polly bulkhead depth OR ≥1 active MonitoredItem OR ≥1 queued historian read. Idle + subscription-only drivers stay Healthy. |
| `DriverTypeRegistry` | Each driver type registers its `DriverTier` {A, B, C}. Tier C drivers must advertise their out-of-process topology; the registry enforces invariants (Tier C has a `Proxy` + `Host` pair). |
| `Driver.Galaxy.Proxy/Supervisor/` | **Retains** existing `CircuitBreaker` + `Backoff` — they guard IPC respawn (decision #68), different concern from the per-call Polly layer. Only `HeartbeatMonitor` is referenced downstream (IPC liveness). |
| `OtOpcUa.Server` → Minimal API endpoints on `http://+:4841` | `/healthz` = process alive + (config DB reachable OR `UsingStaleConfig=true`). `/readyz` = ANDed driver health; state-machine per `DriverState`: `Unknown`/`Initializing` → 503, `Healthy` → 200, `Degraded` → 200 + `{degradedDrivers: [...]}` in body, `Faulted` → 503. JSON body always reports per-instance detail. |
| Serilog configuration | Centralize enrichers in `OtOpcUa.Server/Observability/LogContextEnricher.cs`. Every capability call runs inside a `LogContext.PushProperty` scope with {DriverInstanceId, DriverType, CapabilityName, CorrelationId (UA RequestHandle or internal GUID)}. Sink config stays rolling-file per CLAUDE.md; JSON sink added alongside plain-text (switchable via `Serilog:WriteJson` appsetting). |
| `Configuration` project | Add `LiteDbConfigCache` adapter. **Generation-sealed snapshots**: `sp_PublishGeneration` writes `<cache-root>/<cluster>/<generationId>.db` as a read-only sealed file. Reads serve the last-known-sealed generation; mixed-generation reads are impossible. Write path bypasses cache + fails hard on DB outage. Pipeline: timeout (2 s) → retry (3×, jittered) → fallback-to-sealed-snapshot. |
| `DriverHostStatus` vs. `DriverInstanceResilienceStatus` | New separate entity `DriverInstanceResilienceStatus { DriverInstanceId, HostName, LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth, LastRecycleUtc, BaselineFootprintBytes }`. `DriverHostStatus` keeps per-host connectivity only; Admin `/hosts` joins both for display. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| Driver wire protocols | Resilience is a server-side wrapper; individual drivers don't see Polly. Their existing retry logic (ModbusTcpTransport reconnect, SessionReconnectHandler) stays in place as inner layers. |
| Config DB schema | LiteDB cache is a read-only mirror; no new central tables except `DriverHostStatus` column additions. |
| OPC UA wire behavior visible to clients | Health endpoints live on a separate HTTP port (4841 by convention); the OPC UA server on 4840 is unaffected. |
| The four 2026-04-13 Galaxy stability findings | Already closed in Phase 2. Phase 6.1 *generalises* the pattern, doesn't re-fix Galaxy. |
| Driver-layer SafeHandle usage | Existing Galaxy `SafeMxAccessHandle` + Modbus `TcpClient` disposal stay — they're driver-internal, not part of the cross-cutting layer. |
## Entry Gate Checklist
- [ ] Phases 05 exit gates cleared (or explicitly deferred with task reference)
- [ ] `driver-stability.md` §24 re-read; decisions #6373 + #3436 re-skimmed
- [ ] Polly v8 NuGet available (`Microsoft.Extensions.Resilience` + `Polly.Core`) — verify package restore before task breakdown
- [ ] LiteDB 5.x NuGet confirmed MIT + actively maintained
- [ ] Existing drivers catalogued: Galaxy.Proxy, Modbus, S7, OpcUaClient — confirm test counts baseline so the resilience layer doesn't regress any
- [ ] Serilog configuration inventory: locate every `Log.ForContext` call site that will need `LogContext` rewrap
- [ ] Admin `/hosts` page's current `DriverHostStatus` consumption reviewed so the schema extensions don't break it
## Task Breakdown
### Stream A — Resilience layer (1 week)
1. **A.1** Add `Polly.Core` + `Microsoft.Extensions.Resilience` to `Core`. Build `DriverResiliencePipelineBuilder` — key on `(DriverInstanceId, HostName)`; composes Timeout → (Retry when the capability allows it; skipped for Write unless `[WriteIdempotent]`) → CircuitBreaker → Bulkhead. Per-capability policy map documented in `DriverResilienceOptions.CapabilityPolicies`.
2. **A.2** `DriverResilienceOptions` record bound from `DriverInstance.ResilienceConfig` JSON column (new nullable). **Per-tier × per-capability** defaults: Tier A (OpcUaClient, S7) Read 3 retries/2 s/5-failure-breaker, Write 0 retries/2 s/5-failure-breaker; Tier B (Modbus) Read 3/4 s/5, Write 0/4 s/5; Tier C (Galaxy) Read 1 retry/10 s/no-kill, Write 0/10 s/no-kill. Idempotent writes can opt into Read-shaped retry via the attribute.
3. **A.3** `CapabilityInvoker<TCapability, TResult>` wraps every method on the capability interfaces (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ITagDiscovery.DiscoverAsync`, `ISubscribable.SubscribeAsync/UnsubscribeAsync`, `IHostConnectivityProbe` probe loop, `IAlarmSource.SubscribeAlarmsAsync/AcknowledgeAsync`, `IHistoryProvider.ReadRawAsync/ReadProcessedAsync/ReadAtTimeAsync/ReadEventsAsync`). Existing server-side dispatch routes through it.
4. **A.4** **Retain** `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard IPC process respawn (decision #68), orthogonal to the per-call Polly layer. Only `HeartbeatMonitor` is consumed outside the supervisor.
5. **A.5** Unit tests: per-policy, per-composition. Negative integration tests: (a) Modbus FlakeyTransport fails 5× on Read, succeeds 6th — invoker surfaces success; (b) Modbus FlakeyTransport fails 1× on Write with `[WriteIdempotent]=false` — invoker surfaces failure without retry (no duplicate pulse); (c) Modbus FlakeyTransport fails 1× on Write with `[WriteIdempotent]=true` — invoker retries. Bench: no-op overhead < 1%.
6. **A.6** `WriteIdempotentAttribute` in `Core.Abstractions`. Modbus/S7/OpcUaClient tag-definition records pick it up; invoker reads via reflection once at driver init.
### Stream B — Tier A/B/C stability runtime — split into MemoryTracking + MemoryRecycle (1 week)
1. **B.1** `Core.Abstractions``DriverTier` enum {A, B, C}. Extend `DriverTypeRegistry` to require `DriverTier` at registration. Existing driver types stamped (Galaxy = C, Modbus = B, S7 = B, OpcUaClient = A).
2. **B.2** **`MemoryTracking`** (all tiers) lifted from `Driver.Galaxy.Host/MemoryWatchdog.cs`. Captures `BaselineFootprintBytes` as the median of first 5 min of `IDriver.GetMemoryFootprint()` samples post-`InitializeAsync`. Applies **decision #70 hybrid formula**: `soft = max(multiplier × baseline, baseline + floor)`; Tier A multiplier=3, floor=50 MB; Tier B multiplier=3, floor=100 MB; Tier C multiplier=2, floor=500 MB. Soft breach → log + `DriverInstanceResilienceStatus.CurrentFootprint` tick; never kills. Hard = 2 × soft.
3. **B.3** **`MemoryRecycle`** (Tier C only per decisions #73-74). Hard-breach on a Tier C driver triggers `ScheduledRecycleScheduler.RequestRecycleNow(driverInstanceId)`; scheduler proxies to `Driver.Galaxy.Proxy/Supervisor/` which restarts the Host process. Tier A/B hard-breach logs a promotion-to-Tier-C recommendation; **never auto-kills** the in-process driver.
4. **B.4** **`ScheduledRecycleScheduler`** per decision #67: Tier C driver instances opt-in to a weekly recycle at a configured cron. Tier A/B scheduled recycle deferred to a later phase paired with Tier-C escalation.
5. **B.5** **`WedgeDetector`** demand-aware: `if (state==Healthy && hasPendingWork && noProgressIn > WedgeThreshold) → force ReinitializeAsync`. `hasPendingWork` = (bulkhead depth > 0) OR (active monitored items > 0) OR (queued historian-read count > 0). `WedgeThreshold` default 5 × PublishingInterval, min 60 s. Idle driver stays Healthy.
6. **B.6** Tests: tracking unit tests drive synthetic allocation against a fake `GetMemoryFootprint`; recycle tests use a mock supervisor; wedge tests include the false-fault cases — idle subscriber, slow historian backfill, write-only burst.
### Stream C — Health endpoints + structured logging (4 days)
1. **C.1** `OtOpcUa.Server/Observability/HealthEndpoints.cs` — Minimal API on a second Kestrel binding (default `http://+:4841`). `/healthz` reports process uptime + config-DB reachability (or cache-warm). `/readyz` enumerates `DriverInstance` rows + reports each driver's `DriverHealth.State`; returns 503 if ANY driver is Faulted. JSON body per `docs/v2/acl-design.md` §"Operator Dashboards" shape.
2. **C.2** `LogContextEnricher` installed at Serilog config time. Every driver-capability call site wraps its body in `using (LogContext.PushProperty("DriverInstanceId", id)) using (LogContext.PushProperty("CorrelationId", correlationId))`. Correlation IDs: reuse OPC UA `RequestHeader.RequestHandle` when in-flight; otherwise generate `Guid.NewGuid().ToString("N")[..12]`.
3. **C.3** Add JSON-formatted Serilog sink alongside the existing rolling-file plain-text sink so SIEMs (Splunk, Datadog) can ingest without a regex parser. Sink switchable via `Serilog:WriteJson` appsetting.
4. **C.4** Integration test: boot server, issue Modbus read, assert log line contains `DriverInstanceId` + `CorrelationId` structured fields.
### Stream D — Config DB LiteDB fallback — generation-sealed snapshots (1 week)
1. **D.1** `LiteDbConfigCache` adapter backed by **sealed generation snapshots**: each successful `sp_PublishGeneration` writes `<cache-root>/<clusterId>/<generationId>.db` as read-only after commit. The adapter maintains a `CurrentSealedGenerationId` pointer updated atomically on successful publish. Mixed-generation reads are **impossible** — every read served from the cache serves one coherent sealed generation.
2. **D.2** Write-path queries (draft save, publish) bypass the cache entirely and fail hard on DB outage. Read-path queries (DriverInstance enumeration, LdapGroupRoleMapping, cluster + namespace metadata) go through the pipeline: timeout 2 s → retry 3× jittered → fallback to the current sealed snapshot.
3. **D.3** `UsingStaleConfig` flag flips true when a read fell back to the sealed snapshot; cleared on the next successful DB round-trip. Surfaced on `/healthz` body and Admin `/hosts`.
4. **D.4** Tests: (a) SQL-container kill mid-operation — read returns sealed snapshot, `UsingStaleConfig=true`, driver stays Healthy; (b) mixed-generation guard — attempt to serve partial generation by corrupting a snapshot file mid-read → adapter fails closed rather than serving mixed data; (c) first-boot-no-snapshot case — adapter refuses to start, driver fails `InitializeAsync` with a clear config-DB-required error.
### Stream E — Admin `/hosts` page refresh (3 days)
1. **E.1** Extend `DriverHostStatus` schema with Stream A resilience columns. Generate EF migration.
2. **E.2** `Admin/FleetStatusHub` SignalR hub pushes `LastCircuitBreakerOpenUtc` + `CurrentBulkheadDepth` + `LastRecycleUtc` on change.
3. **E.3** `/hosts` Blazor page renders new columns; red badge if `ConsecutiveFailures > breakerThreshold / 2`.
## Compliance Checks (run at exit gate)
- [ ] **Invoker coverage**: every method on `IReadable` / `IWritable` / `ITagDiscovery` / `ISubscribable` / `IHostConnectivityProbe` / `IAlarmSource` / `IHistoryProvider` in the server dispatch layer routes through `CapabilityInvoker`. Enforce via a Roslyn analyzer (error-level; warning-first is rejected — the compliance check is the gate).
- [ ] **Write-retry guard**: writes without `[WriteIdempotent]` never get retried. Unit-test the invoker path asserts zero retry attempts.
- [ ] **Pipeline isolation**: pipeline key is `(DriverInstanceId, HostName)`. Integration test with two Modbus hosts under one instance — failing host A does not open the breaker for host B.
- [ ] **Tier registry**: every driver type registered in `DriverTypeRegistry` has a non-null `Tier`. Unit test walks the registry + asserts no gaps. Tier C registrations must declare their out-of-process topology.
- [ ] **MemoryTracking never kills**: soft/hard breach tests on a Tier A/B driver log + surface without terminating the process.
- [ ] **MemoryRecycle Tier C only**: hard breach on a Tier A driver never invokes the supervisor; on Tier C it does.
- [ ] **Wedge demand-aware**: test suite includes idle-subscription-only, slow-historian-backfill, and write-only-burst cases — driver stays Healthy.
- [ ] **Galaxy supervisor preserved**: `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` still present + still invoked on Host crash.
- [ ] **Health state machine**: `/healthz` + `/readyz` respond within 500 ms for every `DriverState`; state-machine table in this doc drives the test matrix.
- [ ] **Structured log**: CI grep asserts at least one log line per capability call has `"DriverInstanceId"` + `"CorrelationId"` JSON fields.
- [ ] **Generation-sealed cache**: integration tests cover (a) SQL-kill mid-operation serves last-sealed snapshot; (b) mixed-generation corruption fails closed; (c) first-boot no-snapshot + DB-down → `InitializeAsync` fails with clear error.
- [ ] No regression in existing test suites — `dotnet test ZB.MOM.WW.OtOpcUa.slnx` count equal-or-greater than pre-Phase-6.1 baseline.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| Polly pipeline adds per-request latency on hot path | Medium | Medium | Benchmark Stream A.5 before merging; 1 % overhead budget; inline hot path short-circuits when retry count = 0 |
| LiteDB cache diverges from central DB | Medium | High | Stale-data banner in Admin UI; `UsingStaleConfig` flag surfaced on `/readyz`; cache refresh on every successful DB round-trip; 24-hour synthetic warning |
| Tier watchdog false-positive-kills a legitimate batch load | Low | High | Soft/hard threshold split; soft only logs; hard triggers recycle; thresholds configurable per-instance |
| Wedge detector races with slow-but-healthy drivers | Medium | High | Minimum 60 s threshold; detector only activates if driver claims `Healthy`; add circuit-breaker feedback so rapid oscillation trips instead of thrashing |
| Roslyn analyzer breaks external driver authors | Low | Medium | Release analyzer as warning-level initially; upgrade to error in Phase 6.1+1 after one release cycle |
## Completion Checklist
- [ ] Stream A: Polly shared pipeline + per-tier defaults + driver-capability invoker + tests
- [ ] Stream B: Tier registry + generalised watchdog + scheduled recycle + wedge detector
- [ ] Stream C: `/healthz` + `/readyz` + structured logging + JSON Serilog sink
- [ ] Stream D: LiteDB cache + Polly fallback in Configuration
- [ ] Stream E: Admin `/hosts` page refresh
- [ ] Cross-cutting: `phase-6-1-compliance.ps1` exits 0; full solution `dotnet test` passes; exit-gate doc recorded
## Adversarial Review — 2026-04-19 (Codex, thread `019da489-e317-7aa1-ab1f-6335e0be2447`)
Plan substantially rewritten before implementation to address these findings. Each entry: severity · verdict · adjustment.
1. **Crit · ACCEPT** — Auto-retry collides with decisions #44/#45 (no auto-write-retry; opt-in via `WriteIdempotent` + CAS). Pipeline now **capability-specific**: Read/HistoryRead/Discover/Probe/Alarm-subscribe all get retries; **Write does not** unless the tag metadata carries `WriteIdempotent=true`. New `WriteIdempotentAttribute` surfaces on `ModbusTagDefinition` / `S7TagDefinition` / etc.
2. **Crit · ACCEPT** — "One pipeline per driver instance" breaks decision #35's per-device isolation. **Change**: pipeline key is `(DriverInstanceId, HostName)` not just `DriverInstanceId`. One dead PLC behind a multi-device Modbus driver no longer opens the breaker for healthy siblings.
3. **Crit · ACCEPT** — Memory watchdog + scheduled recycle at Tier A/B breaches decisions #73/#74 (process-kill protections are Tier-C-only). **Change**: Stream B splits into two — `MemoryTracking` (all tiers, soft/hard thresholds log + surface to Admin `/hosts`; never kills) and `MemoryRecycle` (Tier C only, requires out-of-process topology). Tier A/B overrun paths escalate to Tier C via a future PR, not auto-kill.
4. **High · ACCEPT** — Removing Galaxy's hand-rolled `CircuitBreaker` drops decision #68 host-supervision crash-loop protection. **Change**: keep `Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs` + `Backoff.cs` — they guard the IPC *process* re-spawn, not the per-call data path. Data-path Polly is an orthogonal layer.
5. **High · ACCEPT** — Roslyn analyzer targeting `IDriver` misses the hot paths (`IReadable.ReadAsync`, `IWritable.WriteAsync`, `ISubscribable.SubscribeAsync` etc.). **Change**: analyzer rule now matches every method on the capability interfaces; compliance doc enumerates the full call-site list.
6. **High · ACCEPT**`/healthz` + `/readyz` under-specified for degraded-running. **Change**: add a state-matrix sub-section explicitly covering `Unknown` (pre-init: `/readyz` 503), `Initializing` (503), `Healthy` (200), `Degraded` (200 with JSON body flagging the degraded driver; `/readyz` is OR across drivers), `Faulted` (503), plus cached-config-serving (`/healthz` returns 200 + `UsingStaleConfig: true` in JSON body).
7. **High · ACCEPT**`WedgeDetector` based on "no successful Read" false-fires on write-only subscriptions + idle systems. **Change**: wedge criteria now `(hasPendingWork AND noProgressIn > threshold)` where `hasPendingWork` comes from the Polly bulkhead depth + active MonitoredItem count. Idle driver stays Healthy.
8. **High · ACCEPT** — LiteDB cache serving mixed-generation reads breaks publish atomicity. **Change**: cache is snapshot-per-generation. Each published generation writes a sealed snapshot into `<cache-root>/<cluster>/<generationId>.db`; reads serve the last-known-sealed generation and never mix. Central DB outage during a *publish* means that publish fails (write path doesn't use cache); reads continue from the prior sealed snapshot.
9. **Med · ACCEPT**`DriverHostStatus` schema conflates per-host connectivity with per-driver-instance resilience counters. **Change**: new `DriverInstanceResilienceStatus` table separate from `DriverHostStatus`. Admin `/hosts` joins both for display.
10. **Med · ACCEPT** — Compliance says analyzer-error; risks say analyzer-warning. **Change**: phase 6.1 ships at **error** level (this phase is the gate); warning-mode option removed.
11. **Med · ACCEPT** — Hardcoded per-tier MB bands ignore decision #70's `max(multiplier × baseline, baseline + floor)` formula with observed-baseline capture. **Change**: watchdog captures baseline at post-init plateau (median of first 5 min GetMemoryFootprint samples) + applies the hybrid formula. Tier constants now encode the multiplier + floor, not raw MB.
12. **Med · ACCEPT** — Tests mostly cover happy path. **Change**: Stream A.5 adds negative tests for duplicate-write-replay-under-timeout; Stream B.5 adds false-wedge-on-idle-subscription + false-wedge-on-slow-historic-backfill; Stream D.4 adds mixed-generation cache test + corrupt-first-boot cache test.

View File

@@ -0,0 +1,153 @@
# Phase 6.2 — Authorization Runtime (ACL + LDAP grants)
> **Status**: **SHIPPED (core)** 2026-04-19 — Streams A, B, C (foundation), D (data layer) merged to `v2` across PRs #84-87. Final exit-gate PR #88 turns the compliance stub into real checks (all pass, 2 deferred surfaces tracked).
>
> Deferred follow-ups (tracked separately):
> - Stream C dispatch wiring on the 11 OPC UA operation surfaces (task #143).
> - Stream D Admin UI — RoleGrantsTab, AclsTab Probe-this-permission, SignalR invalidation, draft-diff ACL section + visual-compliance reviewer signoff (task #144).
>
> Baseline pre-Phase-6.2: 1042 solution tests → post-Phase-6.2 core: 1097 passing (+55 net). One pre-existing Client.CLI Subscribe flake unchanged.
>
> **Branch**: `v2/phase-6-2-authorization-runtime`
> **Estimated duration**: 2.5 weeks
> **Predecessor**: Phase 6.1 (Resilience & Observability) — reuses the Polly pipeline for ACL-cache refresh retries
> **Successor**: Phase 6.3 (Redundancy)
## Phase Objective
Wire ACL enforcement on every OPC UA Read / Write / Subscribe / Call path + LDAP group → admin role grants that the v2 plan specified but never ran. End-state: a user's effective permissions resolve through a per-session permission-trie over the 6-level `Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag` hierarchy, cached per session, invalidated on generation-apply + LDAP group expiry.
Closes these gaps:
1. **Data-path ACL enforcement**`NodeAcl` table + `NodePermissions` flags shipped; `NodeAclService.cs` present as a CRUD surface; no code consults ACLs at `Read`/`Write` time. OPC UA server answers everything to everyone.
2. **`LdapGroupRoleMapping` for cluster-scoped admin grants** — decision #105 shipped as the *design*; admin roles are hardcoded (`FleetAdmin` / `ConfigEditor` / `ReadOnly`) with no cluster-scoping and no LDAP-to-grant table. Decision #105 explicitly lifts this from v2.1 into v2.0.
3. **Explicit Deny pathway** — deferred to v2.1 (decision #129 note). Phase 6.2 ships *grants only*; `Deny` stays out.
4. **Admin UI ACL grant editor**`AclsTab.razor` exists but edits the now-unused `NodeAcl` table; needs to wire to the runtime evaluator + the new `LdapGroupRoleMapping` table.
## Scope — What Changes
**Architectural separation** (critical for correctness): `LdapGroupRoleMapping` is **control-plane only** — it maps LDAP groups to Admin UI roles (`FleetAdmin` / `ConfigEditor` / `ReadOnly`) and cluster scopes for Admin access. **It is NOT consulted by the OPC UA data-path evaluator.** The data-path evaluator reads `NodeAcl` rows joined directly against the session's **resolved LDAP group memberships**. The two concerns share zero runtime code path.
| Concern | Change |
|---------|--------|
| `Configuration` project | New entity `LdapGroupRoleMapping { Id, LdapGroup, Role, ClusterId? (nullable = system-wide), IsSystemWide, GeneratedAtUtc }`. **Consumed only by Admin UI role routing.** Migration. Admin CRUD. |
| `Core` → new `Core.Authorization` sub-namespace | `IPermissionEvaluator.Authorize(IEnumerable<Claim> identity, OpcUaOperation op, NodeId nodeId) → AuthorizationDecision`. `op` covers every OPC UA surface: Browse, Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge, Confirm, Shelve. Result is tri-state (internal model distinguishes `Allow` / `NotGranted` / `Denied` + carries matched-grant provenance). Phase 6.2 only produces `Allow` + `NotGranted`; v2.1 Deny lands without API break. |
| `PermissionTrieBuilder` | Builds trie from `NodeAcl` rows joined against **resolved LDAP group memberships**, keyed on 6-level scope hierarchy for Equipment namespaces. **SystemPlatform namespaces (Galaxy)** use a `FolderSegment` scope level between Namespace and Tag, populated from `Tag.FolderPath` segments, so folder subtree authorization works on Galaxy trees the same way UNS works on Equipment trees. Trie node carries `ScopeKind` enum. |
| `PermissionTrieCache` + freshness | One trie per `(ClusterId, GenerationId)`. Invalidated on `sp_PublishGeneration` via in-process event bus AND generation-ID check on hot path — every authz call looks up `CurrentGenerationId` (Polly-wrapped, sub-second cache); a Backup that cached a stale generation detects the mismatch + forces re-load. **Redundancy-safe**. |
| `UserAuthorizationState` freshness | Cached per session BUT bounded by `MembershipFreshnessInterval` (default **15 min**). Past that, the next hot-path authz call re-resolves LDAP group memberships via `LdapGroupService`. Failure to re-resolve (LDAP unreachable) → **fail-closed**: evaluator returns `NotGranted` for every call until memberships refresh successfully. Decoupled from Phase 6.1's availability-oriented 24h cache. |
| `AuthCacheMaxStaleness` | Separate from Phase 6.1's `UsingStaleConfig` window. Default 5 min — beyond that, authz fails closed regardless of Phase 6.1 cache warmth. |
| OPC UA server dispatch — all enforcement surfaces | `DriverNodeManager` wires evaluator on: **Browse + TranslateBrowsePathsToNodeIds** (ancestors implicitly visible if any descendant has a grant; denied ancestors filter from results), **Read** (per-attribute StatusCode `BadUserAccessDenied` in mixed-authorization batches; batch never poisons), **Write** (uses `NodePermissions.WriteOperate/Tune/Configure` based on driver `SecurityClassification`), **HistoryRead** (uses `NodePermissions.HistoryRead`**distinct** flag, not Read), **HistoryUpdate** (`NodePermissions.HistoryUpdate`), **CreateMonitoredItems** (per-`MonitoredItemCreateResult` denial), **TransferSubscriptions** (re-evaluates items on transfer), **Call** (`NodePermissions.MethodCall`), **Acknowledge/Confirm/Shelve** (per-alarm flags). |
| Subscription re-authorization | Each `MonitoredItem` is stamped with `(AuthGenerationId, MembershipVersion)` at create time. On every Publish, items with a stamp mismatching the session's current `(AuthGenerationId, MembershipVersion)` get re-evaluated; revoked items drop to `BadUserAccessDenied` within one publish cycle. Unchanged items stay fast-path. |
| `LdapAuthService` | On cookie-auth success: resolves LDAP group memberships; loads matching `LdapGroupRoleMapping` rows → role claims + cluster-scope claims (control plane); stores `UserAuthorizationState.LdapGroups` on the session for the data-plane evaluator. |
| `ValidatedNodeAclAuthoringService` | Replaces CRUD-only `NodeAclService` for authoring. Validates (LDAP group exists, scope exists in current or target draft, grant shape is valid, no duplicate `(LdapGroup, Scope)` pair). Admin UI writes only through it. |
| Admin UI `AclsTab.razor` | Writes via `ValidatedNodeAclAuthoringService`. Adds Probe-This-Permission row that runs the real evaluator against a chosen `(LDAP group, node, operation)` and shows `Allow` / `NotGranted` + matched-grant provenance. |
| Admin UI new tab `RoleGrantsTab.razor` | CRUD over `LdapGroupRoleMapping`. Per-cluster + system-wide grants. FleetAdmin only. **Documentation explicit** that this only affects Admin UI access, not OPC UA data plane. |
| Audit log | Every Grant/Revoke/Publish on `LdapGroupRoleMapping` or `NodeAcl` writes an `AuditLog` row with old/new state + user. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| OPC UA authn | Already done (PR 19 LDAP user identity + Basic256Sha256 profile). Phase 6.2 is authorization only. |
| Explicit `Deny` grants | Decision #129 note explicitly defers to v2.1. Default-deny + additive grants only. |
| Driver-side `SecurityClassification` metadata | Drivers keep reporting `Operate` / `ViewOnly` / etc. — the evaluator uses them as *part* of the decision but doesn't replace them. |
| Galaxy namespace (SystemPlatform kind) | UNS levels don't apply; evaluator treats Galaxy nodes as `Cluster → Namespace → Tag` (skip UnsArea/UnsLine/Equipment). |
## Entry Gate Checklist
- [ ] Phase 6.1 merged (reuse `Core.Resilience` Polly pipeline for the ACL cache-refresh retries)
- [ ] `acl-design.md` re-read in full
- [ ] Decision log #105, #129, corrections-doc B1 re-skimmed
- [ ] Existing `NodeAcl` + `NodePermissions` flag enum audited; confirm bitmask flags match `acl-design.md` table
- [ ] Existing `LdapAuthService` group-resolution code path traced end-to-end — confirm it already queries group memberships (we only need the caller to consume the result)
- [ ] Test DB scenarios catalogued: two clusters, three LDAP groups per cluster, mixed grant shapes; captured as seed-data fixtures
## Task Breakdown
### Stream A — `LdapGroupRoleMapping` table + migration (3 days)
1. **A.1** Entity + EF Core migration. Columns per §Scope table. Unique constraint on `(LdapGroup, ClusterId)` with null-tolerant comparer for the system-wide case. Index on `LdapGroup` for the hot-path lookup on auth.
2. **A.2** `ILdapGroupRoleMappingService` CRUD. Wrap in the Phase 6.1 Polly pipeline (timeout → retry → fallback-to-cache).
3. **A.3** Seed-data migration: preserve the current hardcoded `FleetAdmin` / `ConfigEditor` / `ReadOnly` mappings by seeding rows for the existing LDAP groups the dev box uses (`cn=fleet-admin,…`, `cn=config-editor,…`, `cn=read-only,…`). Op no-op migration for existing deployments.
### Stream B — Permission-trie evaluator (1 week)
1. **B.1** `IPermissionEvaluator.Authorize(IEnumerable<Claim> identity, NodeId nodeId, NodePermissions needed)` — returns `bool`. Phase 6.2 returns only `true` / `false`; v2.1 can widen to `Allow`/`Deny`/`Indeterminate` if Deny lands.
2. **B.2** `PermissionTrieBuilder` builds the trie from `NodeAcl` + `LdapGroupRoleMapping` joined to the current generation's `UnsArea` + `UnsLine` + `Equipment` + `Tag` tables. One trie per `(ClusterId, GenerationId)` so rollback doesn't smear permissions across generations.
3. **B.3** Trie node structure: `{ Level: enum, ScopeId: Guid, AllowedPermissions: NodePermissions, ChildrenByLevel: Dictionary<Guid, TrieNode> }`. Evaluation walks from Cluster → Namespace → UnsArea → UnsLine → Equipment → Tag, ORing allowed permissions at each level. Additive semantics: a grant at Cluster level cascades to every descendant tag.
4. **B.4** `PermissionTrieCache` service scoped as singleton; exposes `GetTrieAsync(ClusterId, ct)` that returns the current-generation trie. Invalidated on `sp_PublishGeneration` via an in-process event bus; also on TTL expiry (24 h safety net).
5. **B.5** Per-session cached evaluator: OPC UA Session authentication produces `UserAuthorizationState { ClusterId, LdapGroups[], Trie }`; cached on the session until session close or generation-apply.
6. **B.6** Unit tests: trie-walk theory covering (a) Cluster-level grant cascades to tags, (b) Equipment-level grant doesn't leak to sibling Equipment, (c) multi-group union, (d) no-grant → deny, (e) Galaxy nodes skip UnsArea/UnsLine levels.
### Stream C — OPC UA server dispatch wiring (6 days, widened)
1. **C.1** `DriverNodeManager.Read` — evaluator consulted per `ReadValueId` with `OpcUaOperation.Read`. Denied attributes get `BadUserAccessDenied` per-item; batch never poisons. Integration test covers mixed-authorization batch (3 authorized + 2 denied → 3 Good values + 2 Bad StatusCodes, request completes).
2. **C.2** `DriverNodeManager.Write` — evaluator chooses `NodePermissions.WriteOperate` / `WriteTune` / `WriteConfigure` based on the driver-reported `SecurityClassification`.
3. **C.3** `DriverNodeManager.HistoryRead`**uses `NodePermissions.HistoryRead`**, which is a **distinct flag** from Read. Test: user with Read but not HistoryRead can read live values but gets `BadUserAccessDenied` on `HistoryRead`.
4. **C.4** `DriverNodeManager.HistoryUpdate` — uses `NodePermissions.HistoryUpdate`.
5. **C.5** `DriverNodeManager.CreateMonitoredItems` — per-`MonitoredItemCreateResult` denial in mixed-authorization batch; partial success path per OPC UA Part 4. Each created item stamped `(AuthGenerationId, MembershipVersion)`.
6. **C.6** `DriverNodeManager.TransferSubscriptions` — on reconnect, re-evaluate every transferred `MonitoredItem` against the session's current auth state. Stale-stamp items drop to `BadUserAccessDenied`.
7. **C.7** **Browse + TranslateBrowsePathsToNodeIds** — evaluator called with `OpcUaOperation.Browse`. Ancestor visibility implied when any descendant has a grant (per `acl-design.md` §Browse). Denied ancestors filter from browse results — the UA browser sees a hierarchy truncated at the denied ancestor rather than an inconsistent child-without-parent view.
8. **C.8** `DriverNodeManager.Call``NodePermissions.MethodCall`.
9. **C.9** Alarm actions (Acknowledge / Confirm / Shelve) — per-alarm `NodePermissions.AlarmAck` / `AlarmConfirm` / `AlarmShelve`.
10. **C.10** Publish path — for each `MonitoredItem` with a mismatched `(AuthGenerationId, MembershipVersion)` stamp, re-evaluate. Unchanged items stay fast-path; changes happen at next publish cycle.
11. **C.11** Integration tests: three-user seed with different memberships; matrix covers every operation in §Scope. Mixed-batch tests for Read + CreateMonitoredItems.
### Stream D — Admin UI refresh (4 days)
1. **D.1** `RoleGrantsTab.razor` — FleetAdmin-gated CRUD on `LdapGroupRoleMapping`. Per-cluster dropdown + system-wide checkbox. Validation: LDAP group must exist in the dev LDAP (GLAuth) before saving — best-effort probe with graceful degradation.
2. **D.2** `AclsTab.razor` rewrites its edit path to write through the new `NodeAclService`. Adds a "Probe this permission" row: choose `(LDAP group, node, action)` → shows Allow / Deny + the reason (which grant matched).
3. **D.3** Draft-generation diff viewer now includes an ACL section: "X grants added, Y grants removed, Z grants changed."
4. **D.4** SignalR notification: `PermissionTrieCache` invalidation on `sp_PublishGeneration` pushes to Admin UI so operators see "this clusters permissions were just updated" within 2 s.
## Compliance Checks (run at exit gate)
- [ ] **Control/data-plane separation**: `LdapGroupRoleMapping` consumed only by Admin UI; the data-path evaluator has zero references to it. Enforced via a project-reference audit (Admin project references the mapping service; `Core.Authorization` does not).
- [ ] **Every operation wired**: Browse, Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge, Confirm, Shelve all consult the evaluator. Integration test matrix covers every operation × allow/deny.
- [ ] **HistoryRead uses its own flag**: test "user with Read + no HistoryRead gets `BadUserAccessDenied` on HistoryRead".
- [ ] **Mixed-batch semantics**: Read of 5 nodes (3 allowed + 2 denied) returns 3 Good + 2 `BadUserAccessDenied` per-`ReadValueId`; CreateMonitoredItems equivalent.
- [ ] **Browse ancestor visibility**: user with a grant only on a deep equipment node can browse the path to it (ancestors implied); denied ancestors filter from browse results otherwise.
- [ ] **Galaxy FolderSegment coverage**: a grant on a Galaxy folder subtree cascades to its tags; sibling folders are unaffected. Trie test covers this.
- [ ] **Subscription re-authorization**: integration test — create item, revoke grant via draft+publish, next publish cycle the item returns `BadUserAccessDenied` (not silently still-notifying).
- [ ] **Membership freshness**: test — 15 min MembershipFreshnessInterval elapses on a long-lived session + LDAP now unreachable → authz fails closed on the next request until LDAP recovers.
- [ ] **Auth cache fail-closed**: test — Phase 6.1 cache serves stale config for 6 min; authz evaluator refuses all calls after 5 min regardless.
- [ ] **Trie invariants**: `PermissionTrieBuilder` is idempotent (build twice with identical inputs → equal tries).
- [ ] **Additive grants + cluster isolation**: cluster-grant cascades; cross-cluster leakage impossible.
- [ ] **Redundancy-safe invalidation**: integration test — two nodes, a publish on one, authorize a request on the other before in-process event propagates → generation-mismatch forces re-load, no stale decision.
- [ ] **Authoring validation**: `AclsTab` cannot save a `(LdapGroup, Scope)` pair that already exists in the draft; operator sees the validation error pre-save.
- [ ] **AuthorizationDecision shape stability**: API surface exposes `Allow` + `NotGranted` only; `Denied` variant exists in the type but is never produced; v2.1 can add Deny without API break.
- [ ] No regression in driver test counts.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| ACL evaluator latency on per-read hot path | Medium | High | Trie lookup is O(depth) = O(6); session-cached UserAuthorizationState avoids per-Read trie rebuild; benchmark in Stream B.6 |
| Trie cache stale after a rollback | Medium | High | `sp_PublishGeneration` + `sp_RollbackGeneration` both emit the invalidation event; trie keyed on `(ClusterId, GenerationId)` so rollback fetches the prior trie cleanly |
| `BadUserAccessDenied` returns expose sensitive browse-name metadata | Low | Medium | Server returns only the status code + NodeId; no message leak per OPC UA Part 4 §7.34 guidance |
| LdapGroupRoleMapping migration breaks existing deployments | Low | High | Seed-migration preserves the hardcoded groups' effective grants verbatim; smoke test exercises the post-migration fleet admin login |
| Deny semantics accidentally ship (would break `acl-design.md` defer) | Low | Medium | `IPermissionEvaluator.Authorize` returns `bool` (not tri-state) through Phase 6.2; widening to `Allow`/`Deny`/`Indeterminate` is a v2.1 ticket |
## Completion Checklist
- [ ] Stream A: `LdapGroupRoleMapping` entity + migration + CRUD + seed
- [ ] Stream B: evaluator + trie builder + cache + per-session state + unit tests
- [ ] Stream C: OPC UA dispatch wiring on Read/Write/HistoryRead/Subscribe/Alarm paths
- [ ] Stream D: Admin UI `RoleGrantsTab` + `AclsTab` refresh + SignalR invalidation
- [ ] `phase-6-2-compliance.ps1` exits 0; exit-gate doc recorded
## Adversarial Review — 2026-04-19 (Codex, thread `019da48d-0d2b-7171-aed2-fc05f1f39ca3`)
1. **Crit · ACCEPT** — Trie must not conflate `LdapGroupRoleMapping` (control-plane admin claims per decision #105) with data-plane ACLs (decision #129). **Change**: `LdapGroupRoleMapping` is consumed only by the Admin UI role router. Data-plane trie reads `NodeAcl` rows joined against the session's **resolved LDAP groups**, never admin roles. Stream B.2 updated.
2. **Crit · ACCEPT** — Cached `UserAuthorizationState` survives LDAP group changes because memberships only refresh at cookie-auth. Change: add `MembershipFreshnessInterval` (default 15 min); past that, next hot-path authz call forces group re-resolution (fail-closed if LDAP unreachable). Session-close-wins on config-rollback.
3. **High · ACCEPT** — Node-local invalidation doesn't extend across redundant pair. **Change**: trie keyed on `(ClusterId, GenerationId)`; hot-path authz looks up `CurrentGenerationId` from the shared config DB (Polly-wrapped + sub-second cache). A Backup that read stale generation gets a mismatched trie → forces re-load. Implementation note added to Stream B.4.
4. **High · ACCEPT** — Browse enforcement missing. **Change**: new Stream C.7 (`Browse + TranslateBrowsePathsToNodeIds` enforcement). Ancestor visibility implied when any descendant has a grant; denied ancestors filter from browse results per `acl-design.md` §Browse.
5. **High · ACCEPT**`HistoryRead` should use `NodePermissions.HistoryRead` bit, not `Read`. **Change**: Stream C.3 revised; separate unit test asserts `Read+no-HistoryRead` denies HistoryRead while allowing current-value reads.
6. **High · ACCEPT** — Galaxy shallow-path (Cluster→Namespace→Tag) loses folder hierarchy authorization. **Change**: SystemPlatform namespaces use a `FolderSegment` scope-level between Namespace and Tag, populated from `Tag.FolderPath`; UNS-kind namespaces keep the 6-level hierarchy. Trie supports both via `ScopeKind` on each node.
7. **High · ACCEPT** — Subscription re-authorization policy unresolved between create-time-only (fast, wrong on revoke) and per-publish (slow). **Change**: stamp each `MonitoredItem` with `(AuthGenerationId, MembershipVersion)`; re-evaluate on Publish only when either version changed. Revoked items drop to `BadUserAccessDenied` within one publish cycle.
8. **Med · ACCEPT** — Mixed-authorization batch `Read` / `CreateMonitoredItems` service-result semantics underspecified. **Change**: Stream C.6 explicitly tests per-`ReadValueId` + per-`MonitoredItemCreateResult` denial in mixed batches; batch never collapses to a coarse failure.
9. **Med · ACCEPT** — Missing surfaces: `Method.Call`, `HistoryUpdate`, event filter on subscriptions, subscription-transfer on reconnect, alarm-ack. **Change**: scope expanded — every OPC UA authorization surface enumerated in Stream C: Read, Write, HistoryRead, HistoryUpdate, CreateMonitoredItems, TransferSubscriptions, Call, Acknowledge/Confirm/Shelve, Browse, TranslateBrowsePathsToNodeIds.
10. **Med · ACCEPT**`bool` evaluator bakes in grant-only semantics; collides with v2.1 Deny. **Change**: internal model uses `AuthorizationDecision { Allow | NotGranted | Denied, IReadOnlyList<MatchedGrant> Provenance }`. Phase 6.2 maps `Denied` → never produced; UI + audit log use the full record so v2.1 Deny lands without API break.
11. **Med · ACCEPT** — 6.1 cache fallback is availability-oriented; applying it to auth is correctness-dangerous. **Change**: auth-specific staleness budget `AuthCacheMaxStaleness` (default 5 min, not 24 h). Past that, hot-path evaluator fails closed on cached reads; all authorization calls return `NotGranted` until fresh data lands. Documented in risks + compliance.
12. **Low · ACCEPT** — Existing `NodeAclService` is raw CRUD. **Change**: new `ValidatedNodeAclAuthoringService` enforces scope-uniqueness + draft/publish invariants + rejects invalid (LDAP group, scope) pairs; Admin UI writes through it only. Stream D.2 adjusted.

View File

@@ -0,0 +1,159 @@
# Phase 6.3 — Redundancy Runtime
> **Status**: **SHIPPED (core)** 2026-04-19 — Streams B (ServiceLevelCalculator + RecoveryStateManager) and D core (ApplyLeaseRegistry) merged to `v2` in PR #89. Exit gate in PR #90.
>
> Deferred follow-ups (tracked separately):
> - Stream A — RedundancyCoordinator cluster-topology loader (task #145).
> - Stream C — OPC UA node wiring: ServiceLevel + ServerUriArray + RedundancySupport (task #147).
> - Stream E — Admin UI RedundancyTab + OpenTelemetry metrics + SignalR (task #149).
> - Stream F — client interop matrix + Galaxy MXAccess failover test (task #150).
> - sp_PublishGeneration pre-publish validator rejecting unsupported RedundancyMode values (task #148 part 2 — SQL-side).
>
> Baseline pre-Phase-6.3: 1097 solution tests → post-Phase-6.3 core: 1137 passing (+40 net).
>
> **Branch**: `v2/phase-6-3-redundancy-runtime`
> **Estimated duration**: 2 weeks
> **Predecessor**: Phase 6.2 (Authorization) — reuses the Phase 6.1 health endpoints for cluster-peer probing
> **Successor**: Phase 6.4 (Admin UI completion)
## Phase Objective
Land the non-transparent redundancy protocol end-to-end: two `OtOpcUa.Server` instances in a `ServerCluster` each expose a live `ServiceLevel` node whose value reflects that instance's suitability to serve traffic, advertise each other via `ServerUriArray`, and transition role (Primary ↔ Backup) based on health + operator intent.
Closes these gaps:
1. **Dynamic `ServiceLevel`** — OPC UA Part 5 §6.3.34 specifies a Byte (0..255) that clients poll to pick the healthiest server. Our server publishes it as a static value today.
2. **`ServerUriArray` broadcast** — Part 4 specifies that every node in a redundant pair should advertise its peers' ApplicationUris. Currently advertises only its own.
3. **Primary / Backup role coordination** — entities carry `RedundancyRole` but the runtime doesn't read it; no peer health probing; no role-transfer on primary failure.
4. **Mid-apply dip** — decision-level expectation that a server mid-generation-apply should report a *lower* ServiceLevel so clients cut over to the peer during the apply window. Not implemented.
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `OtOpcUa.Server` → new `Server.Redundancy` sub-namespace | `RedundancyCoordinator` singleton. Resolves the current node's `ClusterNode` row at startup, loads peers, runs **two-layer peer health probe**: (a) `/healthz` every 2 s as the fast-fail (inherits Phase 6.1 semantics — HTTP + DB/cache healthy); (b) `UaHealthProbe` every 10 s — opens a lightweight OPC UA client session to the peer + reads its `ServiceLevel` node + verifies endpoint serves data. Authority decisions use UaHealthProbe; `/healthz` is used only to avoid wasting UA probes when peer is obviously down. |
| Publish-generation fencing | Topology + role decisions are stamped with a monotonic `ConfigGenerationId` from the shared config DB. Coordinator re-reads topology via CAS on `(ClusterId, ExpectedGeneration)` → new row; peers reject state propagated from a lower generation. Prevents split-publish races. |
| `InvalidTopology` runtime state | If both nodes detect >1 Primary AFTER startup (config-DB drift during a publish), both self-demote to ServiceLevel 2 until convergence. Neither node serves authoritatively; clients pick the healthier alternative or reconnect later. |
| OPC UA server root | `ServiceLevel` variable node becomes a `BaseDataVariable` whose value updates on `RedundancyCoordinator` state change. `ServerUriArray` array variable includes **self + peers** in stable deterministic ordering (decision per OPC UA Part 4 §6.6.2.2). `RedundancySupport` stays static (set from `RedundancyMode` at startup); `Transparent` mode validated pre-publish, not rejected at startup. |
| `RedundancyCoordinator` computation | **8-state ServiceLevel matrix** — avoids OPC UA Part 5 §6.3.34 collision (`0=Maintenance`, `1=NoData`). Operator-declared maintenance only = **0**. Unreachable / Faulted = **1**. In-range operational states occupy **2..255**: Authoritative-Primary = **255**; Isolated-Primary (peer unreachable, self serving) = **230**; Primary-Mid-Apply = **200**; Recovering-Primary (post-fault, dwell not met) = **180**; Authoritative-Backup = **100**; Isolated-Backup (primary unreachable, "take over if asked") = **80**; Backup-Mid-Apply = **50**; Recovering-Backup = **30**; `InvalidTopology` (runtime detects >1 Primary) = **2** (detected-inconsistency band — below normal operation). Full matrix documented in `docs/Redundancy.md` update. |
| Role transition | Split-brain avoidance: role is *declared* in the shared config DB (`ClusterNode.RedundancyRole`), not elected at runtime. An operator flips the row (or a failover script does). Coordinator only reads; never writes. |
| `sp_PublishGeneration` hook | Uses named **apply leases** keyed to `(ConfigGenerationId, PublishRequestId)`. `await using var lease = coordinator.BeginApplyLease(...)`. Disposal on any exit path (success, exception, cancellation) decrements. Watchdog auto-closes any lease older than `ApplyMaxDuration` (default 10 min) → ServiceLevel can't stick at mid-apply. Pre-publish validator rejects unsupported `RedundancyMode` (e.g. `Transparent`) with a clear error so runtime never sees an invalid state. |
| Admin UI `/cluster/{id}` page | New `RedundancyTab.razor` — shows current node's role + ServiceLevel + peer reachability. FleetAdmin can trigger a role-swap by editing `ClusterNode.RedundancyRole` + publishing a draft. |
| Metrics | New OpenTelemetry metrics: `ot_opcua_service_level{cluster,node}`, `ot_opcua_peer_reachable{cluster,node,peer}`, `ot_opcua_apply_in_progress{cluster,node}`. Sink via Phase 6.1 observability layer. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| OPC UA authn / authz | Phases 6.2 + prior. Redundancy is orthogonal. |
| Driver layer | Drivers aren't redundancy-aware; they run on each node independently against the same equipment. The server layer handles the ServiceLevel story. |
| Automatic failover / election | Explicitly out of scope. Non-transparent = client picks which server to use via ServiceLevel + ServerUriArray. We do NOT ship consensus, leader election, or automatic promotion. Operator-driven failover is the v2.0 model per decision #7985. |
| Transparent redundancy (`RedundancySupport=Transparent`) | Not supported. If the operator asks for it the server fails startup with a clear error. |
| Historian redundancy | Galaxy Historian's own redundancy (two historians on two CPUs) is out of scope. The Galaxy driver talks to whichever historian is reachable from its node. |
## Entry Gate Checklist
- [ ] Phase 6.1 merged (uses `/healthz` for peer probing)
- [ ] `CLAUDE.md` §Redundancy + `docs/Redundancy.md` re-read
- [ ] Decisions #7985 re-skimmed
- [ ] `ServerCluster`/`ClusterNode`/`RedundancyRole`/`RedundancyMode` entities + existing migration reviewed
- [ ] OPC UA Part 4 §Redundancy + Part 5 §6.3.34 (ServiceLevel) re-skimmed
- [ ] Dev box has two OtOpcUa.Server instances configured against the same cluster — one designated Primary, one Backup — for integration testing
## Task Breakdown
### Stream A — Cluster topology loader (3 days)
1. **A.1** `RedundancyCoordinator` startup path: reads `ClusterNode` row for the current node (identified by `appsettings.json` `Cluster:NodeId`), reads the cluster's peer list, validates invariants (no duplicate `ApplicationUri`, at most one `Primary` per cluster if `RedundancyMode.WarmActive`, at most two nodes total in v2.0 per decision #83).
2. **A.2** Topology subscription — coordinator re-reads on `sp_PublishGeneration` confirmation so an operator role-swap takes effect after publish (no process restart needed).
3. **A.3** Tests: two-node cluster seed, one-node cluster seed (degenerate), duplicate-uri rejection.
### Stream B — Peer health probing + ServiceLevel computation (6 days, widened)
1. **B.1** `PeerHttpProbeLoop` per peer at 2 s — calls peer's `/healthz`, 1 s timeout, exponential backoff on sustained failure. Used as fast-fail.
2. **B.2** `PeerUaProbeLoop` per peer at 10 s — opens an OPC UA client session to the peer (reuses Phase 5 `Driver.OpcUaClient` stack), reads peer's `ServiceLevel` node + verifies endpoint serves data. Short-circuit: if HTTP probe is failing, skip UA probe (no wasted sessions).
3. **B.3** `ServiceLevelCalculator.Compute(role, selfHealth, peerHttpHealthy, peerUaHealthy, applyInProgress, recoveryDwellMet, topologyValid) → byte`. 8-state matrix per §Scope. `topologyValid=false` forces InvalidTopology = 2 regardless of other inputs.
4. **B.4** `RecoveryStateManager`: after a `Faulted → Healthy` transition, hold driver in `Recovering` band (180 Primary / 30 Backup) for `RecoveryDwellTime` (default 60 s) AND require one positive publish witness (successful `Read` on a reference node) before entering Authoritative band.
5. **B.5** Calculator reacts to inputs via `IObserver` so changes immediately push to the OPC UA `ServiceLevel` node.
6. **B.6** Tests: **64-case matrix** covering role × self-health × peer-http × peer-ua × apply × recovery × topology. Specific cases flagged: Primary-with-unreachable-peer-serves-at-230 (authority retained); Backup-with-unreachable-primary-escalates-to-80 (not auto-promote); InvalidTopology demotes both nodes; Recovering dwell + publish-witness blocks premature return to 255.
### Stream C — OPC UA node wiring (3 days)
1. **C.1** `ServiceLevel` variable node created under `ServerStatus` at server startup. Type `Byte`, AccessLevel = CurrentRead only. Subscribe to `ServiceLevelCalculator` observable; push updates via `DataChangeNotification`.
2. **C.2** `ServerUriArray` variable node under `ServerCapabilities`. Array of `String`, **includes self + peers** with deterministic ordering (self first). Updates on topology change. Compliance test asserts local-plus-peer membership.
3. **C.3** `RedundancySupport` variable — static at startup from `RedundancyMode`. Values: `None`, `Cold`, `Warm`, `WarmActive`, `Hot`. Unsupported values (`Transparent`, `HotAndMirrored`) are rejected **pre-publish** by validator — runtime never sees them.
4. **C.4** Client.CLI cutover test: connect to primary, read `ServiceLevel` → 255; pause primary apply → 200; unreachable peer while apply in progress → 200 (apply dominates peer-unreachable per matrix); client sees peer via `ServerUriArray`; fail primary → client reconnects to peer at 80 (isolated-backup band).
### Stream D — Apply-window integration (3 days)
1. **D.1** `sp_PublishGeneration` caller wraps the apply in `await using var lease = coordinator.BeginApplyLease(generationId, publishRequestId)`. Lease keyed to `(ConfigGenerationId, PublishRequestId)` so concurrent publishes stay isolated. Disposal decrements on every exit path.
2. **D.2** `ApplyLeaseWatchdog` auto-closes leases older than `ApplyMaxDuration` (default 10 min) so a crashed publisher can't pin the node at mid-apply.
3. **D.3** Pre-publish validator in `sp_PublishGeneration` rejects unsupported `RedundancyMode` values (`Transparent`, `HotAndMirrored`) with a clear error message — runtime never sees an invalid mode.
4. **D.4** Tests: (a) mid-apply client subscribes → sees ServiceLevel drop → sees restore; (b) lease leak via `ThreadAbort` / cancellation → watchdog closes; (c) publish rejected for `Transparent` → operator-actionable error.
### Stream E — Admin UI + metrics (3 days)
1. **E.1** `RedundancyTab.razor` under `/cluster/{id}/redundancy`. Shows each node's role, current ServiceLevel (with band label per 8-state matrix), peer reachability (HTTP + UA probe separately), last apply timestamp. Role-swap button posts a draft edit on `ClusterNode.RedundancyRole`; publish applies.
2. **E.2** OpenTelemetry meter export: `ot_opcua_service_level{cluster,node}` gauge + `ot_opcua_peer_reachable{cluster,node,peer,kind=http|ua}` + `ot_opcua_apply_in_progress{cluster,node}` + `ot_opcua_topology_valid{cluster}`. Sink via Phase 6.1 observability.
3. **E.3** SignalR push: `FleetStatusHub` broadcasts ServiceLevel changes so the Admin UI updates within ~1 s of the coordinator observing a peer flip.
### Stream F — Client-interoperability matrix (3 days, new)
1. **F.1** Validate ServiceLevel-driven cutover against **Ignition 8.1 + 8.3**, **Kepware KEPServerEX 6.x**, **Aveva OI Gateway 2020R2 + 2023R1**. For each: configure the client with both endpoints, verify it honors `ServiceLevel` + `ServerUriArray` during primary failover.
2. **F.2** Clients that don't honour the standards (doc field — may include Kepware and OI Gateway per Codex review) get an explicit compatibility-matrix entry: "requires manual backup-endpoint config / vendor-specific redundancy primitives". Documented in `docs/Redundancy.md`.
3. **F.3** Galaxy MXAccess failover test — boot Galaxy.Proxy on both nodes, kill Primary, assert Galaxy consumer reconnects to Backup within `(SessionTimeout + KeepAliveInterval × 3)`. Document required session-timeout config in `docs/Redundancy.md`.
## Compliance Checks (run at exit gate)
- [ ] **OPC UA band compliance**: `0=Maintenance` reserved, `1=NoData` reserved. Operational states in 2..255 per 8-state matrix.
- [ ] **Authoritative-Primary** ServiceLevel = 255.
- [ ] **Isolated-Primary** (peer unreachable, self serving) = 230 — Primary retains authority.
- [ ] **Primary-Mid-Apply** = 200.
- [ ] **Recovering-Primary** = 180 with dwell + publish witness enforced.
- [ ] **Authoritative-Backup** = 100.
- [ ] **Isolated-Backup** (primary unreachable) = 80 — does NOT auto-promote.
- [ ] **InvalidTopology** = 2 — both nodes self-demote when >1 Primary detected runtime.
- [ ] **ServerUriArray** returns self + peer URIs, self first.
- [ ] **UaHealthProbe authority**: integration test — peer returns HTTP 200 but OPC UA endpoint unreachable → coordinator treats peer as UA-unhealthy; peer is not a valid authority source.
- [ ] **Apply-lease disposal**: leases close on exception, cancellation, and watchdog timeout; ServiceLevel never sticks at mid-apply band.
- [ ] **Transparent-mode rejection**: attempting to publish `RedundancyMode=Transparent` is blocked at `sp_PublishGeneration`; runtime never sees an invalid mode.
- [ ] **Role transition via operator publish**: FleetAdmin swaps `RedundancyRole` in a draft, publishes; both nodes re-read topology on publish confirmation + flip ServiceLevel — no restart.
- [ ] **Client.CLI cutover**: with primary halted, Client.CLI that was connected to primary sees primary drop + reconnects to backup via `ServerUriArray`.
- [ ] **Client interoperability matrix** (Stream F): Ignition 8.1 + 8.3 honour ServiceLevel; Kepware + Aveva OI Gateway findings documented.
- [ ] **Galaxy MXAccess failover**: end-to-end test — primary kill → Galaxy consumer reconnects to backup within session-timeout budget.
- [ ] No regression in existing driver test suites; no regression in `/healthz` reachability under redundancy load.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| Split-brain from operator race (both nodes marked Primary) | Low | High | Coordinator rejects startup if its cluster has >1 Primary row; logs + fails fast. Document as a publish-time validation in `sp_PublishGeneration`. |
| ServiceLevel thrashing on flaky peer | Medium | Medium | 2 s probe interval + 3-sample smoothing window; only declares a peer unreachable after 3 consecutive failed probes |
| Client ignores ServiceLevel and stays on broken primary | Medium | Medium | Documented in `docs/Redundancy.md` — non-transparent redundancy requires client cooperation; most SCADA clients (Ignition, Kepware, Aveva OI Gateway) honor it. Unit-test the advertised values; field behavior is client-responsibility |
| Apply-window counter leaks on exception | Low | High | `BeginApplyWindow` returns `IDisposable`; `using` syntax enforces paired decrement; unit test for exception-in-apply path |
| `HttpClient` probe leaks sockets | Low | Medium | Single shared `HttpClient` per coordinator (not per-probe); timeouts tight to avoid keeping connections open during peer downtime |
## Completion Checklist
- [ ] Stream A: topology loader + tests
- [ ] Stream B: peer probe + ServiceLevel calculator + 32-case matrix tests
- [ ] Stream C: ServiceLevel / ServerUriArray / RedundancySupport node wiring + Client.CLI smoke test
- [ ] Stream D: apply-window integration + nested-apply counter
- [ ] Stream E: Admin `RedundancyTab` + OpenTelemetry metrics + SignalR push
- [ ] `phase-6-3-compliance.ps1` exits 0; exit-gate doc; `docs/Redundancy.md` updated with the ServiceLevel matrix
## Adversarial Review — 2026-04-19 (Codex, thread `019da490-3fa0-7340-98b8-cceeca802550`)
1. **Crit · ACCEPT** — No publish-generation fencing enables split-publish advertising both as authoritative. **Change**: coordinator CAS on a monotonic `ConfigGenerationId`; every topology decision is generation-stamped; peers reject state propagated from a lower generation.
2. **Crit · ACCEPT**`>1 Primary` at startup covered but runtime containment missing when invalid topology appears later (mid-apply race). **Change**: add runtime `InvalidTopology` state — both nodes self-demote to ServiceLevel 2 (the "detected inconsistency" band, below normal operation) until convergence.
3. **High · ACCEPT**`0 = Faulted` collides with OPC UA Part 5 §6.3.34 semantics where 0 means **Maintenance** and 1 means NoData. **Change**: reserve **0** for operator-declared maintenance-mode only; Faulted/unreachable uses **1** (NoData); in-range degraded states occupy 2..199.
4. **High · ACCEPT** — Matrix collapses distinct operational states onto the same value. **Change**: matrix expanded to Authoritative-Primary=255, Isolated-Primary=230 (peer unreachable — still serving), Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80 (primary unreachable — "take over if asked"), Backup-Mid-Apply=50, Recovering-Backup=30.
5. **High · ACCEPT**`/healthz` from 6.1 is HTTP-healthy but doesn't guarantee OPC UA data plane. **Change**: add a redundancy-specific probe `UaHealthProbe` — issues a `ReadAsync(ServiceLevel)` against the peer's OPC UA endpoint via a lightweight client session. `/healthz` remains the fast-fail; the UA probe is the authority signal.
6. **High · ACCEPT**`ServerUriArray` must include self + peers, not peers only. **Change**: array contains `[self.ApplicationUri, peer.ApplicationUri]` in stable deterministic ordering; compliance test asserts local-plus-peer membership.
7. **Med · ACCEPT** — No `Faulted → Recovering → Healthy` path. **Change**: add `Recovering` state with min dwell time (60 s default) + positive publish witness (one successful Read on a reference node) before returning to Healthy. Thrash-prevention.
8. **Med · ACCEPT** — Topology change during in-flight probe undefined. **Change**: every probe task tagged with `ConfigGenerationId` at dispatch; obsolete results discarded; in-flight probes cancelled on topology reload.
9. **Med · ACCEPT** — Apply-window counter race on exception/cancellation/async ownership. **Change**: apply-window is a named lease keyed to `(ConfigGenerationId, PublishRequestId)` with disposal enforced via `await using`; watchdog detects leased-but-abandoned and force-closes after `ApplyMaxDuration` (default 10 min).
10. **High · ACCEPT** — Ignition + Kepware + Aveva OI Gateway `ServiceLevel` compliance is unverified. **Change**: risk elevated to High; add Stream F (new) — build an interop matrix: validate against Ignition 8.1/8.3, Kepware KEPServerEX 6.x, Aveva OI Gateway 2020R2 + 2023R1. Document per-client cutover behaviour. Field deployments get a documented compatibility table; clients that ignore ServiceLevel documented as requiring explicit backup-endpoint config.
11. **Med · ACCEPT** — Galaxy MXAccess re-session on Primary death not in acceptance. **Change**: Stream F adds an end-to-end failover smoke test that boots Galaxy.Proxy on both nodes, kills Primary, asserts Galaxy consumer reconnects to Backup within `(SessionTimeout + KeepAliveInterval × 3)` budget. `docs/Redundancy.md` updated with required session timeouts.
12. **Med · ACCEPT** — Transparent-mode startup rejection is outage-prone. **Change**: `sp_PublishGeneration` validates `RedundancyMode` pre-publish — unsupported values reject the publish attempt with a clear validation error; runtime never sees an unsupported mode. Last-good config stays active.

View File

@@ -0,0 +1,142 @@
# Phase 6.4 — Admin UI Completion
> **Status**: **SHIPPED (data layer)** 2026-04-19 — Stream A.2 (UnsImpactAnalyzer + DraftRevisionToken) and Stream B.1 (EquipmentCsvImporter parser) merged to `v2` in PR #91. Exit gate in PR #92.
>
> Deferred follow-ups (Blazor UI + staging tables + address-space wiring):
> - Stream A UI — UnsTab MudBlazor drag/drop + 409 concurrent-edit modal + Playwright smoke (task #153).
> - Stream B follow-up — EquipmentImportBatch staging + FinaliseImportBatch transaction + CSV import UI (task #155).
> - Stream C — DiffViewer refactor into base + 6 section plugins + 1000-row cap + SignalR paging (task #156).
> - Stream D — IdentificationFields.razor + DriverNodeManager OPC 40010 sub-folder exposure (task #157).
>
> Baseline pre-Phase-6.4: 1137 solution tests → post-Phase-6.4 data layer: 1159 passing (+22).
>
> **Branch**: `v2/phase-6-4-admin-ui-completion`
> **Estimated duration**: 2 weeks
> **Predecessor**: Phase 6.3 (Redundancy runtime) — reuses the `/cluster/{id}` page layout for the new tabs
> **Successor**: v2 release-readiness capstone (Task #121)
## Phase Objective
Close the Admin UI feature-completeness checklist that Phase 1 Stream E exit gate left open. Each item below is an existing `phase-1-configuration-and-admin-scaffold.md` completion-checklist entry that is currently unchecked.
Gaps to close:
1. **UNS Structure tab drag/move with impact preview** — decision #115 + `admin-ui.md` §"UNS". Current state: list-only render; no drag reorder; no "X lines / Y equipment impacted" preview.
2. **Equipment CSV import + 5-identifier search** — decision #95 + #117. Current state: basic form; no CSV parser; search indexes only ZTag.
3. **Draft-generation diff viewer** — enhance existing `DiffViewer.razor` to show generation-diff not just staged-edit diff; highlight ACL grant changes (lands after Phase 6.2).
4. **`_base` equipment-class Identification fields exposure** — decision #138139. Columns exist on `Equipment`; no Admin UI field group; no address-space exposure of the OPC 40010 sub-folder.
## Scope — What Changes
| Concern | Change |
|---------|--------|
| `Admin/Pages/UnsTab.razor` | Tree component with drag-drop using **`MudBlazor.TreeView` + `MudBlazor.DropTarget`** (existing transitive dep — no new third-party package). Native HTML5 DnD rejected because virtualization + DnD on 500+ nodes doesn't combine reliably. Each drag fires a "Compute Impact" call carrying a `DraftRevisionToken`; modal preview ("Moving Line 'Oven-2' from 'Packaging' to 'Assembly' will re-home 14 equipment + re-parent 237 tags"). **Confirm step re-checks the token** and rejects with a `409 Conflict / refresh-required` modal if the draft advanced between preview and commit. |
| `Admin/Services/UnsImpactAnalyzer.cs` | New service. Given a move-operation (line move, area rename, line merge), computes cascade counts + `DraftRevisionToken` at preview time. Pure-function shape; testable in isolation. |
| `Admin/Pages/EquipmentTab.razor` | Add CSV-import button → modal with file picker + dry-run preview. **Identifier search** uses the canonical decision #117 set: `ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid`. Typeahead probes each column with a ranking query (exact match score 100 → prefix 50 → opt-in LIKE 20; published > draft tie-break). Result row shows which field matched via trailing badge. |
| `Admin/Services/EquipmentCsvImporter.cs` | New service. CSV header row must start with `# OtOpcUaCsv v1` (version marker — future shape changes bump the version). Columns: `ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid, Name, UnsAreaName, UnsLineName, Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. Parser rejects unknown columns + blank required fields + duplicate ZTags + missing UnsLines. |
| **Staged-import table** `EquipmentImportBatch` | New entity `{ Id, CreatedAtUtc, CreatedBy, RowsStaged, RowsAccepted, RowsRejected, FinalisedAtUtc? }` + child `EquipmentImportRow` records. Import writes rows in chunks to the staging table (not to `Equipment`). `FinaliseImportBatch` is the atomic finalize step that applies all accepted rows to `Equipment` + `ExternalIdReservation` in one transaction — short + bounded regardless of input size. Rollback = drop the batch row; `Equipment` never partially mutates. |
| `Admin/Pages/DraftEditor.razor` + `DiffViewer.razor` | Diff viewer refactored into a base component + section plugins: `StructuralDiffSection`, `EquipmentDiffSection`, `TagDiffSection`, `AclDiffSection` (Phase 6.2), `RedundancyDiffSection` (Phase 6.3), `IdentificationDiffSection`. Each section has a **1000-row hard cap**; over-cap renders an aggregate summary + "Load full diff" button streaming 500-row pages via SignalR. Subtree-rename diffs (decision #115 bulk restructure) surface as summary only by default. |
| `Admin/Components/IdentificationFields.razor` | New component. Renders the OPC 40010 field set **per decision #139**: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. `ProductInstanceUri / DeviceRevision / MonthOfConstruction` dropped from this phase — they need a separate decision-log widening. |
| `OtOpcUa.Server/OpcUa/DriverNodeManager` — Equipment folder build | When an `Equipment` row has non-null Identification fields, the server adds an `Identification` sub-folder under the Equipment node containing one variable per non-null field. **ACL binding**: the sub-folder + variables inherit the `Equipment` scope's grants from Phase 6.2's trie — no new scope level added. Documented in `acl-design.md` cross-reference update. |
## Scope — What Does NOT Change
| Item | Reason |
|------|--------|
| Admin UI visual language | Bootstrap 5 / cookie auth / sidebar layout unchanged — consistency with ScadaLink design reference. |
| LDAP auth flow | Already shipped in Phase 1. Phase 6.4 is additive UI only. |
| Core abstractions / driver layer | Admin UI changes don't touch drivers. |
| Equipment-class *template schema validation* | Still deferred (decision #112 — schemas repo not landed). We expose the Identification fields but don't validate against a template hierarchy. |
| Drag/move to *other clusters* | Out of scope — equipment is cluster-scoped per decision #82. Cross-cluster migration is a different workflow. |
## Entry Gate Checklist
- [ ] Phase 6.2 merged (ACL grants are part of the new diff viewer sections)
- [ ] Phase 6.3 merged (redundancy-role changes are part of the diff viewer)
- [ ] `phase-1-configuration-and-admin-scaffold.md` §Stream E completion checklist re-read — confirm these are the remaining items
- [ ] `admin-ui.md` re-skimmed for screen layouts
- [ ] Existing `EquipmentTab.razor` / `UnsTab.razor` / `DraftEditor.razor` diff'd against what ships today so the edits are additive not destructive
- [ ] Dev Galaxy available for OPC 40010 exposure smoke testing
## Task Breakdown
### Stream A — UNS drag/reorder + impact preview (5 days)
1. **A.1** 1000-node synthetic seed fixture. Drag-latency bench against `MudBlazor.TreeView` + `MudBlazor.DropTarget` — commit to the component if latency budget (100 ms drag-enter feedback) holds; fall back to flat-list reorder UI (Area/Line dropdowns) with loss of visual drag affordance otherwise.
2. **A.2** `UnsImpactAnalyzer` service. Inputs: `(DraftGenerationId, MoveOperation, DraftRevisionToken)`. Outputs: `ImpactPreview { AffectedEquipmentCount, AffectedTagCount, CascadeWarnings[], DraftRevisionToken }`. Pure-function shape; testable in isolation.
3. **A.3** Modal preview wired to `UnsImpactAnalyzer`. **Confirm** re-reads the current draft revision + compares against the preview's token; if the draft advanced (another operator saved a different edit), show a `409 Conflict / refresh-required` modal rather than silently overwriting.
4. **A.4** Cross-cluster drop attempts: target disabled + toast "Equipment is cluster-scoped (decision #82). To move across clusters, use Export → Import on the Cluster detail page." Plus help link.
5. **A.5** Playwright (or equivalent) smoke test: drag a line across areas, assert modal shows right counts, assert draft row reflects the move; concurrent-edit test runs two sessions + asserts the later Confirm hits the 409.
### Stream B — Equipment CSV import + 5-identifier search (5 days)
1. **B.1** `EquipmentCsvImporter`. Strict RFC 4180 parser (per decision #95). Header row validation: first line must match `# OtOpcUaCsv v1` — future versions fork parser versions. Required columns: `ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid, Name, UnsAreaName, UnsLineName`. Optional: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. Parser rejects unknown columns + blank required fields + duplicate ZTags.
2. **B.2** `EquipmentImportBatch` + `EquipmentImportRow` staging tables (migration). Import writes preview rows to staging via chunked inserts; staging never blocks `Equipment` or `ExternalIdReservation`. Preview query reads staging + validates each row against the current `Equipment` state + `ExternalIdReservation` freshness.
3. **B.3** `ImportPreview` UI — per-row accept/reject table. Reject reasons: "ZTag already exists in draft", "ExternalIdReservation conflict with Cluster X", "UnsLineName not found in draft UNS tree", etc. Operator reviews + clicks "Commit".
4. **B.4** `FinaliseImportBatch` — atomic finalize. One EF transaction applies accepted rows to `Equipment` + `ExternalIdReservation`; duration bounded regardless of input size (the atomic step is a bulk-insert, not per-row row-by-row). Rollback = drop batch row via `DropImportBatch`; `Equipment` never partially mutates.
5. **B.5** Five-identifier search. Rank SQL: exact match any identifier = score 100, prefix match = 50, LIKE-fuzzy (opt-in via `?fuzzy=true`) = 20; tie-break `published > draft` then `RowVersion DESC`. Typeahead shows which field matched via trailing badge.
6. **B.6** Smoke tests: 100-row CSV with 10 conflicts (5 ZTag dupes, 3 reservation clashes, 2 missing UnsLines); 10k-row perf test asserting finalize txn < 30 s; concurrent import + external `ExternalIdReservation` insert test asserts retryable-conflict handling.
### Stream C — Diff viewer enhancements (4 days)
1. **C.1** Refactor `DiffViewer.razor` into a base component + section plugins. Plugins: `StructuralDiffSection` (UNS tree), `EquipmentDiffSection`, `TagDiffSection`, `AclDiffSection` (Phase 6.2), `RedundancyDiffSection` (Phase 6.3), `IdentificationDiffSection`.
2. **C.2** Each section renders collapsed by default; counts + top-line summary always visible. **1000-row hard cap** per section — over-cap sections render aggregate summary (e.g. "237 equipment re-parented from Packaging to Assembly") with a "Load full diff" button that streams 500-row pages via SignalR.
3. **C.3** Subtree-rename diffs (decision #115 bulk restructure) surface as summary only by default regardless of row count.
4. **C.4** Tests: seed two generations with deliberate diffs; assert every section reports the right counts + top-line summary + hard-cap behavior.
### Stream D — OPC 40010 Identification exposure (3 days)
1. **D.1** `IdentificationFields.razor` component. Renders the **9 decision #139 fields**: `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. Labelled inputs; nullable columns show empty input; required-field validation on commit only.
2. **D.2** `DriverNodeManager` equipment-folder builder — after building the equipment node, inspect the 9 Identification columns; if any non-null, add an `Identification` sub-folder with variable-per-non-null-field. ACL binding: sub-folder + variables inherit the **same `ScopeId` as the Equipment node** (Phase 6.2's trie treats them as part of the Equipment scope — no new scope level).
3. **D.3** Address-space smoke test via Client.CLI: browse an equipment node, assert `Identification` sub-folder present when columns are set, absent when all null, variables match the field values.
4. **D.4** ACL integration test: a user with Equipment-level grant reads the `Identification` variables without needing a separate grant; a user without the Equipment grant gets `BadUserAccessDenied` on both the Equipment node + its Identification variables.
## Compliance Checks (run at exit gate)
- [ ] **UNS drag/move**: drag a line across areas; modal preview shows correct impacted-equipment + impacted-tag counts.
- [ ] **Concurrent-edit safety**: two-session test — session B saves a draft edit after session A opened the preview; session A's Confirm returns `409 Conflict / refresh-required` instead of overwriting.
- [ ] **Cross-cluster drop**: dropping equipment across cluster boundaries is disabled + shows actionable toast pointing to Export/Import workflow.
- [ ] **1000-node tree**: drag operations on a 1000-node seed maintain < 100 ms drag-enter feedback.
- [ ] **CSV header version**: file missing `# OtOpcUaCsv v1` first line is rejected pre-parse.
- [ ] **CSV canonical identifier set**: columns match decision #117 (ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid); drift from the earlier draft surfaces as a test failure.
- [ ] **Staged-import atomicity**: `FinaliseImportBatch` transaction bounded < 30 s for a 10k-row import; pre-finalize stagings visible only to the importing user; rollback via `DropImportBatch`.
- [ ] **Concurrent import + external reservation**: concurrent test — third party inserts to `ExternalIdReservation` mid-finalize; finalize retries with conflict handling; no corruption.
- [ ] **5-identifier search ranking**: exact matches outrank prefix matches; published outranks draft for equal scores.
- [ ] **Diff viewer section caps**: 2000-row subtree-rename diff renders as summary only; "Load full diff" streams in pages.
- [ ] **OPC 40010 field list match**: rendered field group matches decision #139 exactly; no extra fields.
- [ ] **OPC 40010 exposure**: Client.CLI browse shows `Identification` sub-folder when equipment has non-null columns; absent when all null.
- [ ] **ACL inheritance for Identification**: integration test — Equipment-grant user reads Identification; no-grant user gets `BadUserAccessDenied` on both.
- [ ] **Visual parity reviewer**: named role (`FleetAdmin` user, not the implementation lead) compares side-by-side against `admin-ui.md` §Visual-Design reference panels; signoff artefact is a checked-in screenshot set under `docs/v2/visual-compliance/phase-6-4/`.
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|:----------:|:------:|------------|
| UNS drag-drop janky on large trees (>500 nodes) | Medium | Medium | Virtualize the tree component; default-collapse nested areas; test with a synthetic 1000-equipment seed |
| CSV import performance on 10k-row imports | Medium | Medium | Stream-parse rather than load-into-memory; preview renders in batches of 100; commit is chunked-EF-insert with progress bar |
| Diff viewer becomes unwieldy with many sections | Low | Medium | Each section collapsed by default; top-line summary row always shown; Phase 6.4 caps at 6 sections |
| OPC 40010 sub-folder accidentally exposes NULL/empty identification columns as empty-string variables | Low | Low | Column null-check in the builder; drop variables whose DB value is null |
| 5-identifier search pulls full table | Medium | Medium | Indexes on each of ZTag/SAPID/UniqueId/Alias1/Alias2; search query uses a UNION of 5 indexed lookups; falls back to LIKE only on explicit operator opt-in |
## Completion Checklist
- [ ] Stream A: `UnsImpactAnalyzer` + drag-drop tree + modal preview + Playwright smoke
- [ ] Stream B: `EquipmentCsvImporter` + preview modal + 5-identifier search + conflict-rollback test
- [ ] Stream C: `DiffViewer` refactor + 6 section plugins + 2-generation diff test
- [ ] Stream D: `IdentificationFields.razor` + address-space builder change + Client.CLI browse test
- [ ] Visual-compliance reviewer signoff
- [ ] Full solution `dotnet test` passes; `phase-6-4-compliance.ps1` exits 0; exit-gate doc
## Adversarial Review — 2026-04-19 (Codex, via `codex-rescue` subagent)
1. **Crit · ACCEPT** — Stale UNS impact preview can overwrite concurrent draft edits. **Change**: each preview carries a `DraftRevisionToken`; `Confirm` compares against the current draft + rejects with a `409 Conflict / refresh-required` modal if any draft edit landed since the preview was generated. Stream A.3 updated.
2. **High · ACCEPT** — CSV import atomicity is internally contradictory (single EF transaction vs. chunked inserts). **Change**: one explicit model — staged-import table (`EquipmentImportBatch { Id, CreatedAtUtc, RowsStaged, RowsAccepted, RowsRejected }`) receives rows in chunks; final `FinaliseImportBatch` is atomic over `Equipment` + `ExternalIdReservation`. Rollback is "drop the batch row" — the real Equipment table is never partially mutated.
3. **Crit · ACCEPT** — Identifier contract rewrite mis-cites decisions. **Change**: revert to the `admin-ui.md` + decision #117 canonical set — `ZTag / MachineCode / SAPID / EquipmentId / EquipmentUuid`. CSV header follows that set verbatim. Introduce a separate decision entry for versioned CSV header shape before adding any new column; CSV header row must start with `# OtOpcUaCsv v1` so future shape changes are unambiguous.
4. **Med · ACCEPT** — Search ordering undefined. **Change**: rank SQL — exact match on any identifier scores 100; prefix match 50; LIKE-fuzzy 20; published > draft tie-breaker; `ORDER BY score DESC, RowVersion DESC`. Typeahead shows which field matched via trailing badge.
5. **High · ACCEPT** — HTML5 DnD on virtualized tree is aspirational. **Change**: Stream A.2 rewritten — commits to **`MudBlazor.TreeView` + `MudBlazor.DropTarget`** (already a transitive dep via the existing Admin UI). Build a 1000-node synthetic seed in A.1 + validate drag-latency budget before implementing impact preview. If MudBlazor can't hit the budget, fall back to a flat-list reorder UI with Area/Line dropdowns (loss of visual drag affordance but unblocks the feature).
6. **Med · ACCEPT** — Collapsed-by-default doesn't handle generation-sized diffs. **Change**: each diff section has a hard row cap (1000 by default). Over-cap sections render an aggregate summary + "Load full diff" button that streams via SignalR in 500-row pages. Decision #115 subtree renames surface as a "N equipment re-parented under X → Y" summary instead of row-by-row.
7. **High · ACCEPT** — OPC 40010 field list doesn't match decision #139. **Change**: field group realigned to `Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri`. `ProductInstanceUri / DeviceRevision / MonthOfConstruction` dropped from Phase 6.4 — they belong to a future OPC 40010 widening decision.
8. **High · ACCEPT**`Identification` subtree unreconciled with ACL hierarchy (Phase 6.2 6-level scope). **Change**: address-space builder creates the Identification sub-folder under the Equipment node **with the same ScopeId as Equipment** — no new scope level. ACL evaluator treats `…/Equipment/Identification/X` as inheriting the `Equipment` scope's grants. Documented in Phase 6.2's `acl-design.md` cross-reference update.
9. **Low · ACCEPT** — Visual-review gate names nonexistent reviewer role. **Change**: rubric defined — a named "Admin UX reviewer" (role `FleetAdmin` user, not the implementation lead) compares side-by-side screenshots against the `admin-ui.md` §Visual-Design reference panels; signoff artefact is a checked-in screenshot set under `docs/v2/visual-compliance/phase-6-4/`.
10. **Med · ACCEPT** — Cross-cluster drag/drop lacks loud failure path. **Change**: on drop across cluster boundary, disable the drop target + show a toast "Equipment is cluster-scoped (decision #82). To move across clusters, use the Export → Import workflow on the Cluster detail page." Plus a help link. Tested in Stream A.4.

451
docs/v2/mitsubishi.md Normal file
View File

@@ -0,0 +1,451 @@
# Mitsubishi Electric MELSEC — Modbus TCP quirks
Mitsubishi's MELSEC family speaks Modbus TCP through a patchwork of add-on modules
and built-in Ethernet ports, not a single unified stack. The module names are
confusingly similar (`QJ71MB91` is *serial* RTU, `QJ71MT91` is the TCP/IP module
[9]; `LJ71MT91` is the L-series equivalent; `RJ71EN71` is the iQ-R Ethernet module
with a MODBUS/TCP *slave* mode bolted on [8]; `FX3U-ENET`, `FX3U-ENET-P502`,
`FX3U-ENET-ADP`, `FX3GE` built-in, and `FX5U` built-in are all different code
paths) — and every one of the categories below has at least one trap a textbook
Modbus client gets wrong: hex-numbered X/Y devices colliding with decimal Modbus
addresses, a user-defined "device assignment" parameter block that means *no two
sites are identical*, CDAB-vs-ABCD word order driven by how the ladder built the
32-bit value, sub-spec FC16 caps on the older QJ71MT91, and an FX3U port-502
licensing split that makes `FX3U-ENET` and `FX3U-ENET-P502` different SKUs.
This document catalogues each quirk, cites primary sources, and names the
ModbusPal integration test we'd write for it (convention from
`docs/v2/modbus-test-plan.md`: `Mitsubishi_<model>_<behavior>`).
## Models and server/client capability
| Model | Family | Modbus TCP server | Modbus TCP client | Source |
|------------------------|----------|-------------------|-------------------|--------|
| `QJ71MT91` | MELSEC-Q | Yes (slave) | Yes (master) | [9] |
| `QJ71MB91` | MELSEC-Q | **Serial only** — RS-232/422/485 RTU, *not TCP* | — | [1][3] |
| `LJ71MT91` | MELSEC-L | Yes (slave) | Yes (master) | [10] |
| `RJ71EN71` / `RnENCPU` | MELSEC iQ-R | Yes (slave) | Yes (master) | [8] |
| `RJ71C24` / `RJ71C24-R2` | MELSEC iQ-R | RTU (serial) | RTU (serial) | [13] |
| iQ-R built-in Ethernet | CPU | Yes (slave) | Yes (master) | [7] |
| iQ-F `FX5U` built-in Ethernet | CPU | Yes, firmware ≥ 1.060 [11] | Yes | [7][11][12] |
| `FX3U-ENET` | FX3U bolt-on | Yes (slave), but **not on port 502** [5] | Yes | [4][5] |
| `FX3U-ENET-P502` | FX3U bolt-on | Yes (slave), port 502 enabled | Yes | [5] |
| `FX3U-ENET-ADP` | FX3U adapter | **No MODBUS** [5] | No MODBUS | [5] |
| `FX3GE` built-in | FX3GE CPU | No MODBUS (needs ENET module) [6] | No | [6] |
| `FX3G` + `FX3U-ENET` | FX3G | Yes via ENET module | Yes | [6] |
- A common integration mistake is to buy `FX3U-ENET-ADP` expecting MODBUS —
that adapter speaks only MC protocol / SLMP. Our driver should surface a clear
capability error, not "connection refused", when the operator's device tag
says `FX3U-ENET-ADP` [5].
- Older forum threads assert the FX5U is "client only" [12] — that was true on
firmware ≤ 1.040. Firmware 1.060 and later ship the parameter-driven MODBUS
TCP server built-in and need no function blocks [11].
## Modbus device assignment (the parameter block)
Unlike a DL260 where the CPU exposes a *fixed* V-memory-to-Modbus mapping, every
MELSEC MODBUS-TCP module exposes a **Modbus Device Assignment Parameter** block
that the engineer configures in GX Works2 / GX Configurator-MB / GX Works3.
Each of the four Modbus tables (Coil, Input, Input Register, Holding Register)
can be split into up to 16 independent "assignment" entries, each binding a
contiguous Modbus address range to a MELSEC device head (`M0`, `D0`, `X0`,
`Y0`, `B0`, `W0`, `SM0`, `SD0`, `R0`, etc.) and a point count [3][7][8][9].
- **There is no canonical "MELSEC Modbus mapping"**. Two sites running the same
QJ71MT91 module can expose completely different Modbus layouts. Our driver
must treat the mapping as site-data (config-file-driven), not as a device
profile constant.
- **Default values do exist** — both GX Configurator-MB (for Q/L series) and
GX Works3 (for iQ-R / iQ-F / FX5) ship a "dedicated pattern" default that is
applied when the engineer does not override the assignment. Per the FX5
MODBUS Communication manual (JY997D56101) and the QJ71MT91 manual, the FX5
dedicated default is [3][7][11]:
| Modbus table | Modbus range (0-based) | MELSEC device | Head |
|--------------------|------------------------|---------------|------|
| Coil (FC01/05/15) | 0 7679 | M | M0 |
| Coil | 8192 8959 | Y | Y0 |
| Input (FC02) | 0 7679 | M | M0 |
| Input | 8192 8959 | X | X0 |
| Input Register (FC04) | 0 6143 | D | D0 |
| Holding Register (FC03/06/16) | 0 6143 | D | D0 |
This matches the widely circulated "FC03 @ 0 = D0" convention that shows up
in Ubidots / Ignition / AdvancedHMI integration guides [6][12].
- **X/Y in the default mapping occupy a second, non-zero Modbus range** (8192+
on FX5; similar on Q/L/iQ-R). Driver users who expect "X0 = coil 0" will be
reading M0 instead. Document this clearly.
- **Assignment-range collisions silently disable the slave.** The QJ71MT91
manual states explicitly that if any two of assignments 1-16 duplicate the
head Modbus device number, the slave function is inactive with no clear
error — the module just won't respond [9]. The driver probe will look like a
simple timeout; the site engineer has to open GX Configurator-MB to diagnose.
Test names:
`Mitsubishi_FX5U_default_mapping_coil_0_is_M0`,
`Mitsubishi_FX5U_default_mapping_holding_0_is_D0`,
`Mitsubishi_QJ71MT91_duplicate_assignment_head_disables_slave`.
## X/Y addressing — hex on MELSEC, decimal on Modbus
**MELSEC X (input) and Y (output) device numbers are hexadecimal on Q / L /
iQ-R** and **octal** on FX / iQ-F (with a GX Works3 toggle) [14][15].
- On a Q CPU, `X20` means decimal **32**, not 20. On an FX5U in default (octal)
mode, `X20` means decimal **16**. GX Works3 exposes a project-level option to
display FX5U X/Y in hex to match Q/L/iQ-R convention — the same physical
input is then called `X10` [14].
- The Modbus Device Assignment Parameter block takes the *head device* as a
MELSEC-native number, which is interpreted in the CPU's native base
(hex for Q/L/iQ-R, octal for FX/iQ-F). After that, **Modbus offsets from
the head are plain decimal** — the module does not apply a second hex
conversion [3][9].
- Example (QJ71MT91 on a Q CPU): assignment "Coil 0 = X0, 512 points" exposes
physical `X0` through `X1FF` (hex) as coils 0-511. A client reading coil 32
gets the bit `X20` (hex) — i.e. the 33rd input, not the value at "input 20"
that the operator wrote on the wiring diagram in decimal.
- **Driver bug source**: if the operator's tag configuration says "read X20" and
the driver helpfully converts "20" to decimal 20 → coil offset 20, the
returned bit is actually `X14` (hex) — off by twelve. Our config layer must
preserve the MELSEC-native base that the site engineer sees in GX Works.
- Timers/counters (`T`, `C`, `ST`) are always decimal in MELSEC notation.
Internal relays (`M`, `B`, `L`), data registers (`D`, `W`, `R`, `ZR`),
and special relays/registers (`SM`, `SD`) also decimal. **Only `X` and `Y`
(and on Q/L/iQ-R, `B` link relays and `W` link registers) use hex**, and
the X/Y decision is itself family-dependent [14][15].
Test names:
`Mitsubishi_Q_X_address_is_hex_X20_equals_coil_offset_32`,
`Mitsubishi_FX5U_X_address_is_octal_X20_equals_coil_offset_16`,
`Mitsubishi_W_link_register_is_hex_W10_equals_holding_offset_16`.
## Word order for 32-bit values
MELSEC stores 32-bit ladder values (`DINT`, `DWORD`, `REAL` / single-precision
float) across **two consecutive D-registers, low word first** — i.e., `CDAB`
when viewed as a Modbus register pair [2][6].
```
D100 (low word) : 0xCC 0xDD (big-endian bytes within the word)
D101 (high word) : 0xAA 0xBB
```
A Modbus master reading D100/D101 as a `float` with default (ABCD) word order
gets garbage. Ignition's built-in Modbus driver notes Mitsubishi as a "CDAB
device" specifically for this reason [2].
- **Q / L / iQ-R / iQ-F all agree** — this is a CPU-level convention, not a
module choice. Both the QJ71MT91 manual and the FX5 MODBUS Communication
manual describe 32-bit access by "reading the lower 16 bits from the start
address and the upper 16 bits from start+1" [6][11].
- **Byte order within each register is big-endian** (Modbus standard). The
module does not byte-swap.
- **Configurable?** The MODBUS modules themselves do **not** expose a word-
order toggle; the behavior is fixed to how the CPU laid out the value in the
two D-registers. If the ladder programmer used an `SWAP` instruction or a
union-style assignment, the word order can be whatever they made it — but
for values produced by the standard `D→DBL` and `FLT`/`FLT2` instructions
it is always CDAB [2].
- **FX5U quirk**: the FX5 MODBUS Communication manual tells the programmer to
use the `SWAP` instruction *if* the remote Modbus peer requires
little-endian *byte* ordering (BADC) [11]. This is only relevant when the
FX5U is the Modbus *client*, but it confirms the FX5U's native wire layout
is big-endian-byte / little-endian-word (CDAB) on the server side too.
- **Rumoured exception**: a handful of MrPLC forum threads report iQ-R
RJ71EN71 firmware < 1.05 returning DWORDs in `ABCD` order when accessed via
the built-in Ethernet port's MODBUS slave [8]. _Unconfirmed_; treat as a
per-site test.
Test names:
`Mitsubishi_Float32_word_order_is_CDAB`,
`Mitsubishi_Int32_word_order_is_CDAB`,
`Mitsubishi_FX5U_SWAP_instruction_changes_byte_order_not_word_order`.
## BCD vs binary encoding
**MELSEC stores integer values in D-registers as plain binary two's-complement**,
not BCD [16]. This is the opposite of AutomationDirect DirectLOGIC, where
V-memory defaults to BCD and the ladder must explicitly request binary.
- A ladder `MOV K1234 D100` stores `0x04D2` (1234 decimal) in D100, not
`0x1234`. The Modbus master reads `0x04D2` and decodes it as an integer
directly — no BCD conversion needed [16].
- **Timer / counter current values** (`T0` current value, `C0` count) are
stored in binary as word devices on Q/L/iQ-R/iQ-F. The ladder preset
(`K...`) is also binary [16][17].
- **Timer / counter preset `K` operand in FX3U / earlier FX**: also binary when
loaded from a D-register or a `K` constant. The older A-series CPUs had BCD
presets on some timer types, but MELSEC-Q, L, iQ-R, iQ-F, and FX3U all use
binary presets by default [17].
- The FX3U programming manual dedicates `FNC 18 BCD` and `FNC 19 BIN` to
explicit conversion — their existence confirms that anything in D-registers
that came from a `BCD` instruction output is BCD, but nothing is BCD by
default [17].
- **7-segment display registers** are a common site-specific exception — many
ladders pack `BCD D100` into a D-register so the operator panel can drive
a display directly. Our driver should not assume; expose a per-tag
"encoding = binary | BCD" knob.
Test names:
`Mitsubishi_D_register_stores_binary_not_BCD`,
`Mitsubishi_FX3U_timer_current_value_is_binary`.
## Max registers per request
From the FX5 MODBUS Communication manual Chapter 11 [11]:
| FC | Name | FX5U (built-in) | QJ71MT91 | iQ-R (RJ71EN71 / built-in) | FX3U-ENET |
|----|----------------------------|-----------------|--------------|-----------------------------|-----------|
| 01 | Read Coils | 1-2000 | 1-2000 [9] | 1-2000 [8] | 1-2000 |
| 02 | Read Discrete Inputs | 1-2000 | 1-2000 | 1-2000 | 1-2000 |
| 03 | Read Holding Registers | **1-125** | 1-125 [9] | 1-125 [8] | 1-125 |
| 04 | Read Input Registers | 1-125 | 1-125 | 1-125 | 1-125 |
| 05 | Write Single Coil | 1 | 1 | 1 | 1 |
| 06 | Write Single Register | 1 | 1 | 1 | 1 |
| 0F | Write Multiple Coils | 1-1968 | 1-1968 | 1-1968 | 1-1968 |
| 10 | Write Multiple Registers | **1-123** | 1-123 | 1-123 | 1-123 |
| 16 | Mask Write Register | 1 | not supported | 1 | not supported |
| 17 | Read/Write Multiple Regs | R:1-125, W:1-121 | not supported | R:1-125, W:1-121 | not supported |
- **The FX5U / iQ-R native-port limits match the Modbus spec**: 125 for FC03/04,
123 for FC16 [11]. No sub-spec caps like DL260's 100-register ceiling.
- **QJ71MT91 does not support FC16 (0x16, Mask Write Register) or FC17
(0x17, Read/Write Multiple)** — requesting them returns exception `01`
Illegal Function [9]. FX5U and iQ-R *do* support both.
- **QJ71MT91 device size**: 64k points (65,536) for each of Coil / Input /
Input Register / Holding Register, plus up to 4086k points for Extended
File Register via a secondary assignment range [9].
- **FX3U-ENET / -P502 function code list is a strict subset** of the common
eight (FC01/02/03/04/05/06/0F/10). FC16 and FC17 not supported [4].
Test names:
`Mitsubishi_FX5U_FC03_126_registers_returns_IllegalDataValue`,
`Mitsubishi_FX5U_FC16_124_registers_returns_IllegalDataValue`,
`Mitsubishi_QJ71MT91_FC16_MaskWrite_returns_IllegalFunction`,
`Mitsubishi_QJ71MT91_FC23_ReadWrite_returns_IllegalFunction`.
## Exception codes
MELSEC MODBUS modules return **only the standard Modbus exception codes 01-04**;
no proprietary exception codes are exposed on the wire [8][9][11]. Module-
internal diagnostics (buffer-memory error codes like `7380H`) are logged but
not returned as Modbus exceptions.
| Code | Name | MELSEC trigger |
|------|----------------------|---------------------------------------------------------|
| 01 | Illegal Function | FC17 or FC16 on QJ71MT91/FX3U; FC08 (Diagnostics); FC43 |
| 02 | Illegal Data Address | Modbus address outside any assignment range |
| 03 | Illegal Data Value | Quantity out of per-FC range (see table above); odd coil-byte count |
| 04 | Server Device Failure | See below |
- **04 (Server Failure) triggers on MELSEC**:
- CPU in STOP or PAUSE during a write to an assignment whose "Access from
External Device" permission is set to "Disabled in STOP" [9][11].
*With the default "always enabled" setting the write succeeds in STOP
mode* — another common trap.
- CPU errors (parameter error, watchdog) during any access.
- Assignment points to a device range that is not configured (e.g. write
to `D16384` when CPU D-device size is 12288).
- **Write to a "System Area" device** (e.g., `SD` special registers that are
CPU-reserved read-only) returns `04`, not `02`, on QJ71MT91 and iQ-R — the
assignment is valid, the device exists, but the CPU rejects the write [8][9].
- **FX3U-ENET / -P502** returns `04` on any write attempt while the CPU is in
STOP, regardless of permission settings — the older firmware does not
implement the "Access from External Device" granularity that Q/L/iQ-R/iQ-F
expose [4].
- **No rumour of proprietary codes 05-0B** from MELSEC; operators sometimes
report "exception 0A" but those traces all came from a third-party gateway
sitting between the master and the MELSEC module.
Test names:
`Mitsubishi_QJ71MT91_STOP_mode_write_with_Disabled_permission_returns_ServerFailure`,
`Mitsubishi_QJ71MT91_STOP_mode_write_with_default_permission_succeeds`,
`Mitsubishi_SD_system_register_write_returns_ServerFailure`,
`Mitsubishi_FX3U_STOP_mode_write_always_returns_ServerFailure`.
## Connection behavior
Max simultaneous Modbus TCP clients, per module [7][8][9][11]:
| Model | Max TCP connections | Port 502 | Keepalive | Source |
|----------------------|---------------------|----------|-----------|--------|
| `QJ71MT91` | 16 (shared with master role) | Yes | No | [9] |
| `LJ71MT91` | 16 | Yes | No | [10] |
| iQ-R built-in / `RJ71EN71` | 16 | Yes | Configurable (KeepAlive = ON in parameter) | [8] |
| iQ-F `FX5U` built-in | 8 | Yes | Configurable | [7][11] |
| `FX3U-ENET` | 8 TCP, but **not port 502** | No (port < 1024 blocked) | No | [4][5] |
| `FX3U-ENET-P502` | 8, port 502 enabled | Yes | No | [5] |
- **QJ71MT91's 16 is total connections shared between slave-listen and
master-initiated sockets** [9]. A site that uses the same module as both
master to downstream VFDs and slave to upstream SCADA splits the 16 pool.
- **FX3U-ENET port-502 gotcha**: if the engineer loads a configuration with
port 502 into a non-P502 ENET module, GX Works shows the download as
successful; on next power cycle the module enters error state and the
MODBUS listener never starts. This is documented on third-party FX3G
integration guides [6].
- **CPU STOP → RUN transition**: does **not** drop Modbus connections on any
MELSEC family. Existing sockets stay open; outstanding requests during the
transition may see exception 04 for a few scans but then resume [8][9].
- **CPU reset (power cycle or `SM1255` forced reset)** drops all Modbus
connections and the module re-listens after typically 5-10 seconds.
- **Idle timeout**: QJ71MT91 and iQ-R have a per-connection "Alive-Check"
(idle timer) parameter, default 0 (disabled). If enabled, default 10 s
probe interval, 3 retries before close [8][9]. FX5U similar defaults.
- **Keep-alive (TCP-level)**: only iQ-R / iQ-F expose a TCP keep-alive option
(parameter "KeepAlive" in the Ethernet settings); QJ71MT91 and FX3U-ENET
do not — so NAT/firewall idle drops require driver-side pinging.
Test names:
`Mitsubishi_QJ71MT91_17th_connection_refused`,
`Mitsubishi_FX5U_9th_connection_refused`,
`Mitsubishi_STOP_to_RUN_transition_preserves_socket`,
`Mitsubishi_CPU_reset_closes_all_sockets`.
## Behavioral oddities
- **Transaction ID echo**: QJ71MT91 and iQ-R reliably echo the MBAP TxId on
every response across firmware revisions; no reports of TxId drops under
load [8][9]. FX3U-ENET has an older, less-tested TCP stack; at least one
MrPLC thread reports out-of-order TxId echoes under heavy polling on
firmware < 1.14 [4]. _Unconfirmed_ on current firmware.
- **Per-connection request serialization**: all MELSEC slaves serialize
requests within a single TCP connection — a new request is not processed
until the prior response has been sent. Pipelining multiple requests on one
socket causes the module to queue them in buffer memory and respond in
order, but **the queue depth is 1** on QJ71MT91 (a second in-flight request
is held on the TCP receive buffer, not queued) [9]. Driver should treat
Mitsubishi slaves as strictly single-flight per socket.
- **Partial-frame handling**: QJ71MT91 and iQ-R close the socket on malformed
MBAP length fields. FX5U resynchronises at the next valid MBAP header
within 100 ms but will emit an error to `SD` diagnostics [11]. Driver must
reconnect on half-close and replay.
- **FX3U UDP vs TCP**: `FX3U-ENET` supports both UDP and TCP MODBUS transports;
UDP is lossy and reorders under load. Default is TCP. Some legacy SCADA
configurations pinned the module to UDP for multicast discovery — do not
select UDP unless the site requires it [4].
- **Known firmware-revision variants**:
- QJ71MT91 ≤ firmware 10052000000 (year-month format): FC15 with coil
count that forces byte-count to an odd value silently truncates the
last coil. Fixed in later revisions [9]. _Operator-reported_.
- FX5U firmware < 1.060: no native MODBUS TCP server — only accessible via
a predefined-protocol function block hack. Firmware ≥ 1.060 ships
parameter-based server. Our capability probe should read `SD203`
(firmware version) and flag < 1.060 as unsupported for server mode [11][12].
- iQ-R RJ71EN71 early firmware: possible ABCD word order (rumoured,
unconfirmed) [8].
- **SD (special-register) reads during assignment-parameter load**: while
the CPU is loading a new MODBUS device assignment parameter (~1-2 s), the
slave returns exception 04 Server Failure on every request. Happens after
a parameter write from GX Configurator-MB [9].
- **iQ-R "Station-based block transfer" collision**: if the RJ71EN71 is also
running CC-Link IE Control on the same module, a MODBUS/TCP request that
arrives during a CCIE cyclic period is delayed to the next scan — visible
as jittery response time, not a failure [8].
Test names:
`Mitsubishi_QJ71MT91_single_flight_per_socket`,
`Mitsubishi_FX5U_malformed_MBAP_resync_within_100ms`,
`Mitsubishi_FX3U_TxId_preserved_across_burst`,
`Mitsubishi_FX5U_firmware_below_1_060_reports_no_server_mode`.
## Model-specific differences for test coverage
Summary of which quirks differ per model, so test-class naming can reflect them:
| Quirk | QJ71MT91 | LJ71MT91 | iQ-R (RJ71EN71 / built-in) | iQ-F (FX5U) | FX3U-ENET(-P502) |
|------------------------------------------|----------|----------|----------------------------|-------------|------------------|
| FC16 Mask-Write supported | No | No | Yes | Yes | No |
| FC17 Read/Write Multiple supported | No | No | Yes | Yes | No |
| Max connections | 16 | 16 | 16 | 8 | 8 |
| X/Y numbering base | hex | hex | hex | octal (default) | octal |
| 32-bit word order | CDAB | CDAB | CDAB (firmware-dependent rumour of ABCD) | CDAB | CDAB |
| Port 502 supported | Yes | Yes | Yes | Yes | P502 only |
| STOP-mode write permission configurable | Yes | Yes | Yes | Yes | No (always blocks) |
| TCP keep-alive parameter | No | No | Yes | Yes | No |
| Modbus device assignment — max entries | 16 | 16 | 16 | 16 | 8 |
| Server via parameter (no FB) | Yes | Yes | Yes | Yes (fw ≥ 1.060) | Yes |
- **Test file layout**: `Mitsubishi_QJ71MT91_*`, `Mitsubishi_LJ71MT91_*`,
`Mitsubishi_iQR_*`, `Mitsubishi_FX5U_*`, `Mitsubishi_FX3U_ENET_*`,
`Mitsubishi_FX3U_ENET_P502_*`. iQ-R built-in Ethernet and the RJ71EN71
behave identically for MODBUS/TCP slave purposes and can share a file
`Mitsubishi_iQR_*`.
- **Cross-model shared tests** (word order CDAB, binary not BCD, standard
exception codes, 125-register FC03 cap) can live in a single
`Mitsubishi_Common_*` fixture.
## References
1. Mitsubishi Electric, *MODBUS Interface Module User's Manual — QJ71MB91*
(SH-080578ENG), RS-232/422/485 MODBUS RTU serial module for MELSEC-Q —
https://dl.mitsubishielectric.com/dl/fa/document/manual/plc/sh080578eng/sh080578engk.pdf
2. Inductive Automation, *Ignition Modbus Driver — Mitsubishi Q / iQ-R word
order*, documents CDAB convention —
https://docs.inductiveautomation.com/docs/8.1/ignition-modules/opc-ua/drivers/modbus-v2
and forum discussion https://forum.inductiveautomation.com/t/modbus-tcp-device-word-byte-order/65984
3. Mitsubishi Electric, *Programmable Controller User's Manual QJ71MB91 MODBUS
Interface Module*, Chapter 7 "Parameter Setting" describing the Modbus
Device Assignment Parameter block (assignments 1-16, head-device
configuration) —
https://www.lcautomation.com/dbdocument/29156/QJ71MB91%20Users%20manual.pdf
4. Mitsubishi Electric, *FX3U-ENET User's Manual* (JY997D18101), Chapter on
MODBUS/TCP communication; function code support and connection limits —
https://dl.mitsubishielectric.com/dl/fa/document/manual/plc_fx/jy997d18101/jy997d18101h.pdf
5. Venus Automation, *Mitsubishi FX3U-ENET-P502 Module — Open Port 502 for
Modbus TCP/IP* —
https://venusautomation.com.au/mitsubishi-fx3u-enet-p502-module-open-port-502-for-modbus-tcp-ip/
and FX3U-ENET-ADP user manual (JY997D45801), which confirms the -ADP
variant does not support MODBUS —
https://dl.mitsubishielectric.com/dl/fa/document/manual/plc_fx/jy997d45801/jy997d45801h.pdf
6. XML Control / Ubidots integration notes, *FX3G Modbus* — port-502 trap,
D-register mapping default, word order reference —
https://sites.google.com/site/xmlcontrol/archive/fx3g-modbus
and https://ubidots.com/blog/mitsubishi-plc-as-modbus-tcp-server/
7. FA Support Me, *Modbus TCP on Built-in Ethernet port in iQ-F and iQ-R*
confirms 16-connection limit on iQ-R, 8 on iQ-F, parameter-driven
configuration via GX Works3 —
https://www.fasupportme.com/portal/en/kb/articles/modbus-tcp-on-build-in-ethernet-port-in-iq-f-and-iq-r-en
8. Mitsubishi Electric, *MELSEC iQ-R Ethernet User's Manual (Application)*
(SH-081259ENG) and *MELSEC iQ-RJ71EN71 User's Manual* Chapter on
"Communications Using Modbus/TCP" —
https://www.allied-automation.com/wp-content/uploads/2015/02/MITSUBISHI_manual_plc_iq-r_ethernet_users.pdf
and https://www.manualslib.com/manual/1533351/Mitsubishi-Electric-Melsec-Iq-Rj71en71.html?page=109
9. Mitsubishi Electric, *MODBUS/TCP Interface Module User's Manual — QJ71MT91*
(SH-080446ENG), exception codes page 248, device assignment parameter
pages 116-124, duplicate-assignment-disables-slave note —
https://dl.mitsubishielectric.com/dl/fa/document/manual/plc/sh080446eng/sh080446engj.pdf
10. Mitsubishi Electric, *MELSEC-L Network Features* — LJ71MT91 documented as
L-series equivalent of QJ71MT91 with identical MODBUS/TCP behavior —
https://us.mitsubishielectric.com/fa/en/products/cnt/programmable-controllers/melsec-l-series/network/features/
11. Mitsubishi Electric, *MELSEC iQ-F FX5 User's Manual (MODBUS Communication)*
(JY997D56101), Chapter 11 "Modbus/TCP Communication Specifications" —
function code max-quantity table, frame specification, device assignment
defaults —
https://dl.mitsubishielectric.com/dl/fa/document/manual/plcf/jy997d56101/jy997d56101h.pdf
12. MrPLC forum, *FX5U Modbus-TCP Server (Slave)*, firmware ≥ 1.60 enables
native server via parameter; earlier firmware required function block —
https://mrplc.com/forums/topic/31883-fx5u-modbus-tcp-server-slave/
and Industrial Monitor Direct's "FX5U MODBUS TCP Server Workaround"
article (reflects older firmware behavior) —
https://industrialmonitordirect.com/blogs/knowledgebase/mitsubishi-fx5u-modbus-tcp-server-configuration-workaround
13. Mitsubishi Electric, *MELSEC iQ-R MODBUS and MODBUS/TCP Reference Manual —
RJ71C24 / RJ71C24-R2* (BCN-P5999-1060) — RJ71C24 is serial RTU only,
not TCP —
https://dl.mitsubishielectric.com/dl/fa/document/manual/plc/bcn-p5999-1060/bcnp59991060b.pdf
14. HMS Industrial Networks, *eWON and Mitsubishi FX5U PLC* (KB-0264-00) —
documents that FX5U X/Y are octal in GX Works3 but hex when viewed as a
Q-series PLC through eWON; the project-level hex/octal toggle —
https://hmsnetworks.blob.core.windows.net/www/docs/librariesprovider10/downloads-monitored/manuals/knowledge-base/kb-0264-00-en-ewon-and-mitsubishi-fx5u-plc.pdf
15. Fernhill Software, *Mitsubishi Melsec PLC Data Address* — documents
hex-vs-octal device numbering split across MELSEC families —
https://www.fernhillsoftware.com/help/drivers/mitsubishi-melsec/data-address-format.html
16. Inductive Automation support, *Understanding Mitsubishi PLCs* — D registers
store signed 16-bit binary, not BCD; DINT combines two consecutive D
registers —
https://support.inductiveautomation.com/hc/en-us/articles/16517576753165-Understanding-Mitsubishi-PLCs
17. Mitsubishi Electric, *FXCPU Structured Programming Manual [Device &
Common]* (JY997D26001) — FNC 18 BCD and FNC 19 BIN explicit-conversion
instructions confirm binary-by-default storage —
https://dl.mitsubishielectric.com/dl/fa/document/manual/plc_fx/jy997d26001/jy997d26001l.pdf

View File

@@ -909,6 +909,26 @@ Each step leaves the system runnable. The generic extraction is effectively free
| 140 | Enterprise shortname = `zb` (UNS level-1 segment) | Closes corrections-doc D4. Matches the existing `ZB.MOM.WW.*` namespace prefix used throughout the codebase; short by design since this segment appears in every equipment path (`zb/warsaw-west/bldg-3/line-2/cnc-mill-05/RunState`); operators already say "ZB" colloquially. Admin UI cluster-create form default-prefills `zb` for the Enterprise field. Production deployments use it directly from cluster-create | 2026-04-17 |
| 141 | Tier 3 (AppServer IO) cutover is feasible — AVEVA's OI Gateway supports arbitrary upstream OPC UA servers as a documented pattern | Closes corrections-doc E2 with **GREEN-YELLOW** verdict. Multiple AVEVA partners (Software Toolbox, InSource) have published working integrations against four different non-AVEVA upstream servers (TOP Server, OPC Router, OmniServer, Cogent DataHub). No re-architecting of OtOpcUa required. Path: `OPC UA node → OI Gateway → SuiteLink → $DDESuiteLinkDIObject → AppServer attribute`. Recommended AppServer floor: System Platform 2023 R2 Patch 01. Two integrator-burden risks tracked: validation/GxP paperwork (no AVEVA blueprint exists for non-AVEVA upstream servers in Part 11 deployments) and unpublished scale benchmarks (in-house benchmark required before cutover scheduling). See `aveva-system-platform-io-research.md` | 2026-04-17 |
| 142 | Phase 1 acceptance includes an end-to-end AppServer-via-OI-Gateway smoke test against OtOpcUa | Catches AppServer-specific quirks (cert exchange via reject-and-trust workflow, endpoint URL must NOT include `/discovery` suffix per Inductive Automation forum failure mode, service-account install required because OI Gateway under SYSTEM cannot connect to remote OPC servers, `Basic256Sha256` + `SignAndEncrypt` + LDAP-username token combination must work end-to-end) early — well before the Year 3 tier-3 cutover schedule. Adds one task to `phase-1-configuration-and-admin-scaffold.md` Stream E (Admin smoke test) | 2026-04-17 |
| 143 | Polly per-capability policy — Read / HistoryRead / Discover / Probe / Alarm-subscribe auto-retry; Write does NOT auto-retry unless the tag metadata carries `[WriteIdempotent]` | Decisions #44-45 forbid auto-retry on Write because a timed-out write can succeed on the device + be replayed by the pipeline, duplicating pulses / alarm acks / counter increments / recipe-step advances. Per-capability policy in the shared Polly layer makes the retry safety story explicit; `WriteIdempotentAttribute` on tag definitions is the opt-in surface | 2026-04-19 |
| 144 | Polly pipeline key = `(DriverInstanceId, HostName)`, not DriverInstanceId alone | Decision #35 requires per-device isolation. One dead PLC behind a multi-device Modbus driver must NOT open the circuit breaker for healthy sibling hosts. Per-instance pipelines would poison every device behind one bad endpoint | 2026-04-19 |
| 145 | Tier A/B/C runtime enforcement splits into `MemoryTracking` (all tiers — soft/hard thresholds log + surface, NEVER kill) and `MemoryRecycle` (Tier C only — requires out-of-process topology). Tier A/B hard-breach logs a promotion-to-Tier-C recommendation; the runtime never auto-kills an in-process driver | Decisions #73-74 reserve process-kill protections for Tier C. An in-process Tier A/B "recycle" would kill every OPC UA session + every other in-proc driver for one leaky instance, blast-radius worse than the leak | 2026-04-19 |
| 146 | Memory watchdog uses the hybrid formula `soft = max(multiplier × baseline, baseline + floor)`, with baseline captured as the median of the first 5 min of `GetMemoryFootprint()` samples post-InitializeAsync. Tier-specific constants: A multiplier=3 floor=50 MB, B multiplier=3 floor=100 MB, C multiplier=2 floor=500 MB. Hard = 2 × soft | Codex adversarial review on the Phase 6.1 plan flagged that hardcoded per-tier MB bands diverge from decision #70's specified formula. Static bands false-trigger on small-footprint drivers + miss meaningful growth on large ones. Observed-baseline + hybrid formula recovers the original intent | 2026-04-19 |
| 147 | `WedgeDetector` uses demand-aware criteria `(state==Healthy AND hasPendingWork AND noProgressIn > threshold)`. `hasPendingWork` = (Polly bulkhead depth > 0) OR (active MonitoredItem count > 0) OR (queued historian read count > 0). Idle + subscription-only + write-only-burst drivers stay Healthy without false-fault | Previous "no successful Read in N intervals" formulation flipped legitimate idle subscribers, slow historian backfills, and write-heavy drivers to Faulted. The demand-aware check only fires when the driver claims work is outstanding | 2026-04-19 |
| 148 | LiteDB config cache is **generation-sealed**: `sp_PublishGeneration` writes `<cache-root>/<cluster>/<generationId>.db` as a read-only sealed file; cache reads serve the last-known-sealed generation. Mixed-generation reads are impossible | Prior "refresh on every successful query" cache could serve LDAP role mapping from one generation alongside UNS topology from another, producing impossible states. Sealed-snapshot invariant keeps cache-served reads coherent with a real published state | 2026-04-19 |
| 149 | `AuthorizationDecision { Allow \| NotGranted \| Denied, IReadOnlyList<MatchedGrant> Provenance }` — tri-state internal model. Phase 6.2 only produces `Allow` + `NotGranted` (grant-only semantics per decision #129); v2.1 Deny widens without API break | bool return would collapse `no-matching-grant` and `explicit-deny` into the same runtime state + UI explanation; provenance record is needed for the audit log anyway. Making the shape tri-state from Phase 6.2 avoids a breaking change in v2.1 | 2026-04-19 |
| 150 | Data-plane ACL evaluator consumes `NodeAcl` rows joined against the session's resolved LDAP group memberships. `LdapGroupRoleMapping` (decision #105) is control-plane only — routes LDAP groups to Admin UI roles. Zero runtime overlap between the two | Codex adversarial review flagged that Phase 6.2 draft conflated the two — building the data-plane trie from `LdapGroupRoleMapping` would let a user inherit tag permissions from an admin-role claim path never intended as a data-path grant | 2026-04-19 |
| 151 | `UserAuthorizationState` cached per session but bounded by `MembershipFreshnessInterval` (default 15 min). Past that interval the next hot-path authz call re-resolves LDAP group memberships; failure to re-resolve (LDAP unreachable) → fail-closed (evaluator returns `NotGranted` until memberships refresh successfully) | Previous design cached memberships until session close, so a user removed from a privileged LDAP group could keep authorized access for hours. Bounded freshness + fail-closed covers the revoke-takes-effect story | 2026-04-19 |
| 152 | Auth cache has its own staleness budget `AuthCacheMaxStaleness` (default 5 min), independent of decision #36's availability-oriented config cache (24 h). Past 5 min on authorization data, evaluator fails closed regardless of whether the underlying config is still serving from cache | Availability-oriented caches trade correctness for uptime. Authorization data is correctness-sensitive — stale ACLs silently extend revoked access. Auth-specific budget keeps the two concerns from colliding | 2026-04-19 |
| 153 | MonitoredItem carries `(AuthGenerationId, MembershipVersion)` stamp at create time. On every Publish, items with a mismatching stamp re-evaluate; unchanged items stay fast-path. Revoked items drop to `BadUserAccessDenied` within one publish cycle | Create-time-only authorization leaves revoked users receiving data forever; per-publish re-authorization at 100 ms cadence across 50 groups × 6 levels is too expensive. Stamp-then-reevaluate-on-change balances correctness with cost | 2026-04-19 |
| 154 | ServiceLevel reserves `0` for operator-declared maintenance only; `1` = NoData (unreachable / Faulted); operational states occupy `2..255` in an 8-state matrix (Authoritative-Primary=255, Isolated-Primary=230, Primary-Mid-Apply=200, Recovering-Primary=180, Authoritative-Backup=100, Isolated-Backup=80, Backup-Mid-Apply=50, Recovering-Backup=30, InvalidTopology=2) | OPC UA Part 5 §6.3.34 defines `0=Maintenance` + `1=NoData`; using `0` for our Faulted case collides with spec + triggers spec-compliant clients to enter maintenance-mode cutover. Expanded 8-state matrix covers operational states the 5-state original collapsed together (e.g. Isolated-Primary vs Primary-Mid-Apply were both 200) | 2026-04-19 |
| 155 | `ServerUriArray` includes self + peers (self first, deterministic ordering), per OPC UA Part 4 §6.6.2.2 | Previous design excluded self from the array — spec violation + clients lose the ability to map server identities consistently during failover | 2026-04-19 |
| 156 | Redundancy peer health uses a two-layer probe: `/healthz` (2 s) as fast-fail + `UaHealthProbe` (10 s, opens OPC UA client session to peer + reads its `ServiceLevel` node) as the authority signal. HTTP-healthy ≠ UA-authoritative | `/healthz` returns 200 whenever HTTP + config DB/cache is healthy — but a peer can be HTTP-healthy with a broken OPC UA endpoint or a stuck subscription publisher. Using HTTP alone would advertise authority against servers that can't actually publish data | 2026-04-19 |
| 157 | Publish-generation fencing — coordinator CAS on a monotonic `ConfigGenerationId`; every topology + role decision is generation-stamped; peers reject state propagated from a lower generation. Runtime `InvalidTopology` state (both self-demote to ServiceLevel 2) when >1 Primary detected post-startup | Operator race publishing two drafts with different roles can produce two locally-valid views; without fencing + runtime containment both nodes can serve as Primary until manual intervention | 2026-04-19 |
| 158 | Apply-window uses named leases keyed to `(ConfigGenerationId, PublishRequestId)` via `await using`. `ApplyLeaseWatchdog` auto-closes any lease older than `ApplyMaxDuration` (default 10 min) | Simple `IDisposable`-counter design leaks on cancellation / async-ownership races; a stuck positive count leaves the node permanently mid-apply. Generation-keyed leases + watchdog bound worst case | 2026-04-19 |
| 159 | CSV import header row must start with `# OtOpcUaCsv v1` (version marker). Future shape changes bump the version; parser forks per version. Canonical identifier columns follow decision #117: `ZTag, MachineCode, SAPID, EquipmentId, EquipmentUuid` | Without a version marker the CSV schema has no upgrade path — adding a required column breaks every old export silently. The version prefix makes parser dispatch explicit + future-compatible | 2026-04-19 |
| 160 | Equipment CSV import uses a staged-import pattern: `EquipmentImportBatch` + `EquipmentImportRow` tables receive chunked inserts; `FinaliseImportBatch` is one atomic transaction that applies accepted rows to `Equipment` + `ExternalIdReservation`. Rollback = drop the batch row; `Equipment` never partially mutates | 10k-row single-transaction import holds locks too long; chunked direct writes lose all-or-nothing rollback. Staging + atomic finalize bounds transaction duration + preserves rollback semantics | 2026-04-19 |
| 161 | UNS drag-reorder impact preview carries a `DraftRevisionToken`; Confirm re-checks against the current draft + returns `409 Conflict / refresh-required` if the draft advanced between preview and commit | Without concurrency control, two operators editing the same draft can overwrite each other's changes silently. Draft-revision token + 409 response makes the race visible + forces refresh | 2026-04-19 |
| 162 | OPC 40010 Identification sub-folder exposed under each equipment node inherits the Equipment scope's ACL grants — the ACL trie does NOT add a new scope level for Identification | Adding a new scope level for Identification would require every grant to add a second grant for `Equipment/Identification`; inheriting the Equipment scope keeps the grant model flat + prevents operator-forgot-to-grant-Identification access surprises | 2026-04-19 |
## Reference Documents

485
docs/v2/s7.md Normal file
View File

@@ -0,0 +1,485 @@
# Siemens SIMATIC S7 (S7-1200 / S7-1500 / S7-300 / S7-400 / ET 200SP) — Modbus TCP quirks
Siemens S7 PLCs do *not* speak Modbus TCP natively at the OS/firmware level. Every
S7 Modbus-TCP-server deployment is either (a) the **`MB_SERVER`** library block
running on the CPU's PROFINET port (S7-1200 / S7-1500 / CPU 1510SP-series
ET 200SP), or (b) the **`MODBUSCP`** function block running on a separate
communication processor (**CP 343-1 / CP 343-1 Lean** on S7-300, **CP 443-1** on
S7-400), or (c) the **`MODBUSPN`** block on an S7-1500 PN port via a licensed
library. That means the quirks a Modbus client has to cope with are as much
"this is how the user's PLC programmer wired the library block up" as "this is
how the firmware behaves" — the byte-order and coil-mapping rules aren't
hard-wired into silicon like they are on a DL260. This document catalogues the
behaviours a driver has to handle across the supported model/CP variants, cites
primary sources, and names the ModbusPal integration test we'd write for each
(convention from `docs/v2/modbus-test-plan.md`: `S7_<model>_<behavior>`).
## Model / CP Capability Matrix
| PLC family | Modbus TCP server mechanism | Modbus TCP client mechanism | License required? | Typical port 502 source |
|---------------------|------------------------------------|------------------------------------|-----------------------|-----------------------------------------------------------|
| S7-1200 (V4.0+) | `MB_SERVER` on integrated PN port | `MB_CLIENT` | No (in TIA Portal) | CPU's onboard Ethernet [1][2] |
| S7-1500 (all) | `MB_SERVER` on integrated PN port | `MB_CLIENT` | No (in TIA Portal) | CPU's onboard Ethernet [1][3] |
| S7-1500 + CP 1543-1 | `MB_SERVER` on CP's IP | `MB_CLIENT` | No | Separate CP IP address [1] |
| ET 200SP CPU (1510SP, 1512SP) | `MB_SERVER` on PN port | `MB_CLIENT` | No | CPU's onboard Ethernet [3] |
| S7-300 + CP 343-1 / CP 343-1 Lean | `MODBUSCP` (FB `MODBUSCP`, instance DB per connection) | Same FB, client mode | **Yes — 2XV9450-1MB00** per CP | CP's Ethernet port [4][5] |
| S7-400 + CP 443-1 | `MODBUSCP` | `MODBUSCP` client mode | **Yes — 2XV9450-1MB00** per CP | CP's Ethernet port [4] |
| S7-400H + CP 443-1 (redundant H) | `MODBUSCP_REDUNDANT` / paired FBs | Not typical | Yes | Paired CPs in H-system [6] |
| S7-300 / S7-400 CPU PN (e.g. CPU 315-2 PN/DP) | `MODBUSPN` library | `MODBUSPN` client mode | **Yes** — Modbus-TCP PN CPU lib | CPU's PN port [7] |
| "CP 343-1 Lean" | **Server only** (no client mode supported by Lean) | — | Yes, but with restrictions | CP's Ethernet port [4][5] |
- **CP 343-1 Lean is server-only.** It can host `MODBUSCP` in server mode only;
client calls return an immediate error. A surprising number of "Lean + client
doesn't work" forum posts trace back to this [5].
- **Pure OPC UA / PROFINET CPs (CP 1542SP-1, CP 1543-1)** support Modbus TCP on
S7-1500 via the same `MB_SERVER`/`MB_CLIENT` instructions by passing the
CP's `hw_identifier`. There is no separate "Modbus CP" license needed on
S7-1500, unlike S7-300/400 [1].
- **No S7 Modbus server supports function codes 20/21 (file records),
22 (mask write), 23 (read-write multiple), or 43 (device identification).**
Sending any of these returns exception `01` (Illegal Function) on every S7
variant [1][4]. Our driver must not negotiate FC23 as a "bulk-read optimization"
when the profile is S7.
Test names:
`S7_1200_MBSERVER_Loads_OB1_Cyclic`,
`S7_CP343_Lean_Client_Mode_Rejected`,
`S7_All_FC23_Returns_IllegalFunction`.
## Address / DB Mapping
S7 Modbus servers **do not auto-expose PLC memory** — the PLC programmer has to
wire one area per Modbus table to a DB or process-image region. This is the
single biggest difference vs. DL205/Modicon/etc., where the memory map is
fixed at the factory. Our driver must therefore be tolerant of "the same
`40001` means completely different things on two S7-1200s on the same site."
### S7-1200 / S7-1500 `MB_SERVER`
The `MB_SERVER` instance exposes four Modbus tables to each connected client;
each table's backing storage is a per-block parameter [1][8]:
| Modbus table | FCs | Backing parameter | Default / typical backing |
|---------------------|-------------|-----------------------------|-----------------------------|
| Coils (0x) | FC01, FC05, FC15 | *implicit* — Q process image | `%Q0.0``%Q1023.7` (→ coil addresses 08191) [1][9] |
| Discrete Inputs (1x)| FC02 | *implicit* — I process image | `%I0.0``%I1023.7` (→ discrete addresses 08191) [1][9] |
| Input Registers (3x)| FC04 | *implicit* — M memory or DB (version-dependent) | Some firmware routes FC04 through the same MB_HOLD_REG buffer [1][8] |
| Holding Registers (4x)| FC03, FC06, FC16 | `MB_HOLD_REG` pointer | User DB (e.g. `DB10.DBW0`) or `%MW` area [1][2][8] |
- **`MB_HOLD_REG` is a pointer (VARIANT / ANY) into a user-defined DB** whose
first byte is holding-register 0 (`40001` in 1-based Modicon form). Byte
offset 2 is register 1, byte offset 4 is register 2, etc. [1][2].
- **The DB *must* have "Optimized block access" UNCHECKED.** Optimized DBs let
the compiler reorder fields for alignment; Modbus requires fixed byte
offsets. With optimized access on, the compiler accepts the project but
`MB_SERVER` returns STATUS `0x8383` (misaligned access) or silently reads
zeros [8][10][11]. This is the #1 support-forum complaint.
- **FC01/FC02/FC05/FC15 hit the Q and I process images directly — not the
`MB_HOLD_REG` DB.** Coil address 0 = `%Q0.0`, coil 1 = `%Q0.1`, coil 8 =
`%Q1.0`. The S7-1200 system manual publishes this mapping as `00001 → Q0.0`
through `09999 → Q1023.7` and `10001 → I0.0` through `19999 → I1023.7` in
1-based form; on the wire (0-based) that's coils 0-8191 and discrete inputs
0-8191 [9].
- **`%M` markers are NOT automatically exposed.** To expose `%M` over Modbus
the programmer must either (a) copy `%M` to the `MB_HOLD_REG` DB each scan,
or (b) define an Array\[0..n\] of Bool inside that DB and copy bits in/out
of `%M`. Siemens has no "MB_COIL_REG" parameter analogous to
`MB_HOLD_REG` — this confuses users migrating from Schneider [9][12].
- **Bit ordering within a Modbus holding register sourced from an `Array of
Bool`**: S7 stores bool\[0\] at `DBX0.0` which is bit 0 of byte 0 which is
the **low byte, low bit** of Modbus register `40001`. A naive client that
reads register `40001` and masks `0x0001` gets bool\[0\]. A client that
masks `0x8000` gets bool\[15\] because the high byte of the Modbus register
is the *second* byte of the DB. Siemens programmers routinely get this
wrong in the DB-via-DBX form; `Array[0..n] of Bool` is the recommended
layout because it aligns naturally [12][13].
### S7-300/400 + CP 343-1 / CP 443-1 `MODBUSCP`
Different paradigm: per-connection **parameter DB** (template
`MODBUS_PARAM_CP`) declares a table of up to 8 register-area mappings. Each
mapping is a tuple `(data_type, DB#, start_offset, length)` where `data_type`
picks the Modbus table [4]:
- `B#16#1` = Coils
- `B#16#2` = Discrete Inputs
- `B#16#3` = Holding Registers
- `B#16#4` = Input Registers
The `holding_register_start` and analogous `coils_start` parameters declare
**which Modbus address range** the CP will serve, and the DB pointers say
where in S7 memory that range lives [4][14]. Unlike `MB_SERVER`, the CP does
not reach into `%Q`/`%I` directly — *everything* goes through a DB. If an
address outside the declared ranges is requested, the CP returns exception
`02` (Illegal Data Address) [4].
Test names:
`S7_1200_FC03_Reg0_Reads_DB10_DBW0`,
`S7_1200_Optimized_DB_Returns_0x8383_MisalignedAccess`,
`S7_1200_FC01_Coil0_Reads_Q0_0`,
`S7_CP343_FC03_Outside_ParamBlock_Range_Returns_IllegalDataAddress`.
## Data Types and Byte Order
Siemens CPUs store scalars **big-endian** internally ("Motorola format"), which
is the same byte order Modbus specifies inside each register. So for 16-bit
values (`Int`, `Word`, `UInt`) the on-the-wire layout is straightforward
`AB` — high byte of the PLC value in the high byte of the Modbus register
[15][16]. No byte-swap trap for 16-bit types.
The trap is 32-bit types (`DInt`, `DWord`, `Real`). Here's what actually
happens across the S7 family:
### S7-1200 / S7-1500 `MB_SERVER`
- **The backing DB stores 32-bit values in big-endian byte order, high word
first** — i.e. `ABCD` when viewed as two consecutive Modbus registers. A
`Real` at `DB10.DBD0` with value `0x12345678` reads over Modbus as
register 0 = `0x1234`, register 1 = `0x5678` [15][16][17].
- **This is `ABCD`, *not* `CDAB`.** Clients that hard-code CDAB (common default
for meters and VFDs) will get wildly wrong floats. Configure the S7 profile
with `WordOrder = ABCD` (aka "big-endian word + big-endian byte" aka
"high-word first") [15][17].
- **`MB_SERVER` does not swap.** It's a direct memcpy from the DB bytes to
the Modbus payload. Whatever byte order the ladder programmer stored into
the DB is what the client receives [17]. This means a programmer who used
`MOVE_BLK` from two separate `Word`s into `DBD` with the "wrong" order can
produce `CDAB` without realising.
- **`Real` is IEEE 754 single-precision** — unambiguous, no BCD trap like on
DL series [15].
- **Strings**: S7 `String[n]` has a 2-byte header (max length, current length)
*before* the character bytes. A client reading a string over Modbus gets
the header in the first register and then the characters two-per-register
in high-byte-first order. `WString` is UTF-16 and the header is 4 bytes
[18]. Our driver's string decoder must expose the "skip header" option for
S7 profile.
### S7-300/400 `MODBUSCP` (CP 343-1 / CP 443-1)
- The CP writes the exact DB bytes onto the wire — again `ABCD` if the DB
stores `DInt`/`Real` in native Siemens order [4].
- **`MODBUSCP` has no `data_type` byte-swap knob.** (The `data_type` parameter
names the Modbus table, not the byte order — see the Address Mapping
section.) If the other end of the link expects `CDAB`, the programmer has
to swap words in ladder before writing the DB [4][14].
### Operator-reported oddity
- Some S7 drivers (Kepware's "Siemens TCP/IP Ethernet" driver, Ignition's
"Siemens S7" driver) expose a per-tag `Float Byte Order` with options
`ABCD`/`CDAB`/`BADC`/`DCBA` because end-users have encountered every
permutation in the field — not because the PLC natively swaps, but because
ladder programmers have historically stored floats every which way [19].
Our S7 Modbus profile should default to `ABCD` but expose a per-tag
override.
- **Unconfirmed rumour**: that S7-1500 firmware V2.0+ reverses float byte
order for `MB_CLIENT` only. Not reproduced; the Siemens forum thread that
launched it was a user error (the remote server was the swapper, not the
S7) [20]. Treat as false until proven.
Test names:
`S7_1200_Real_WordOrder_ABCD_Default`,
`S7_1200_DInt_HighWord_First_At_DBD0`,
`S7_1200_String_Header_First_Two_Bytes`,
`S7_CP343_No_Internal_ByteSwap`.
## Coil / Discrete Input Mapping
On `MB_SERVER` the mapping from coil address → S7 bit is fixed at the
process-image level [1][9][12]:
| Modbus coil / discrete input addr | S7 address | Notes |
|-----------------------------------|---------------|-------------------------------------|
| Coil 0 (FC01/05/15) | `%Q0.0` | bit 0 of output byte 0 |
| Coil 7 | `%Q0.7` | bit 7 of output byte 0 |
| Coil 8 | `%Q1.0` | bit 0 of output byte 1 |
| Coil 8191 (max) | `%Q1023.7` | highest exposed output bit |
| Discrete input 0 (FC02) | `%I0.0` | bit 0 of input byte 0 |
| Discrete input 8191 | `%I1023.7` | highest exposed input bit |
Formulas:
```
coil_addr = byte_index * 8 + bit_index (e.g. %Q5.3 → coil 43)
discr_addr = byte_index * 8 + bit_index (e.g. %I10.2 → disc 82)
```
- **1-based Modicon form adds 1:** coil 0 (wire) = `00001` (Modicon), etc.
Our driver sends the 0-based PDU form, so `%Q0.0` writes to wire address 0.
- **Writing FC05/FC15 to `%Q` is accepted even while the CPU is in STOP** —
the PLC's process image doesn't care about the user program state. But the
output won't propagate to the physical module until RUN (see STOP section
below) [1][21].
- **`%M` markers require a DB-backed `Array of Bool`** as described in the
Address Mapping section. Our driver can't assume "coil N = MN.0" like it
can on Modicon — on S7 it's always Q/I unless the programmer built a
mapping DB [12].
- **Bit-inside-holding-register**: for `Array of Bool` inside the
`MB_HOLD_REG` DB, bool[0] is bit 0 of byte 0 → **low byte, low bit** of
Modbus register 40001. Most third-party clients probe this in the low
byte, so the common case works; the less-common case (bool[8]) is bit 0 of
byte 1 → **high byte, low bit** of Modbus register 40001. Clients that
test only bool[0] will pass and miss the mis-alignment on bool[8] [12][13].
Test names:
`S7_1200_Coil_0_Is_Q0_0`,
`S7_1200_Coil_8_Is_Q1_0`,
`S7_1200_Discrete_Input_7_Is_I0_7`,
`S7_1200_Coil_Write_In_STOP_Accepted_But_Output_Frozen`.
## Function Code Support & Max Registers Per Request
| FC | Name | S7-1200 / S7-1500 MB_SERVER | CP 343-1 / CP 443-1 MODBUSCP | Max qty per request |
|----|----------------------------|-----------------------------|------------------------------|--------------------------------|
| 01 | Read Coils | Yes | Yes | 2000 bits (spec) |
| 02 | Read Discrete Inputs | Yes | Yes | 2000 bits (spec) |
| 03 | Read Holding Registers | Yes | Yes | **125** (spec max) |
| 04 | Read Input Registers | Yes | Yes | **125** |
| 05 | Write Single Coil | Yes | Yes | 1 |
| 06 | Write Single Register | Yes | Yes | 1 |
| 15 | Write Multiple Coils | Yes | Yes | 1968 bits (spec) — *see note* |
| 16 | Write Multiple Registers | Yes | Yes | **123** (spec max for TCP) |
| 07 | Read Exception Status | No (RTU only) | No | — |
| 17 | Report Server ID | No | No | — |
| 20/21 | Read/Write File Record | No | No | — |
| 22 | Mask Write Register | No | No | — |
| 23 | Read/Write Multiple | No | No | — |
| 43 | Read Device Identification | No | No | — |
- **S7-1200/1500 honour the full spec maxima** for FC03/04 (125) and FC16
(123) [1][22]. No sub-spec cap like DL260's 100-register FC16 limit.
- **FC15 (Write Multiple Coils) on `MB_SERVER`** writes into `%Q`, which maxes
out at 1024 bytes = 8192 bits, but the spec's 1968-bit per-request limit
caps any single call first [1][9].
- **`MB_HOLD_REG` buffer size is bounded by DB size** — max DB size on
S7-1200 is 64 KB, on S7-1500 is much larger (several MB depending on CPU),
so the practical `MB_HOLD_REG` limit is 32767 16-bit registers on S7-1200
and effectively unbounded on S7-1500 [22][23]. The *per-request* limit is
still 125.
- **Read past the end of `MB_HOLD_REG`** returns exception `02` (Illegal
Data Address) at the start of the overflow register, not a partial read
[1][8].
- **Request larger than spec max** (e.g. FC03 quantity 126) returns exception
`03` (Illegal Data Value). Verified on S7-1200 V4.2 [1][24].
- **CP 343-1 `MODBUSCP` per-request maxima are spec** (125/125/123/1968/2000),
matching the standard [4]. The CP's `MODBUS_PARAM_CP` caps the total
*exposed* range, not the per-call quantity.
Test names:
`S7_1200_FC03_126_Registers_Returns_IllegalDataValue`,
`S7_1200_FC16_124_Registers_Returns_IllegalDataValue`,
`S7_1200_FC03_Past_MB_HOLD_REG_End_Returns_IllegalDataAddress`,
`S7_1200_FC17_ReportServerId_Returns_IllegalFunction`.
## Exception Codes
S7 Modbus servers return only the four standard exception codes [1][4]:
| Code | Name | Triggered by |
|------|-----------------------|----------------------------------------------------------------------|
| 01 | Illegal Function | FC not in the supported list (17, 20-23, 43, any undefined FC) |
| 02 | Illegal Data Address | Register outside `MB_HOLD_REG` / outside `MODBUSCP` param-block range |
| 03 | Illegal Data Value | Quantity exceeds spec (FC03/04 > 125, FC16 > 123, FC01/02 > 2000, FC15 > 1968) |
| 04 | Server Failure | Runtime error inside MB_SERVER (DB access fault, corrupt DB header, MB_SERVER disabled mid-request) [1][24] |
- **No proprietary exception codes (05/06/0A/0B) are used** on any S7
Modbus server [1][4]. Our driver's status-code mapper can treat these as
"never observed" on the S7 profile.
- **CPU in STOP → `MB_SERVER` keeps running if it's in OB1 of the firmware's
communication task, but OB1 itself is not scanned.** In practice:
- Holding-register *reads* (FC03) continue to return the last DB values
frozen at the moment the CPU entered STOP. The `MB_SERVER` block is in
OB1 so it isn't re-invoked; however the TCP stack keeps the socket open
and returns cached data on subsequent polls [1][21]. **Unconfirmed**
whether this is cached in the CP or in the CPU's communication processor;
behaviour varies between firmware 4.0 and 4.5 [21].
- Holding-register *writes* (FC06/FC16) during STOP return exception `04`
(Server Failure) on S7-1200 V4.2+, and return success-but-discarded on
older firmware [1][24]. Our driver should treat FC06/FC16 during STOP as
non-deterministic and not rely on the response code.
- Coil *writes* (FC05/FC15) to `%Q` are *accepted* by the process image
during STOP, but the physical output freezes at its last RUN-mode value
(or the configured STOP-mode substitute value) until RUN resumes [1][21].
- **Writing a read-only address via FC06/FC16**: returns `02` (Illegal Data
Address), not `04`. S7 does not have "write-protected" holding registers —
the programmer either exposes a DB for read-write or doesn't expose it at
all [1][12].
STATUS codes (returned in the `STATUS` output of the block, not on the wire):
- `0x0000` — no error.
- `0x7001` — first call, connection being established.
- `0x7002` — subsequent cyclic call, connection in progress.
- `0x8383` — data access error (optimized DB, DB too small, or type mismatch)
[10][24].
- `0x8188` — invalid parameter combination (e.g. MB_MODE out of range) [24].
- `0x80C8` — mismatched UNIT_ID between MB_CLIENT and `MB_SERVER` [25].
Test names:
`S7_1200_FC03_Outside_HoldReg_Returns_IllegalDataAddress`,
`S7_1200_FC16_In_STOP_Returns_ServerFailure`,
`S7_1200_FC03_In_STOP_Returns_Cached_Values`,
`S7_1200_No_Proprietary_ExceptionCodes_0x05_0x06_0x0A_0x0B`.
## Connection Behavior
- **Max simultaneous Modbus TCP connections**:
- **S7-1200**: shares a pool of 8 open-communication connections across
all TCP/UDP/Modbus use. On a CPU 1211C you get 8 total; on 1215C/1217C
still 8 shared among PG/HMI/OUC/Modbus. Each `MB_SERVER` instance
reserves one. A typical site with a PG + 1 HMI + 2 Modbus clients uses
4 of the 8 [1][26].
- **S7-1500**: up to **8 concurrent Modbus TCP server connections** per
`MB_SERVER` port, across multiple `MB_SERVER` instance DBs each with a
unique port. Total open-communication resources depend on CPU (e.g.
CPU 1515-2 PN supports 128 OUC connections total; Modbus is a subset)
[1][27].
- **CP 343-1 Lean**: up to **8** simultaneous Modbus TCP connections on
port 502 [4][5]. Exceeding this refuses at TCP accept.
- **CP 443-1 Advanced**: up to **16** simultaneous Modbus TCP connections
[4].
- **Multi-connection model on `MB_SERVER`**: one instance DB per connection.
An instance DB listening on port 502 serves exactly one connection at a
time; to serve N simultaneous clients you need N instance DBs each with a
unique port (502/503/504...). **This is a real trap** — most users expect
port 502 to multiplex [27][28]. Our driver must not assume port 502 is the
only listener.
- **Keep-alive**: S7-1500's TCP stack does send TCP keepalives (default
every ~30 s) but the interval is not exposed as a configurable. S7-1200 is
the same. CP 343-1 keepalives are configured via HW Config → CP properties
→ Options → "Send keepalive" (default **off** on older firmware, default
**on** on firmware V3.0+) [1][29]. Driver-side keepalive is still
advisable for S7-300/CP 343-1 on old firmware.
- **Idle-timeout close**: `MB_SERVER` does *not* close idle sockets on its
own. However, the TCP stack on S7-1500 will close a socket that fails
three consecutive keepalive probes (~2 minutes). Forum reports describe
`MB_SERVER` connections "dying overnight" on S7-1500 when an HMI stops
polling — the fix is to enable driver-side periodic reads or driver-side
TCP keepalive [29][30].
- **Reconnect after power cycle**: MB_SERVER starts listening ~1-2 seconds
after the CPU reaches RUN. If the client reconnects during STARTUP OB
(OB100), the connection is refused until OB1 runs the block at least once.
Our driver should back off and retry on `ECONNREFUSED` for the first 5
seconds after a power-cycle detection [1][24].
- **Unit Identifier**: `MB_SERVER` accepts **any** Unit ID by default — there
is no configurable filter; the PLC ignores the Unit ID field entirely.
`MB_CLIENT` defaults to Unit ID = 255 as "ignore" [25][31]. Some
third-party Modbus-TCP gateways *require* a specific Unit ID; sending
anything to S7 is safe. **CP 343-1 `MODBUSCP`** also accepts any Unit ID
in server mode, but the parameter DB exposes a `single_write` / `unit_id`
field on newer firmware to allow filtering [4].
Test names:
`S7_1200_9th_TCP_Connection_Refused_On_8_Conn_Pool`,
`S7_1500_Port_503_Required_For_Second_Instance`,
`S7_1200_Reconnect_After_Power_Cycle_Succeeds_Within_5s`,
`S7_1200_Unit_ID_Ignored_Any_Accepted`.
## Behavioral Oddities
- **Transaction ID echo** is reliable on all S7 variants. `MB_SERVER` copies
the MBAP TxId verbatim. No known firmware that drops TxId under load [1][31].
- **Request serialization**: a single `MB_SERVER` instance serializes
requests from its one connected client — the block processes one PDU per
call and calls happen once per OB1 scan. OB1 scan time of 5-50 ms puts an
upper bound on throughput at ~20-200 requests/sec per connection [1][30].
Multiple `MB_SERVER` instances (one per port) run in parallel because OB1
calls them sequentially within the same scan.
- **OB1 scan coupling**: `MB_SERVER` must be called cyclically from OB1 (or
another cyclic OB). If the programmer puts it in a conditional branch
that doesn't fire every scan, requests time out. The STATUS `0x7002`
"in progress" is *expected* between calls, not an error [1][24].
- **Optimized DB backing `MB_HOLD_REG`** — already covered in Address
Mapping; STATUS becomes `0x8383`. This is the most common deployment bug
on S7-1500 projects migrated from older S7-1200 examples [10][11].
- **CPU STOP behaviour** — covered in Exception Codes section. The short
version: reads may return stale data without error; writes return exception
04 on modern firmware.
- **Partial-frame disconnect**: S7-1200/1500 TCP stack closes the socket on
any MBAP header where the `Length` field doesn't match the PDU length.
Driver must detect half-close and reconnect [1][29].
- **MBAP `Protocol ID` must be 0**. Any non-zero value causes the CP/CPU to
drop the frame silently (no response, no RST) on S7-1500 firmware V2.0
through V2.9; firmware V3.0+ sends an RST [1][30]. *Unconfirmed* whether
V3.1 still sends RST or returns to silent drop.
- **FC01/FC02 access outside `%Q`/`%I` range**: on S7-1200, requesting
coil address 8192 (= `%Q1024.0`) returns exception `02` (Illegal Data
Address) [1][9]. The 8192-bit hard cap is a process-image size limit on
the CPU, not a Modbus protocol limit.
- **`MB_CLIENT` UNIT_ID mismatch with remote `MB_SERVER`** produces STATUS
`0x80C8` on the client side, and the server silently discards the frame
(no response on the wire) [25]. This matters for Modbus-TCP-to-RTU
gateway scenarios where the Unit ID picks the RTU slave.
- **Non-IEEE REAL / BCD**: S7 does *not* use BCD like DirectLOGIC. `Real` is
always IEEE 754 single-precision. `LReal` (8-byte double) occupies 4
Modbus registers in `ABCDEFGH` order (big-endian byte, big-endian word)
[15][18].
- **`MODBUSCP` single-write** on CP 343-1: a parameter `single_write` in the
param DB controls whether FC06 on a register in the "holding register"
area triggers a callback to the user program vs. updates the DB directly.
Default is direct update. If a ladder programmer enables the callback
without implementing the callback OB, FC06 writes hang for 5 seconds then
return exception `04` [4].
Test names:
`S7_1200_TxId_Preserved_Across_Burst_Of_50_Requests`,
`S7_1200_MBSERVER_Throughput_Capped_By_OB1_Scan`,
`S7_1200_MBAP_ProtocolID_NonZero_Frame_Dropped`,
`S7_1200_Partial_MBAP_Causes_Half_Close`.
## Model-specific Differences Worth Separate Test Coverage
- **S7-1200 V4.0 vs V4.4+**: Older firmware does not support `WString` over
`MB_HOLD_REG` and returns `0x8383` if the DB contains one [18][24]. Test
both firmware bands separately.
- **S7-1500 vs S7-1200**: S7-1500 supports multiple `MB_SERVER` instances on
the *same* CPU with different ports cleanly; S7-1200 can too but its
8-connection pool is shared tighter [1][27]. Throughput per-connection is
~5× faster on S7-1500 because the comms task runs on a dedicated core.
- **S7-300 + CP 343-1 vs S7-1200/1500**: parameter-block mapping (not
`MB_HOLD_REG` pointer), per-connection license, no `%Q`/`%I` direct
access for coils (everything goes through a DB), different STATUS codes
(`DONE`/`ERROR`/`STATUS` word pairs vs. the single STATUS word) [4][14].
Driver-side it's a different profile.
- **CP 343-1 Lean vs CP 343-1 Advanced**: Lean is server-only; Advanced is
client + server. Lean's max connections = 8; Advanced = 16 [4][5].
- **CP 443-1 in S7-400H**: uses `MODBUSCP_REDUNDANT` which presents two
Ethernet endpoints that fail over. Our driver's redundancy support should
recognize the S7-400H profile as "two IP addresses, same server state,
advertise via `ServerUriArray`" [6].
- **ET 200SP CPU (1510SP / 1512SP)**: behaves as S7-1500 from `MB_SERVER`
perspective. No known deltas [3].
## References
1. Siemens Industry Online Support, *Modbus/TCP Communication between SIMATIC S7-1500 / S7-1200 and Modbus/TCP Controllers with Instructions `MB_CLIENT` and `MB_SERVER`*, Entry ID 102020340, V6 (Feb 2021). https://cache.industry.siemens.com/dl/files/340/102020340/att_118119/v6/net_modbus_tcp_s7-1500_s7-1200_en.pdf
2. Siemens TIA Portal Online Docs, *MB_SERVER instruction*. https://docs.tia.siemens.cloud/r/simatic_s7_1200_manual_collection_eses_20/communication-processor-and-modbus-tcp/modbus-communication/modbus-tcp/modbus-tcp-instructions/mb_server-communicate-using-profinet-as-modbus-tcp-server-instruction
3. Siemens, *SIMATIC S7-1500 Communication Function Manual* (covers ET 200SP CPU). http://public.eandm.com/Public_Docs/s71500_communication_function_manual_en-US_en-US.pdf
4. Siemens Industry Online Support, *SIMATIC Modbus/TCP communication using CP 343-1 and CP 443-1 — Programming Manual*, Entry ID 103447617. https://cache.industry.siemens.com/dl/files/617/103447617/att_106971/v1/simatic_modbus_tcp_cp_en-US_en-US.pdf
5. Siemens Industry Online Support FAQ *"Which technical data applies for the SIMATIC Modbus/TCP software for CP 343-1 / CP 443-1?"*, Entry ID 104946406. https://www.industry-mobile-support.siemens-info.com/en/article/detail/104946406
6. Siemens Industry Online Support, *Redundant Modbus/TCP communication via CP 443-1 in S7-400H systems*, Entry ID 109739212. https://cache.industry.siemens.com/dl/files/212/109739212/att_887886/v1/SIMATIC_modbus_tcp_cp_red_e_en-US.pdf
7. Siemens Industry Online Support, *SIMATIC MODBUS (TCP) PN CPU Library — Programming and Operating Manual 06/2014*, Entry ID 75330636. https://support.industry.siemens.com/cs/attachments/75330636/ModbusTCPPNCPUen.pdf
8. DMC Inc., *Using an S7-1200 PLC as a Modbus TCP Slave*. https://www.dmcinfo.com/blog/27313/using-an-s7-1200-plc-as-a-modbus-tcp-slave/
9. Siemens, *SIMATIC S7-1200 System Manual* (V4.x), "MB_SERVER" pages 736-742. https://www.manualslib.com/manual/1453610/Siemens-S7-1200.html?page=736
10. lamaPLC, *Simatic Modbus S7 error- and statuscodes*. https://www.lamaplc.com/doku.php?id=simatic:errorcodes
11. ScadaProtocols, *How to Configure Modbus TCP on Siemens S7-1200 (TIA Portal Step-by-Step)*. https://scadaprotocols.com/modbus-tcp-siemens-s7-1200-tia-portal/
12. Industrial Monitor Direct, *Reading and Writing Memory Bits via Modbus TCP on S7-1200*. https://industrialmonitordirect.com/blogs/knowledgebase/reading-and-writing-memory-bits-via-modbus-tcp-on-s7-1200
13. PLCtalk forum *"Siemens S7-1200 modbus understanding"*. https://www.plctalk.net/forums/threads/siemens-s7-1200-modbus-understanding.104119/
14. Siemens SIMATIC S7 Manual, "Function block MODBUSCP — Functionality" (ManualsLib p29). https://www.manualslib.com/manual/1580661/Siemens-Simatic-S7.html?page=29
15. Chipkin, *How Real (Floating Point) and 32-bit Data is Encoded in Modbus*. https://store.chipkin.com/articles/how-real-floating-point-and-32-bit-data-is-encoded-in-modbus-rtu-messages
16. Siemens Industry Online Support forum, *MODBUS DATA conversion in S7-1200 CPU*, Entry ID 97287. https://support.industry.siemens.com/forum/WW/en/posts/modbus-data-converson-in-s7-1200-cpu/97287
17. Industrial Monitor Direct, *Siemens S7-1500 MB_SERVER Modbus TCP Configuration Guide*. https://industrialmonitordirect.com/de/blogs/knowledgebase/siemens-s7-1500-mb-server-modbus-tcp-configuration-guide
18. Siemens TIA Portal, *Data types in SIMATIC S7-1200/1500 — String/WString header layout* (system manual, "Elementary Data Types").
19. Kepware / PTC, *Siemens TCP/IP Ethernet Driver Help*, "Byte / Word Order" tag property. https://www.opcturkey.com/uploads/siemens-tcp-ip-ethernet-manual.pdf
20. Siemens SiePortal forum, *Transfer float out of words*, Entry ID 187811. https://sieportal.siemens.com/en-ww/support/forum/posts/transfer-float-out-of-words/187811 _(operator-reported "S7 swaps float" claim — traced to remote-device issue; **unconfirmed**.)_
21. Siemens SiePortal forum, *S7-1200 communication with Modbus TCP*, Entry ID 133086. https://support.industry.siemens.com/forum/WW/en/posts/s7-1200-communication-with-modbus-tcp/133086
22. Siemens SiePortal forum, *S7-1500 MB Server Holding Register Max Word*, Entry ID 224636. https://support.industry.siemens.com/forum/WW/en/posts/s7-1500-mb-server-holding-register-max-word/224636
23. Siemens, *SIMATIC S7-1500 Technical Specifications* — CPU-specific DB size limits in each CPU manual's "Memory" table.
24. Siemens TIA Portal Online Docs, *Error messages (S7-1200, S7-1500) — Modbus instructions*. https://docs.tia.siemens.cloud/r/en-us/v20/modbus-rtu-s7-1200-s7-1500/error-messages-s7-1200-s7-1500
25. Industrial Monitor Direct, *Fix Siemens S7-1500 MB_Client UnitID Error 80C8*. https://industrialmonitordirect.com/blogs/knowledgebase/troubleshooting-mb-client-on-s7-1500-cpu-1515sp-modbus-tcp
26. Siemens SiePortal forum, *How many TCP connections can the S7-1200 make?*, Entry ID 275570. https://support.industry.siemens.com/forum/WW/en/posts/how-many-tcp-connections-can-the-s7-1200-make/275570
27. Siemens SiePortal forum, *Simultaneous connections of Modbus TCP*, Entry ID 189626. https://support.industry.siemens.com/forum/ww/en/posts/simultaneous-connections-of-modbus-tcp/189626
28. Siemens SiePortal forum, *How many Modbus TCP IP clients can read simultaneously from S7-1517*, Entry ID 261569. https://support.industry.siemens.com/forum/WW/en/posts/how-many-modbus-tcp-ip-client-can-read-simultaneously-in-s7-1517/261569
29. Industrial Monitor Direct, *Troubleshooting Intermittent Modbus TCP Connections on S7-1500 PLC*. https://industrialmonitordirect.com/blogs/knowledgebase/troubleshooting-intermittent-modbus-tcp-connections-on-s7-1500-plc
30. PLCtalk forum *"S7-1500 modbus tcp speed?"*. https://www.plctalk.net/forums/threads/s7-1500-modbus-tcp-speed.114046/
31. Siemens SiePortal forum, *MB_Unit_ID parameter in Modbus TCP*, Entry ID 156635. https://support.industry.siemens.com/forum/WW/en/posts/mb-unit-id-parameter-in-modbus-tcp/156635

View File

@@ -189,6 +189,43 @@ Modbus has no native String, DateTime, or Int64 — those rows are skipped on th
- **ab_server tag-type coverage is finite** (BOOL, DINT, REAL, arrays, basic strings). UDTs and `Program:` scoping are not fully implemented. Document an "ab_server-supported tag set" in the harness and exclude the rest from default CI; UDT coverage moves to the Studio 5000 Emulate golden-box tier.
- CIP has no native subscriptions, so polling behavior matches real hardware.
### CI fixture (task #180)
The integration harness at `tests/ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests/` exposes two test-time contracts:
- **`AbServerFixture(AbServerProfile)`** — starts the simulator with the CLI args composed from the profile's `--plc` family + seed-tag set. One fixture instance per family, one simulator process per test case (smoke tier). For larger suites that can share a simulator across several reads/writes, use a `IClassFixture<AbServerFixture>` wrapper per family.
- **`KnownProfiles.{ControlLogix, CompactLogix, Micro800, GuardLogix}`** — the four per-family profiles. Drives the simulator's `--plc` mode + the preseed `--tag name:type[:size]` set. Micro800 + GuardLogix fall back to `controllogix` under the hood because ab_server has no dedicated mode for them — the driver-side family profile still enforces the narrower connection shape / safety classification separately.
**Pinned version** (recorded in `ci/ab-server.lock.json` so drift is one-file visible):
- `libplctag` **v2.6.16** (published 2026-03-29) — `ab_server.exe` ships inside the `_tools.zip` asset alongside `plctag.dll` + two `list_tags_*` helpers.
- Windows x64: `libplctag_2.6.16_windows_x64_tools.zip` — SHA256 `9b78a3dee73d9cd28ca348c090f453dbe3ad9d07ad6bf42865a9dc3a79bc2232`
- Windows x86: `libplctag_2.6.16_windows_x86_tools.zip` — SHA256 `fdfefd58b266c5da9a1ded1a430985e609289c9e67be2544da7513b668761edf`
- Windows ARM64: `libplctag_2.6.16_windows_arm64_tools.zip` — SHA256 `d747728e4c4958bb63b4ac23e1c820c4452e4778dfd7d58f8a0aecd5402d4944`
**CI step:**
```yaml
# GitHub Actions step placed before `dotnet test`:
- name: Fetch ab_server (libplctag v2.6.16)
shell: pwsh
run: |
$pin = Get-Content ci/ab-server.lock.json | ConvertFrom-Json
$asset = $pin.assets.'windows-x64' # swap to windows-x86 / windows-arm64 on non-x64 runners
$url = "https://github.com/libplctag/libplctag/releases/download/$($pin.tag)/$($asset.file)"
$zip = Join-Path $env:RUNNER_TEMP 'libplctag-tools.zip'
Invoke-WebRequest $url -OutFile $zip
$actual = (Get-FileHash -Algorithm SHA256 $zip).Hash.ToLower()
if ($actual -ne $asset.sha256) { throw "libplctag tools SHA256 mismatch: expected $($asset.sha256), got $actual" }
$dest = Join-Path $env:RUNNER_TEMP 'libplctag-tools'
Expand-Archive $zip -DestinationPath $dest
Add-Content $env:GITHUB_PATH $dest
```
The fixture's `LocateBinary()` picks the binary up off PATH so the C# harness doesn't own the download — CI YAML is the right place for version pinning + hash verification. Developer workstations install the binary once from source (`cmake + make ab_server` under a libplctag clone) and the same fixture works identically.
Tests without ab_server on PATH are marked `Skip` via `AbServerFactAttribute` / `AbServerTheoryAttribute`, so fresh-clone runs without the simulator still pass all unit suites in this project.
---
## 3. Allen-Bradley Legacy (SLC 500 / MicroLogix, PCCC)

View File

@@ -0,0 +1,109 @@
# v2 Release Readiness
> **Last updated**: 2026-04-19 (all three release blockers CLOSED — Phase 6.3 Streams A/C core shipped)
> **Status**: **RELEASE-READY (code-path)** for v2 GA — all three code-path release blockers are closed. Remaining work is manual (client interop matrix, deployment checklist signoff, OPC UA CTT pass) + hardening follow-ups; see exit-criteria checklist below.
This doc is the single view of where v2 stands against its release criteria. Update it whenever a deferred follow-up closes or a new release blocker is discovered.
## Release-readiness dashboard
| Phase | Shipped | Status |
|---|---|---|
| Phase 0 — Rename + entry gate | ✓ | Shipped |
| Phase 1 — Configuration + Admin scaffold | ✓ | Shipped (some UI items deferred to 6.4) |
| Phase 2 — Galaxy driver split (Proxy/Host/Shared) | ✓ | Shipped |
| Phase 3 — OPC UA server + LDAP + security profiles | ✓ | Shipped |
| Phase 4 — Redundancy scaffold (entities + endpoints) | ✓ | Shipped (runtime closes in 6.3) |
| Phase 5 — Drivers | ⚠ partial | Galaxy / Modbus / S7 / OpcUaClient shipped; AB CIP / AB Legacy / TwinCAT / FOCAS deferred (task #120) |
| Phase 6.1 — Resilience & Observability | ✓ | **SHIPPED** (PRs #7883) |
| Phase 6.2 — Authorization runtime | ◐ core | **SHIPPED (core)** (PRs #8488); dispatch wiring + Admin UI deferred |
| Phase 6.3 — Redundancy runtime | ◐ core | **SHIPPED (core)** (PRs #8990); coordinator + UA-node wiring + Admin UI + interop deferred |
| Phase 6.4 — Admin UI completion | ◐ data layer | **SHIPPED (data layer)** (PRs #9192); Blazor UI + OPC 40010 address-space wiring deferred |
**Aggregate test counts:** 906 baseline (pre-Phase-6) → **1159 passing** across Phase 6. One pre-existing Client.CLI `SubscribeCommandTests.Execute_PrintsSubscriptionMessage` flake tracked separately.
## Release blockers (must close before v2 GA)
Ordered by severity + impact on production fitness.
### ~~Security — Phase 6.2 dispatch wiring~~ (task #143 — **CLOSED** 2026-04-19, PR #94)
**Closed**. `AuthorizationGate` + `NodeScopeResolver` now thread through `OpcUaApplicationHost → OtOpcUaServer → DriverNodeManager`. `OnReadValue` + `OnWriteValue` + all four HistoryRead paths call `gate.IsAllowed(identity, operation, scope)` before the invoker. Production deployments activate enforcement by constructing `OpcUaApplicationHost` with an `AuthorizationGate(StrictMode: true)` + populating the `NodeAcl` table.
Additional Stream C surfaces (not release-blocking, hardening only):
- Browse + TranslateBrowsePathsToNodeIds gating with ancestor-visibility logic per `acl-design.md` §Browse.
- CreateMonitoredItems + TransferSubscriptions gating with per-item `(AuthGenerationId, MembershipVersion)` stamp so revoked grants surface `BadUserAccessDenied` within one publish cycle (decision #153).
- Alarm Acknowledge / Confirm / Shelve gating.
- Call (method invocation) gating.
- Finer-grained scope resolution — current `NodeScopeResolver` returns a flat cluster-level scope. Joining against the live Configuration DB to populate UnsArea / UnsLine / Equipment path is tracked as Stream C.12.
- 3-user integration matrix covering every operation × allow/deny.
These are additional hardening — the three highest-value surfaces (Read / Write / HistoryRead) are now gated, which covers the base-security gap for v2 GA.
### ~~Config fallback — Phase 6.1 Stream D wiring~~ (task #136 — **CLOSED** 2026-04-19, PR #96)
**Closed**. `SealedBootstrap` consumes `ResilientConfigReader` + `GenerationSealedCache` + `StaleConfigFlag` end-to-end: bootstrap calls go through the timeout → retry → fallback-to-sealed pipeline; every central-DB success writes a fresh sealed snapshot so the next cache-miss has a known-good fallback; `StaleConfigFlag.IsStale` is now consumed by `HealthEndpointsHost.usingStaleConfig` so `/healthz` body reports reality.
Production activation: Program.cs switches `NodeBootstrap → SealedBootstrap` + constructs `OpcUaApplicationHost` with the `StaleConfigFlag` as an optional ctor parameter.
Remaining follow-ups (hardening, not release-blocking):
- A `HostedService` that polls `sp_GetCurrentGenerationForCluster` periodically so peer-published generations land in this node's cache without a restart.
- Richer snapshot payload via `sp_GetGenerationContent` so fallback can serve the full generation content (DriverInstance enumeration, ACL rows, etc.) from the sealed cache alone.
### ~~Redundancy — Phase 6.3 Streams A/C core~~ (tasks #145 + #147 — **CLOSED** 2026-04-19, PRs #9899)
**Closed**. The runtime orchestration layer now exists end-to-end:
- `RedundancyCoordinator` reads `ClusterNode` + peer list at startup (Stream A shipped in PR #98). Invariants enforced: 1-2 nodes (decision #83), unique ApplicationUri (#86), ≤1 Primary in Warm/Hot (#84). Startup fails fast on violation; runtime refresh logs + flips `IsTopologyValid=false` so the calculator falls to band 2 without tearing down.
- `RedundancyStatePublisher` orchestrates topology + apply lease + recovery state + peer reachability through `ServiceLevelCalculator` + emits `OnStateChanged` / `OnServerUriArrayChanged` edge-triggered events (Stream C core shipped in PR #99). The OPC UA `ServiceLevel` Byte variable + `ServerUriArray` String[] variable subscribe to these events.
Remaining Phase 6.3 surfaces (hardening, not release-blocking):
- `PeerHttpProbeLoop` + `PeerUaProbeLoop` HostedServices that poll the peer + write to `PeerReachabilityTracker` on each tick. Without these the publisher sees `PeerReachability.Unknown` for every peer → Isolated-Primary band (230) even when the peer is up. Safe default (retains authority) but not the full non-transparent-redundancy UX.
- OPC UA variable-node wiring layer: bind the `ServiceLevel` Byte node + `ServerUriArray` String[] node to the publisher's events via `BaseDataVariable.OnReadValue` / direct value push. Scoped follow-up on the Opc.Ua.Server stack integration.
- `sp_PublishGeneration` wraps its apply in `await using var lease = coordinator.BeginApplyLease(...)` so the `PrimaryMidApply` band (200) fires during actual publishes (task #148 part 2).
- Client interop matrix validation — Ignition / Kepware / Aveva OI Gateway (Stream F, task #150). Manual + doc-only work; doesn't block code ship.
### Remaining drivers (task #120)
AB CIP, AB Legacy, TwinCAT ADS, FOCAS drivers are planned but unshipped. Decision pending on whether these are release-blocking for v2 GA or can slip to a v2.1 follow-up.
## Nice-to-haves (not release-blocking)
- **Admin UI** — Phase 6.1 Stream E.2/E.3 (`/hosts` column refresh), Phase 6.2 Stream D (`RoleGrantsTab` + `AclsTab` Probe), Phase 6.3 Stream E (`RedundancyTab`), Phase 6.4 Streams A/B UI pieces, Stream C DiffViewer, Stream D `IdentificationFields.razor`. Tasks #134, #144, #149, #153, #155, #156, #157.
- **Background services** — Phase 6.1 Stream B.4 `ScheduledRecycleScheduler` HostedService (task #137), Phase 6.1 Stream A analyzer (task #135 — Roslyn analyzer asserting every capability surface routes through `CapabilityInvoker`).
- **Multi-host dispatch** — Phase 6.1 Stream A follow-up (task #135). Currently every driver gets a single pipeline keyed on `driver.DriverInstanceId`; multi-host drivers (Modbus with N PLCs) need per-PLC host resolution so failing PLCs trip per-PLC breakers without poisoning siblings. Decision #144 requires this but we haven't wired it yet.
## Running the release-readiness check
```bash
pwsh ./scripts/compliance/phase-6-all.ps1
```
This meta-runner invokes each `phase-6-N-compliance.ps1` script in sequence and reports an aggregate PASS/FAIL. It is the single-command verification that what we claim is shipped still compiles + tests pass + the plan-level invariants are still satisfied.
Exit 0 = every phase passes its compliance checks + no test-count regression.
## Release-readiness exit criteria
v2 GA requires all of the following:
- [ ] All four Phase 6.N compliance scripts exit 0.
- [ ] `dotnet test ZB.MOM.WW.OtOpcUa.slnx` passes with ≤ 1 known-flake failure.
- [ ] Release blockers listed above all closed (or consciously deferred to v2.1 with a written decision).
- [ ] Production deployment checklist (separate doc) signed off by Fleet Admin.
- [ ] At least one end-to-end integration run against the live Galaxy on the dev box succeeds.
- [ ] OPC UA conformance test (CTT or UA Compliance Test Tool) passes against the live endpoint.
- [ ] Non-transparent redundancy cutover validated with at least one production client (Ignition 8.3 recommended — see decision #85).
## Change log
- **2026-04-19** — Release blocker #3 **closed** (PRs #9899). Phase 6.3 Streams A + C core shipped: `ClusterTopologyLoader` + `RedundancyCoordinator` + `RedundancyStatePublisher` + `PeerReachabilityTracker`. Code-path release blockers all closed; remaining Phase 6.3 surfaces (peer-probe HostedServices, OPC UA variable-node binding, sp_PublishGeneration lease wrap, client interop matrix) are hardening follow-ups.
- **2026-04-19** — Release blocker #2 **closed** (PR #96). `SealedBootstrap` consumes `ResilientConfigReader` + `GenerationSealedCache` + `StaleConfigFlag`; `/healthz` now surfaces the stale flag. Remaining follow-ups (periodic poller + richer snapshot payload) downgraded to hardening.
- **2026-04-19** — Release blocker #1 **closed** (PR #94). `AuthorizationGate` wired into `DriverNodeManager` Read / Write / HistoryRead dispatch. Remaining Stream C surfaces (Browse / Subscribe / Alarm / Call + finer-grained scope resolution) downgraded to hardening follow-ups — no longer release-blocking.
- **2026-04-19** — Phase 6.4 data layer merged (PRs #9192). Phase 6 core complete. Capstone doc created.
- **2026-04-19** — Phase 6.3 core merged (PRs #8990). `ServiceLevelCalculator` + `RecoveryStateManager` + `ApplyLeaseRegistry` land as pure logic; coordinator / UA-node wiring / Admin UI / interop deferred.
- **2026-04-19** — Phase 6.2 core merged (PRs #8488). `AuthorizationGate` + `TriePermissionEvaluator` + `LdapGroupRoleMapping` land; dispatch wiring + Admin UI deferred.
- **2026-04-19** — Phase 6.1 shipped (PRs #7883). Polly resilience + Tier A/B/C stability + health endpoints + LiteDB generation-sealed cache + Admin `/hosts` data layer all live.

View File

@@ -0,0 +1,139 @@
<#
.SYNOPSIS
Phase 6.1 exit-gate compliance check. Each check either passes or records a
failure; non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.1 (Resilience & Observability runtime) completion. Checks
enumerated in `docs/v2/implementation/phase-6-1-resilience-and-observability.md`
§"Compliance Checks (run at exit gate)".
Runs a mix of file-presence checks, text-pattern sweeps over the committed
codebase, and a full `dotnet test` pass to exercise the invariants each
class encodes. Meant to be invoked from repo root.
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-1-compliance.ps1
Exit: 0 = all checks passed; non-zero = one or more FAILs
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
$repoRoot = (Resolve-Path (Join-Path $PSScriptRoot '..\..')).Path
function Assert-Pass {
param([string]$Check)
Write-Host " [PASS] $Check" -ForegroundColor Green
}
function Assert-Fail {
param([string]$Check, [string]$Reason)
Write-Host " [FAIL] $Check - $Reason" -ForegroundColor Red
$script:failures++
}
function Assert-Deferred {
param([string]$Check, [string]$FollowupPr)
Write-Host " [DEFERRED] $Check (follow-up: $FollowupPr)" -ForegroundColor Yellow
}
function Assert-FileExists {
param([string]$Check, [string]$RelPath)
$full = Join-Path $repoRoot $RelPath
if (Test-Path $full) { Assert-Pass "$Check ($RelPath)" }
else { Assert-Fail $Check "missing file: $RelPath" }
}
function Assert-TextFound {
param([string]$Check, [string]$Pattern, [string[]]$RelPaths)
foreach ($p in $RelPaths) {
$full = Join-Path $repoRoot $p
if (-not (Test-Path $full)) { continue }
if (Select-String -Path $full -Pattern $Pattern -Quiet) {
Assert-Pass "$Check (matched in $p)"
return
}
}
Assert-Fail $Check "pattern '$Pattern' not found in any of: $($RelPaths -join ', ')"
}
Write-Host ""
Write-Host "=== Phase 6.1 compliance - Resilience & Observability runtime ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A - Resilience layer"
Assert-FileExists "Pipeline builder present" "src/ZB.MOM.WW.OtOpcUa.Core/Resilience/DriverResiliencePipelineBuilder.cs"
Assert-FileExists "CapabilityInvoker present" "src/ZB.MOM.WW.OtOpcUa.Core/Resilience/CapabilityInvoker.cs"
Assert-FileExists "WriteIdempotentAttribute present" "src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/WriteIdempotentAttribute.cs"
Assert-TextFound "Pipeline key includes HostName (per-device isolation)" "PipelineKey\(.+HostName" @("src/ZB.MOM.WW.OtOpcUa.Core/Resilience/DriverResiliencePipelineBuilder.cs")
Assert-TextFound "OnReadValue routes through invoker" "DriverCapability\.Read," @("src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs")
Assert-TextFound "OnWriteValue routes through invoker" "ExecuteWriteAsync" @("src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs")
Assert-TextFound "HistoryRead routes through invoker" "DriverCapability\.HistoryRead" @("src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs")
Assert-FileExists "Galaxy supervisor CircuitBreaker preserved" "src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/Supervisor/CircuitBreaker.cs"
Assert-FileExists "Galaxy supervisor Backoff preserved" "src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/Supervisor/Backoff.cs"
Write-Host ""
Write-Host "Stream B - Tier A/B/C runtime"
Assert-FileExists "DriverTier enum present" "src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverTier.cs"
Assert-TextFound "DriverTypeMetadata requires Tier" "DriverTier Tier" @("src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/DriverTypeRegistry.cs")
Assert-FileExists "MemoryTracking present" "src/ZB.MOM.WW.OtOpcUa.Core/Stability/MemoryTracking.cs"
Assert-FileExists "MemoryRecycle present" "src/ZB.MOM.WW.OtOpcUa.Core/Stability/MemoryRecycle.cs"
Assert-TextFound "MemoryRecycle is Tier C gated" "_tier == DriverTier\.C" @("src/ZB.MOM.WW.OtOpcUa.Core/Stability/MemoryRecycle.cs")
Assert-FileExists "ScheduledRecycleScheduler present" "src/ZB.MOM.WW.OtOpcUa.Core/Stability/ScheduledRecycleScheduler.cs"
Assert-TextFound "Scheduler ctor rejects Tier A/B" "tier != DriverTier\.C" @("src/ZB.MOM.WW.OtOpcUa.Core/Stability/ScheduledRecycleScheduler.cs")
Assert-FileExists "WedgeDetector present" "src/ZB.MOM.WW.OtOpcUa.Core/Stability/WedgeDetector.cs"
Assert-TextFound "WedgeDetector is demand-aware" "HasPendingWork" @("src/ZB.MOM.WW.OtOpcUa.Core/Stability/WedgeDetector.cs")
Write-Host ""
Write-Host "Stream C - Health + logging"
Assert-FileExists "DriverHealthReport present" "src/ZB.MOM.WW.OtOpcUa.Core/Observability/DriverHealthReport.cs"
Assert-FileExists "HealthEndpointsHost present" "src/ZB.MOM.WW.OtOpcUa.Server/Observability/HealthEndpointsHost.cs"
Assert-TextFound "State matrix: Healthy = 200" "ReadinessVerdict\.Healthy => 200" @("src/ZB.MOM.WW.OtOpcUa.Core/Observability/DriverHealthReport.cs")
Assert-TextFound "State matrix: Faulted = 503" "ReadinessVerdict\.Faulted => 503" @("src/ZB.MOM.WW.OtOpcUa.Core/Observability/DriverHealthReport.cs")
Assert-FileExists "LogContextEnricher present" "src/ZB.MOM.WW.OtOpcUa.Core/Observability/LogContextEnricher.cs"
Assert-TextFound "Enricher pushes DriverInstanceId property" "DriverInstanceId" @("src/ZB.MOM.WW.OtOpcUa.Core/Observability/LogContextEnricher.cs")
Assert-TextFound "JSON sink opt-in via Serilog:WriteJson" "Serilog:WriteJson" @("src/ZB.MOM.WW.OtOpcUa.Server/Program.cs")
Write-Host ""
Write-Host "Stream D - LiteDB generation-sealed cache"
Assert-FileExists "GenerationSealedCache present" "src/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs"
Assert-TextFound "Sealed files marked ReadOnly" "FileAttributes\.ReadOnly" @("src/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs")
Assert-TextFound "Corruption fails closed with GenerationCacheUnavailableException" "GenerationCacheUnavailableException" @("src/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs")
Assert-FileExists "ResilientConfigReader present" "src/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/ResilientConfigReader.cs"
Assert-FileExists "StaleConfigFlag present" "src/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/StaleConfigFlag.cs"
Write-Host ""
Write-Host "Stream E - Admin /hosts (data layer)"
Assert-FileExists "DriverInstanceResilienceStatus entity" "src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/DriverInstanceResilienceStatus.cs"
Assert-FileExists "DriverResilienceStatusTracker present" "src/ZB.MOM.WW.OtOpcUa.Core/Resilience/DriverResilienceStatusTracker.cs"
Assert-Deferred "FleetStatusHub SignalR push + Blazor /hosts column refresh" "Phase 6.1 Stream E.2/E.3 visual-compliance follow-up"
Write-Host ""
Write-Host "Cross-cutting"
Write-Host " Running full solution test suite..." -ForegroundColor DarkGray
$prevPref = $ErrorActionPreference
$ErrorActionPreference = 'Continue'
$testOutput = & dotnet test (Join-Path $repoRoot 'ZB.MOM.WW.OtOpcUa.slnx') --nologo 2>&1
$ErrorActionPreference = $prevPref
$passLine = $testOutput | Select-String 'Passed:\s+(\d+)' -AllMatches
$failLine = $testOutput | Select-String 'Failed:\s+(\d+)' -AllMatches
$passCount = 0; foreach ($m in $passLine.Matches) { $passCount += [int]$m.Groups[1].Value }
$failCount = 0; foreach ($m in $failLine.Matches) { $failCount += [int]$m.Groups[1].Value }
$baseline = 906
if ($passCount -ge $baseline) { Assert-Pass "No test-count regression ($passCount >= $baseline baseline)" }
else { Assert-Fail "Test-count regression" "passed $passCount < baseline $baseline" }
# Pre-existing Client.CLI Subscribe flake tracked separately; exit gate tolerates a single
# known flake but flags any NEW failures.
if ($failCount -le 1) { Assert-Pass "No new failing tests (pre-existing CLI flake tolerated)" }
else { Assert-Fail "New failing tests" "$failCount failures > 1 tolerated" }
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.1 compliance: PASS" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.1 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -0,0 +1,147 @@
<#
.SYNOPSIS
Phase 6.2 exit-gate compliance check. Each check either passes or records a
failure; non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.2 (Authorization runtime) completion. Checks enumerated
in `docs/v2/implementation/phase-6-2-authorization-runtime.md`
§"Compliance Checks (run at exit gate)".
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-2-compliance.ps1
Exit: 0 = all checks passed; non-zero = one or more FAILs
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
$repoRoot = (Resolve-Path (Join-Path $PSScriptRoot '..\..')).Path
function Assert-Pass {
param([string]$Check)
Write-Host " [PASS] $Check" -ForegroundColor Green
}
function Assert-Fail {
param([string]$Check, [string]$Reason)
Write-Host " [FAIL] $Check - $Reason" -ForegroundColor Red
$script:failures++
}
function Assert-Deferred {
param([string]$Check, [string]$FollowupPr)
Write-Host " [DEFERRED] $Check (follow-up: $FollowupPr)" -ForegroundColor Yellow
}
function Assert-FileExists {
param([string]$Check, [string]$RelPath)
$full = Join-Path $repoRoot $RelPath
if (Test-Path $full) { Assert-Pass "$Check ($RelPath)" }
else { Assert-Fail $Check "missing file: $RelPath" }
}
function Assert-TextFound {
param([string]$Check, [string]$Pattern, [string[]]$RelPaths)
foreach ($p in $RelPaths) {
$full = Join-Path $repoRoot $p
if (-not (Test-Path $full)) { continue }
if (Select-String -Path $full -Pattern $Pattern -Quiet) {
Assert-Pass "$Check (matched in $p)"
return
}
}
Assert-Fail $Check "pattern '$Pattern' not found in any of: $($RelPaths -join ', ')"
}
function Assert-TextAbsent {
param([string]$Check, [string]$Pattern, [string[]]$RelPaths)
foreach ($p in $RelPaths) {
$full = Join-Path $repoRoot $p
if (-not (Test-Path $full)) { continue }
if (Select-String -Path $full -Pattern $Pattern -Quiet) {
Assert-Fail $Check "pattern '$Pattern' unexpectedly found in $p"
return
}
}
Assert-Pass "$Check (pattern '$Pattern' absent from: $($RelPaths -join ', '))"
}
Write-Host ""
Write-Host "=== Phase 6.2 compliance - Authorization runtime ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A - LdapGroupRoleMapping (control plane)"
Assert-FileExists "LdapGroupRoleMapping entity present" "src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/LdapGroupRoleMapping.cs"
Assert-FileExists "AdminRole enum present" "src/ZB.MOM.WW.OtOpcUa.Configuration/Enums/AdminRole.cs"
Assert-FileExists "ILdapGroupRoleMappingService present" "src/ZB.MOM.WW.OtOpcUa.Configuration/Services/ILdapGroupRoleMappingService.cs"
Assert-FileExists "LdapGroupRoleMappingService impl present" "src/ZB.MOM.WW.OtOpcUa.Configuration/Services/LdapGroupRoleMappingService.cs"
Assert-TextFound "Write-time invariant: IsSystemWide XOR ClusterId" "IsSystemWide=true requires ClusterId" @("src/ZB.MOM.WW.OtOpcUa.Configuration/Services/LdapGroupRoleMappingService.cs")
Assert-FileExists "EF migration for LdapGroupRoleMapping" "src/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/20260419131444_AddLdapGroupRoleMapping.cs"
Write-Host ""
Write-Host "Stream B - Permission-trie evaluator (Core.Authorization)"
Assert-FileExists "OpcUaOperation enum present" "src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/OpcUaOperation.cs"
Assert-FileExists "NodeScope record present" "src/ZB.MOM.WW.OtOpcUa.Core/Authorization/NodeScope.cs"
Assert-FileExists "AuthorizationDecision tri-state" "src/ZB.MOM.WW.OtOpcUa.Core/Authorization/AuthorizationDecision.cs"
Assert-TextFound "Verdict has Denied member (reserved for v2.1)" "Denied" @("src/ZB.MOM.WW.OtOpcUa.Core/Authorization/AuthorizationDecision.cs")
Assert-FileExists "IPermissionEvaluator present" "src/ZB.MOM.WW.OtOpcUa.Core/Authorization/IPermissionEvaluator.cs"
Assert-FileExists "PermissionTrie present" "src/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrie.cs"
Assert-FileExists "PermissionTrieBuilder present" "src/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieBuilder.cs"
Assert-FileExists "PermissionTrieCache present" "src/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs"
Assert-TextFound "Cache keyed on GenerationId" "GenerationId" @("src/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs")
Assert-FileExists "UserAuthorizationState present" "src/ZB.MOM.WW.OtOpcUa.Core/Authorization/UserAuthorizationState.cs"
Assert-TextFound "MembershipFreshnessInterval default 15 min" "FromMinutes\(15\)" @("src/ZB.MOM.WW.OtOpcUa.Core/Authorization/UserAuthorizationState.cs")
Assert-TextFound "AuthCacheMaxStaleness default 5 min" "FromMinutes\(5\)" @("src/ZB.MOM.WW.OtOpcUa.Core/Authorization/UserAuthorizationState.cs")
Assert-FileExists "TriePermissionEvaluator impl present" "src/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs"
Assert-TextFound "HistoryRead maps to NodePermissions.HistoryRead" "HistoryRead.+NodePermissions\.HistoryRead" @("src/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs")
Write-Host ""
Write-Host "Control/data-plane separation (decision #150)"
Assert-TextAbsent "Evaluator has zero references to LdapGroupRoleMapping" "LdapGroupRoleMapping" @(
"src/ZB.MOM.WW.OtOpcUa.Core/Authorization/TriePermissionEvaluator.cs",
"src/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrie.cs",
"src/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieBuilder.cs",
"src/ZB.MOM.WW.OtOpcUa.Core/Authorization/PermissionTrieCache.cs",
"src/ZB.MOM.WW.OtOpcUa.Core/Authorization/IPermissionEvaluator.cs")
Write-Host ""
Write-Host "Stream C foundation (dispatch-wiring gate)"
Assert-FileExists "ILdapGroupsBearer present" "src/ZB.MOM.WW.OtOpcUa.Server/Security/ILdapGroupsBearer.cs"
Assert-FileExists "AuthorizationGate present" "src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs"
Assert-TextFound "Gate has StrictMode knob" "StrictMode" @("src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs")
Assert-Deferred "DriverNodeManager dispatch-path wiring (11 surfaces)" "Phase 6.2 Stream C follow-up task #143"
Write-Host ""
Write-Host "Stream D data layer (ValidatedNodeAclAuthoringService)"
Assert-FileExists "ValidatedNodeAclAuthoringService present" "src/ZB.MOM.WW.OtOpcUa.Admin/Services/ValidatedNodeAclAuthoringService.cs"
Assert-TextFound "InvalidNodeAclGrantException present" "class InvalidNodeAclGrantException" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/ValidatedNodeAclAuthoringService.cs")
Assert-TextFound "Rejects None permissions" "Permission set cannot be None" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/ValidatedNodeAclAuthoringService.cs")
Assert-Deferred "RoleGrantsTab + AclsTab Probe-this-permission + SignalR invalidation + draft diff section" "Phase 6.2 Stream D follow-up task #144"
Write-Host ""
Write-Host "Cross-cutting"
Write-Host " Running full solution test suite..." -ForegroundColor DarkGray
$prevPref = $ErrorActionPreference
$ErrorActionPreference = 'Continue'
$testOutput = & dotnet test (Join-Path $repoRoot 'ZB.MOM.WW.OtOpcUa.slnx') --nologo 2>&1
$ErrorActionPreference = $prevPref
$passLine = $testOutput | Select-String 'Passed:\s+(\d+)' -AllMatches
$failLine = $testOutput | Select-String 'Failed:\s+(\d+)' -AllMatches
$passCount = 0; foreach ($m in $passLine.Matches) { $passCount += [int]$m.Groups[1].Value }
$failCount = 0; foreach ($m in $failLine.Matches) { $failCount += [int]$m.Groups[1].Value }
$baseline = 1042
if ($passCount -ge $baseline) { Assert-Pass "No test-count regression ($passCount >= $baseline pre-Phase-6.2 baseline)" }
else { Assert-Fail "Test-count regression" "passed $passCount < baseline $baseline" }
if ($failCount -le 1) { Assert-Pass "No new failing tests (pre-existing CLI flake tolerated)" }
else { Assert-Fail "New failing tests" "$failCount failures > 1 tolerated" }
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.2 compliance: PASS" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.2 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -0,0 +1,110 @@
<#
.SYNOPSIS
Phase 6.3 exit-gate compliance check. Each check either passes or records a
failure; non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.3 (Redundancy runtime) completion. Checks enumerated in
`docs/v2/implementation/phase-6-3-redundancy-runtime.md`
§"Compliance Checks (run at exit gate)".
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-3-compliance.ps1
Exit: 0 = all checks passed; non-zero = one or more FAILs
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
$repoRoot = (Resolve-Path (Join-Path $PSScriptRoot '..\..')).Path
function Assert-Pass { param([string]$C) Write-Host " [PASS] $C" -ForegroundColor Green }
function Assert-Fail { param([string]$C, [string]$R) Write-Host " [FAIL] $C - $R" -ForegroundColor Red; $script:failures++ }
function Assert-Deferred { param([string]$C, [string]$P) Write-Host " [DEFERRED] $C (follow-up: $P)" -ForegroundColor Yellow }
function Assert-FileExists {
param([string]$C, [string]$P)
if (Test-Path (Join-Path $repoRoot $P)) { Assert-Pass "$C ($P)" }
else { Assert-Fail $C "missing file: $P" }
}
function Assert-TextFound {
param([string]$C, [string]$Pat, [string[]]$Paths)
foreach ($p in $Paths) {
$full = Join-Path $repoRoot $p
if (-not (Test-Path $full)) { continue }
if (Select-String -Path $full -Pattern $Pat -Quiet) {
Assert-Pass "$C (matched in $p)"
return
}
}
Assert-Fail $C "pattern '$Pat' not found in any of: $($Paths -join ', ')"
}
Write-Host ""
Write-Host "=== Phase 6.3 compliance - Redundancy runtime ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream B - ServiceLevel 8-state matrix (decision #154)"
Assert-FileExists "ServiceLevelCalculator present" "src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs"
Assert-FileExists "ServiceLevelBand enum present" "src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs"
Assert-TextFound "Maintenance = 0 (reserved per OPC UA Part 5)" "Maintenance\s*=\s*0" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound "NoData = 1 (reserved per OPC UA Part 5)" "NoData\s*=\s*1" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound "InvalidTopology = 2 (detected-inconsistency band)" "InvalidTopology\s*=\s*2" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound "AuthoritativePrimary = 255" "AuthoritativePrimary\s*=\s*255" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound "IsolatedPrimary = 230 (retains authority)" "IsolatedPrimary\s*=\s*230" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound "PrimaryMidApply = 200" "PrimaryMidApply\s*=\s*200" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound "RecoveringPrimary = 180" "RecoveringPrimary\s*=\s*180" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound "AuthoritativeBackup = 100" "AuthoritativeBackup\s*=\s*100" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound "IsolatedBackup = 80 (does NOT auto-promote)" "IsolatedBackup\s*=\s*80" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound "BackupMidApply = 50" "BackupMidApply\s*=\s*50" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Assert-TextFound "RecoveringBackup = 30" "RecoveringBackup\s*=\s*30" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ServiceLevelCalculator.cs")
Write-Host ""
Write-Host "Stream B - RecoveryStateManager"
Assert-FileExists "RecoveryStateManager present" "src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/RecoveryStateManager.cs"
Assert-TextFound "Dwell + publish-witness gate" "_witnessed" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/RecoveryStateManager.cs")
Assert-TextFound "Default dwell 60 s" "FromSeconds\(60\)" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/RecoveryStateManager.cs")
Write-Host ""
Write-Host "Stream D - Apply-lease registry (decision #162)"
Assert-FileExists "ApplyLeaseRegistry present" "src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ApplyLeaseRegistry.cs"
Assert-TextFound "BeginApplyLease returns IAsyncDisposable" "IAsyncDisposable" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ApplyLeaseRegistry.cs")
Assert-TextFound "Lease key includes PublishRequestId" "PublishRequestId" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ApplyLeaseRegistry.cs")
Assert-TextFound "Watchdog PruneStale present" "PruneStale" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ApplyLeaseRegistry.cs")
Assert-TextFound "Default ApplyMaxDuration 10 min" "FromMinutes\(10\)" @("src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/ApplyLeaseRegistry.cs")
Write-Host ""
Write-Host "Deferred surfaces"
Assert-Deferred "Stream A - RedundancyCoordinator cluster-topology loader" "task #145"
Assert-Deferred "Stream C - OPC UA node wiring (ServiceLevel + ServerUriArray + RedundancySupport)" "task #147"
Assert-Deferred "Stream E - Admin RedundancyTab + OpenTelemetry metrics + SignalR" "task #149"
Assert-Deferred "Stream F - Client interop matrix + Galaxy MXAccess failover" "task #150"
Assert-Deferred "sp_PublishGeneration rejects Transparent mode pre-publish" "task #148 part 2 (SQL-side validator)"
Write-Host ""
Write-Host "Cross-cutting"
Write-Host " Running full solution test suite..." -ForegroundColor DarkGray
$prevPref = $ErrorActionPreference
$ErrorActionPreference = 'Continue'
$testOutput = & dotnet test (Join-Path $repoRoot 'ZB.MOM.WW.OtOpcUa.slnx') --nologo 2>&1
$ErrorActionPreference = $prevPref
$passLine = $testOutput | Select-String 'Passed:\s+(\d+)' -AllMatches
$failLine = $testOutput | Select-String 'Failed:\s+(\d+)' -AllMatches
$passCount = 0; foreach ($m in $passLine.Matches) { $passCount += [int]$m.Groups[1].Value }
$failCount = 0; foreach ($m in $failLine.Matches) { $failCount += [int]$m.Groups[1].Value }
$baseline = 1097
if ($passCount -ge $baseline) { Assert-Pass "No test-count regression ($passCount >= $baseline pre-Phase-6.3 baseline)" }
else { Assert-Fail "Test-count regression" "passed $passCount < baseline $baseline" }
if ($failCount -le 1) { Assert-Pass "No new failing tests (pre-existing CLI flake tolerated)" }
else { Assert-Fail "New failing tests" "$failCount failures > 1 tolerated" }
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.3 compliance: PASS" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.3 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -0,0 +1,96 @@
<#
.SYNOPSIS
Phase 6.4 exit-gate compliance check. Each check either passes or records a
failure; non-zero exit = fail.
.DESCRIPTION
Validates Phase 6.4 (Admin UI completion) progress. Checks enumerated in
`docs/v2/implementation/phase-6-4-admin-ui-completion.md`
§"Compliance Checks (run at exit gate)".
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-4-compliance.ps1
Exit: 0 = all checks passed; non-zero = one or more FAILs
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Stop'
$script:failures = 0
$repoRoot = (Resolve-Path (Join-Path $PSScriptRoot '..\..')).Path
function Assert-Pass { param([string]$C) Write-Host " [PASS] $C" -ForegroundColor Green }
function Assert-Fail { param([string]$C, [string]$R) Write-Host " [FAIL] $C - $R" -ForegroundColor Red; $script:failures++ }
function Assert-Deferred { param([string]$C, [string]$P) Write-Host " [DEFERRED] $C (follow-up: $P)" -ForegroundColor Yellow }
function Assert-FileExists {
param([string]$C, [string]$P)
if (Test-Path (Join-Path $repoRoot $P)) { Assert-Pass "$C ($P)" }
else { Assert-Fail $C "missing file: $P" }
}
function Assert-TextFound {
param([string]$C, [string]$Pat, [string[]]$Paths)
foreach ($p in $Paths) {
$full = Join-Path $repoRoot $p
if (-not (Test-Path $full)) { continue }
if (Select-String -Path $full -Pattern $Pat -Quiet) {
Assert-Pass "$C (matched in $p)"
return
}
}
Assert-Fail $C "pattern '$Pat' not found in any of: $($Paths -join ', ')"
}
Write-Host ""
Write-Host "=== Phase 6.4 compliance - Admin UI completion ===" -ForegroundColor Cyan
Write-Host ""
Write-Host "Stream A data layer - UnsImpactAnalyzer"
Assert-FileExists "UnsImpactAnalyzer present" "src/ZB.MOM.WW.OtOpcUa.Admin/Services/UnsImpactAnalyzer.cs"
Assert-TextFound "DraftRevisionToken present" "record DraftRevisionToken" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/UnsImpactAnalyzer.cs")
Assert-TextFound "Cross-cluster move rejected per decision #82" "CrossClusterMoveRejectedException" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/UnsImpactAnalyzer.cs")
Assert-TextFound "LineMove + AreaRename + LineMerge covered" "UnsMoveKind\.LineMerge" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/UnsImpactAnalyzer.cs")
Write-Host ""
Write-Host "Stream B data layer - EquipmentCsvImporter"
Assert-FileExists "EquipmentCsvImporter present" "src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs"
Assert-TextFound "CSV header version marker '# OtOpcUaCsv v1'" "OtOpcUaCsv v1" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
Assert-TextFound "Required columns match decision #117" "ZTag.+MachineCode.+SAPID.+EquipmentId.+EquipmentUuid" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
Assert-TextFound "Optional columns match decision #139 (Manufacturer)" "Manufacturer" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
Assert-TextFound "Optional columns include DeviceManualUri" "DeviceManualUri" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
Assert-TextFound "Rejects duplicate ZTag within file" "Duplicate ZTag" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
Assert-TextFound "Rejects unknown column" "unknown column" @("src/ZB.MOM.WW.OtOpcUa.Admin/Services/EquipmentCsvImporter.cs")
Write-Host ""
Write-Host "Deferred surfaces"
Assert-Deferred "Stream A UI - UnsTab MudBlazor drag/drop + 409 modal + Playwright" "task #153"
Assert-Deferred "Stream B follow-up - EquipmentImportBatch staging + FinaliseImportBatch + CSV import UI" "task #155"
Assert-Deferred "Stream C - DiffViewer refactor + 6 section plugins + 1000-row cap" "task #156"
Assert-Deferred "Stream D - IdentificationFields.razor + DriverNodeManager OPC 40010 sub-folder" "task #157"
Write-Host ""
Write-Host "Cross-cutting"
Write-Host " Running full solution test suite..." -ForegroundColor DarkGray
$prevPref = $ErrorActionPreference
$ErrorActionPreference = 'Continue'
$testOutput = & dotnet test (Join-Path $repoRoot 'ZB.MOM.WW.OtOpcUa.slnx') --nologo 2>&1
$ErrorActionPreference = $prevPref
$passLine = $testOutput | Select-String 'Passed:\s+(\d+)' -AllMatches
$failLine = $testOutput | Select-String 'Failed:\s+(\d+)' -AllMatches
$passCount = 0; foreach ($m in $passLine.Matches) { $passCount += [int]$m.Groups[1].Value }
$failCount = 0; foreach ($m in $failLine.Matches) { $failCount += [int]$m.Groups[1].Value }
$baseline = 1137
if ($passCount -ge $baseline) { Assert-Pass "No test-count regression ($passCount >= $baseline pre-Phase-6.4 baseline)" }
else { Assert-Fail "Test-count regression" "passed $passCount < baseline $baseline" }
if ($failCount -le 1) { Assert-Pass "No new failing tests (pre-existing CLI flake tolerated)" }
else { Assert-Fail "New failing tests" "$failCount failures > 1 tolerated" }
Write-Host ""
if ($script:failures -eq 0) {
Write-Host "Phase 6.4 compliance: PASS" -ForegroundColor Green
exit 0
}
Write-Host "Phase 6.4 compliance: $script:failures FAIL(s)" -ForegroundColor Red
exit 1

View File

@@ -0,0 +1,77 @@
<#
.SYNOPSIS
Meta-runner that invokes every per-phase Phase 6.x compliance script and
reports an aggregate verdict.
.DESCRIPTION
Runs phase-6-1-compliance.ps1, phase-6-2, phase-6-3, phase-6-4 in sequence.
Each sub-script returns its own exit code; this wrapper aggregates them.
Useful before a v2 release tag + as the `dotnet test` companion in CI.
.NOTES
Usage: pwsh ./scripts/compliance/phase-6-all.ps1
Exit: 0 = every phase passed; 1 = one or more phases failed
#>
[CmdletBinding()]
param()
$ErrorActionPreference = 'Continue'
$phases = @(
@{ Name = 'Phase 6.1 - Resilience & Observability'; Script = 'phase-6-1-compliance.ps1' },
@{ Name = 'Phase 6.2 - Authorization runtime'; Script = 'phase-6-2-compliance.ps1' },
@{ Name = 'Phase 6.3 - Redundancy runtime'; Script = 'phase-6-3-compliance.ps1' },
@{ Name = 'Phase 6.4 - Admin UI completion'; Script = 'phase-6-4-compliance.ps1' }
)
$results = @()
$startedAt = Get-Date
foreach ($phase in $phases) {
Write-Host ""
Write-Host ""
Write-Host "=============================================================" -ForegroundColor DarkGray
Write-Host ("Running {0}" -f $phase.Name) -ForegroundColor Cyan
Write-Host "=============================================================" -ForegroundColor DarkGray
$scriptPath = Join-Path $PSScriptRoot $phase.Script
if (-not (Test-Path $scriptPath)) {
Write-Host (" [MISSING] {0}" -f $phase.Script) -ForegroundColor Red
$results += @{ Name = $phase.Name; Exit = 2 }
continue
}
# Invoke each sub-script in its own powershell.exe process so its local
# $ErrorActionPreference + exit-code semantics can't interfere with the meta-runner's
# state. Slower (one process spawn per phase) but makes aggregate PASS/FAIL match
# standalone runs exactly.
& powershell.exe -NoProfile -ExecutionPolicy Bypass -File $scriptPath
$exitCode = $LASTEXITCODE
$results += @{ Name = $phase.Name; Exit = $exitCode }
}
$elapsed = (Get-Date) - $startedAt
Write-Host ""
Write-Host ""
Write-Host "=============================================================" -ForegroundColor DarkGray
Write-Host "Phase 6 compliance aggregate" -ForegroundColor Cyan
Write-Host "=============================================================" -ForegroundColor DarkGray
$totalFailures = 0
foreach ($r in $results) {
$colour = if ($r.Exit -eq 0) { 'Green' } else { 'Red' }
$tag = if ($r.Exit -eq 0) { 'PASS' } else { "FAIL (exit=$($r.Exit))" }
Write-Host (" [{0}] {1}" -f $tag, $r.Name) -ForegroundColor $colour
if ($r.Exit -ne 0) { $totalFailures++ }
}
Write-Host ""
Write-Host ("Elapsed: {0:N1} s" -f $elapsed.TotalSeconds) -ForegroundColor DarkGray
if ($totalFailures -eq 0) {
Write-Host "Phase 6 aggregate: PASS" -ForegroundColor Green
exit 0
}
Write-Host ("Phase 6 aggregate: {0} phase(s) FAILED" -f $totalFailures) -ForegroundColor Red
exit 1

View File

@@ -10,6 +10,7 @@
<li class="nav-item"><a class="nav-link text-light" href="/clusters">Clusters</a></li>
<li class="nav-item"><a class="nav-link text-light" href="/reservations">Reservations</a></li>
<li class="nav-item"><a class="nav-link text-light" href="/certificates">Certificates</a></li>
<li class="nav-item"><a class="nav-link text-light" href="/role-grants">Role grants</a></li>
</ul>
<div class="mt-5">

View File

@@ -1,7 +1,13 @@
@using Microsoft.AspNetCore.SignalR.Client
@using ZB.MOM.WW.OtOpcUa.Admin.Hubs
@using ZB.MOM.WW.OtOpcUa.Admin.Services
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@using ZB.MOM.WW.OtOpcUa.Configuration.Enums
@using ZB.MOM.WW.OtOpcUa.Core.Authorization
@inject NodeAclService AclSvc
@inject PermissionProbeService ProbeSvc
@inject NavigationManager Nav
@implements IAsyncDisposable
<div class="d-flex justify-content-between mb-3">
<h4>Access-control grants</h4>
@@ -29,6 +35,95 @@ else
</table>
}
@* Probe-this-permission — task #196 slice 1 *@
<div class="card mt-4 mb-3">
<div class="card-header">
<strong>Probe this permission</strong>
<span class="small text-muted ms-2">
Ask the trie "if LDAP group X asks for permission Y on node Z, would it be granted?" —
answers the same way the live server does at request time.
</span>
</div>
<div class="card-body">
<div class="row g-2 align-items-end">
<div class="col-md-3">
<label class="form-label small">LDAP group</label>
<input class="form-control form-control-sm" @bind="_probeGroup" placeholder="cn=fleet-admin,…"/>
</div>
<div class="col-md-2">
<label class="form-label small">Namespace</label>
<input class="form-control form-control-sm" @bind="_probeNamespaceId" placeholder="ns-1"/>
</div>
<div class="col-md-2">
<label class="form-label small">UnsArea</label>
<input class="form-control form-control-sm" @bind="_probeUnsAreaId"/>
</div>
<div class="col-md-2">
<label class="form-label small">UnsLine</label>
<input class="form-control form-control-sm" @bind="_probeUnsLineId"/>
</div>
<div class="col-md-1">
<label class="form-label small">Equipment</label>
<input class="form-control form-control-sm" @bind="_probeEquipmentId"/>
</div>
<div class="col-md-1">
<label class="form-label small">Tag</label>
<input class="form-control form-control-sm" @bind="_probeTagId"/>
</div>
<div class="col-md-1">
<label class="form-label small">Permission</label>
<select class="form-select form-select-sm" @bind="_probePermission">
@foreach (var p in Enum.GetValues<NodePermissions>())
{
if (p == NodePermissions.None) continue;
<option value="@p">@p</option>
}
</select>
</div>
</div>
<div class="mt-3">
<button class="btn btn-sm btn-outline-primary" @onclick="RunProbeAsync" disabled="@_probing">Probe</button>
@if (_probeResult is not null)
{
<span class="ms-3">
@if (_probeResult.Granted)
{
<span class="badge bg-success">Granted</span>
}
else
{
<span class="badge bg-danger">Denied</span>
}
<span class="small ms-2">
Required <code>@_probeResult.Required</code>,
Effective <code>@_probeResult.Effective</code>
</span>
</span>
}
</div>
@if (_probeResult is not null && _probeResult.Matches.Count > 0)
{
<table class="table table-sm mt-3 mb-0">
<thead><tr><th>LDAP group matched</th><th>Level</th><th>Flags contributed</th></tr></thead>
<tbody>
@foreach (var m in _probeResult.Matches)
{
<tr>
<td><code>@m.LdapGroup</code></td>
<td>@m.Scope</td>
<td><code>@m.PermissionFlags</code></td>
</tr>
}
</tbody>
</table>
}
else if (_probeResult is not null)
{
<div class="mt-2 small text-muted">No matching grants for this (group, scope) — effective permission is <code>None</code>.</div>
}
</div>
</div>
@if (_showForm)
{
<div class="card">
@@ -80,6 +175,64 @@ else
private string _preset = "Read";
private string? _error;
// Probe-this-permission state
private string _probeGroup = string.Empty;
private string _probeNamespaceId = string.Empty;
private string _probeUnsAreaId = string.Empty;
private string _probeUnsLineId = string.Empty;
private string _probeEquipmentId = string.Empty;
private string _probeTagId = string.Empty;
private NodePermissions _probePermission = NodePermissions.Read;
private PermissionProbeResult? _probeResult;
private bool _probing;
private async Task RunProbeAsync()
{
if (string.IsNullOrWhiteSpace(_probeGroup)) { _probeResult = null; return; }
_probing = true;
try
{
var scope = new NodeScope
{
ClusterId = ClusterId,
NamespaceId = NullIfBlank(_probeNamespaceId),
UnsAreaId = NullIfBlank(_probeUnsAreaId),
UnsLineId = NullIfBlank(_probeUnsLineId),
EquipmentId = NullIfBlank(_probeEquipmentId),
TagId = NullIfBlank(_probeTagId),
Kind = NodeHierarchyKind.Equipment,
};
_probeResult = await ProbeSvc.ProbeAsync(GenerationId, _probeGroup.Trim(), scope, _probePermission, CancellationToken.None);
}
finally { _probing = false; }
}
private static string? NullIfBlank(string s) => string.IsNullOrWhiteSpace(s) ? null : s;
private HubConnection? _hub;
protected override async Task OnAfterRenderAsync(bool firstRender)
{
if (!firstRender || _hub is not null) return;
_hub = new HubConnectionBuilder()
.WithUrl(Nav.ToAbsoluteUri("/hubs/fleet-status"))
.WithAutomaticReconnect()
.Build();
_hub.On<NodeAclChangedMessage>("NodeAclChanged", async msg =>
{
if (msg.ClusterId != ClusterId || msg.GenerationId != GenerationId) return;
_acls = await AclSvc.ListAsync(GenerationId, CancellationToken.None);
await InvokeAsync(StateHasChanged);
});
await _hub.StartAsync();
await _hub.SendAsync("SubscribeCluster", ClusterId);
}
public async ValueTask DisposeAsync()
{
if (_hub is not null) { await _hub.DisposeAsync(); _hub = null; }
}
protected override async Task OnParametersSetAsync() =>
_acls = await AclSvc.ListAsync(GenerationId, CancellationToken.None);

View File

@@ -52,6 +52,7 @@ else
<li class="nav-item"><button class="nav-link @Tab("namespaces")" @onclick='() => _tab = "namespaces"'>Namespaces</button></li>
<li class="nav-item"><button class="nav-link @Tab("drivers")" @onclick='() => _tab = "drivers"'>Drivers</button></li>
<li class="nav-item"><button class="nav-link @Tab("acls")" @onclick='() => _tab = "acls"'>ACLs</button></li>
<li class="nav-item"><button class="nav-link @Tab("redundancy")" @onclick='() => _tab = "redundancy"'>Redundancy</button></li>
<li class="nav-item"><button class="nav-link @Tab("audit")" @onclick='() => _tab = "audit"'>Audit</button></li>
</ul>
@@ -92,6 +93,10 @@ else
{
<AclsTab GenerationId="@_currentDraft.GenerationId" ClusterId="@ClusterId"/>
}
else if (_tab == "redundancy")
{
<RedundancyTab ClusterId="@ClusterId"/>
}
else if (_tab == "audit")
{
<AuditTab ClusterId="@ClusterId"/>

View File

@@ -0,0 +1,90 @@
@using ZB.MOM.WW.OtOpcUa.Admin.Services
@* Per-section diff renderer — the base used by DiffViewer for every known TableName. Caps
output at RowCap rows so a pathological draft (e.g. 20k tags churned) can't freeze the
Blazor render; overflow banner tells operator how many rows were hidden. *@
<div class="card mb-3">
<div class="card-header d-flex justify-content-between align-items-center">
<div>
<strong>@Title</strong>
<small class="text-muted ms-2">@Description</small>
</div>
<div>
@if (_added > 0) { <span class="badge bg-success me-1">+@_added</span> }
@if (_removed > 0) { <span class="badge bg-danger me-1">@_removed</span> }
@if (_modified > 0) { <span class="badge bg-warning text-dark me-1">~@_modified</span> }
@if (_total == 0) { <span class="badge bg-secondary">no changes</span> }
</div>
</div>
@if (_total == 0)
{
<div class="card-body text-muted small">No changes in this section.</div>
}
else
{
@if (_total > RowCap)
{
<div class="alert alert-warning mb-0 small rounded-0">
Showing the first @RowCap of @_total rows — cap protects the browser from megabyte-class
diffs. Inspect the remainder via the SQL <code>sp_ComputeGenerationDiff</code> directly.
</div>
}
<div class="table-responsive" style="max-height: 400px; overflow-y: auto;">
<table class="table table-sm table-hover mb-0">
<thead class="table-light">
<tr><th>LogicalId</th><th style="width: 120px;">Change</th></tr>
</thead>
<tbody>
@foreach (var r in _visibleRows)
{
<tr>
<td><code>@r.LogicalId</code></td>
<td>
@switch (r.ChangeKind)
{
case "Added": <span class="badge bg-success">@r.ChangeKind</span> break;
case "Removed": <span class="badge bg-danger">@r.ChangeKind</span> break;
case "Modified": <span class="badge bg-warning text-dark">@r.ChangeKind</span> break;
default: <span class="badge bg-secondary">@r.ChangeKind</span> break;
}
</td>
</tr>
}
</tbody>
</table>
</div>
}
</div>
@code {
/// <summary>Default row-cap per section — matches task #156's acceptance criterion.</summary>
public const int DefaultRowCap = 1000;
[Parameter, EditorRequired] public string Title { get; set; } = string.Empty;
[Parameter] public string Description { get; set; } = string.Empty;
[Parameter, EditorRequired] public IReadOnlyList<DiffRow> Rows { get; set; } = [];
[Parameter] public int RowCap { get; set; } = DefaultRowCap;
private int _total;
private int _added;
private int _removed;
private int _modified;
private List<DiffRow> _visibleRows = [];
protected override void OnParametersSet()
{
_total = Rows.Count;
_added = 0; _removed = 0; _modified = 0;
foreach (var r in Rows)
{
switch (r.ChangeKind)
{
case "Added": _added++; break;
case "Removed": _removed++; break;
case "Modified": _modified++; break;
}
}
_visibleRows = _total > RowCap ? Rows.Take(RowCap).ToList() : Rows.ToList();
}
}

View File

@@ -28,36 +28,44 @@ else if (_rows.Count == 0)
}
else
{
<table class="table table-hover table-sm">
<thead><tr><th>Table</th><th>LogicalId</th><th>ChangeKind</th></tr></thead>
<tbody>
@foreach (var r in _rows)
{
<tr>
<td>@r.TableName</td>
<td><code>@r.LogicalId</code></td>
<td>
@switch (r.ChangeKind)
{
case "Added": <span class="badge bg-success">@r.ChangeKind</span> break;
case "Removed": <span class="badge bg-danger">@r.ChangeKind</span> break;
case "Modified": <span class="badge bg-warning text-dark">@r.ChangeKind</span> break;
default: <span class="badge bg-secondary">@r.ChangeKind</span> break;
}
</td>
</tr>
}
</tbody>
</table>
<p class="small text-muted mb-3">
@_rows.Count row@(_rows.Count == 1 ? "" : "s") across @_sectionsWithChanges of @Sections.Count sections.
Each section is capped at @DiffSection.DefaultRowCap rows to keep the browser responsive on pathological drafts.
</p>
@foreach (var sec in Sections)
{
<DiffSection Title="@sec.Title"
Description="@sec.Description"
Rows="@RowsFor(sec.TableName)"/>
}
}
@code {
[Parameter] public string ClusterId { get; set; } = string.Empty;
[Parameter] public long GenerationId { get; set; }
/// <summary>
/// Ordered section definitions — each maps a <c>TableName</c> emitted by
/// <c>sp_ComputeGenerationDiff</c> to a human label + description. The proc currently
/// emits Namespace/DriverInstance/Equipment/Tag; UnsLine + NodeAcl entries render as
/// empty "no changes" cards until the proc is extended (tracked in tasks #196 + #156
/// follow-up). Six sections total matches the task #156 target.
/// </summary>
private static readonly IReadOnlyList<SectionDef> Sections = new[]
{
new SectionDef("Namespace", "Namespaces", "OPC UA namespace URIs + enablement"),
new SectionDef("DriverInstance", "Driver instances","Per-cluster driver configuration rows"),
new SectionDef("Equipment", "Equipment", "UNS level-5 rows + identification fields"),
new SectionDef("Tag", "Tags", "Per-device tag definitions + poll-group binding"),
new SectionDef("UnsLine", "UNS structure", "Site / Area / Line hierarchy (proc-extension pending)"),
new SectionDef("NodeAcl", "ACLs", "LDAP-group → node-scope permission grants (logical id = LdapGroup|ScopeKind|ScopeId)"),
};
private List<DiffRow>? _rows;
private string _fromLabel = "(empty)";
private string? _error;
private int _sectionsWithChanges;
protected override async Task OnParametersSetAsync()
{
@@ -67,7 +75,13 @@ else
var from = all.FirstOrDefault(g => g.Status == GenerationStatus.Published);
_fromLabel = from is null ? "(empty)" : $"gen {from.GenerationId}";
_rows = await GenerationSvc.ComputeDiffAsync(from?.GenerationId ?? 0, GenerationId, CancellationToken.None);
_sectionsWithChanges = Sections.Count(s => _rows.Any(r => r.TableName == s.TableName));
}
catch (Exception ex) { _error = ex.Message; }
}
private IReadOnlyList<DiffRow> RowsFor(string tableName) =>
_rows?.Where(r => r.TableName == tableName).ToList() ?? [];
private sealed record SectionDef(string TableName, string Title, string Description);
}

View File

@@ -27,7 +27,7 @@
<div class="row">
<div class="col-md-8">
@if (_tab == "equipment") { <EquipmentTab GenerationId="@GenerationId"/> }
@if (_tab == "equipment") { <EquipmentTab GenerationId="@GenerationId" ClusterId="@ClusterId"/> }
else if (_tab == "uns") { <UnsTab GenerationId="@GenerationId" ClusterId="@ClusterId"/> }
else if (_tab == "namespaces") { <NamespacesTab GenerationId="@GenerationId" ClusterId="@ClusterId"/> }
else if (_tab == "drivers") { <DriversTab GenerationId="@GenerationId" ClusterId="@ClusterId"/> }

View File

@@ -2,10 +2,14 @@
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@using ZB.MOM.WW.OtOpcUa.Configuration.Validation
@inject EquipmentService EquipmentSvc
@inject NavigationManager Nav
<div class="d-flex justify-content-between mb-3">
<h4>Equipment (draft gen @GenerationId)</h4>
<button class="btn btn-primary btn-sm" @onclick="StartAdd">Add equipment</button>
<div>
<button class="btn btn-outline-primary btn-sm me-2" @onclick="GoImport">Import CSV…</button>
<button class="btn btn-primary btn-sm" @onclick="StartAdd">Add equipment</button>
</div>
</div>
@if (_equipment is null)
@@ -36,7 +40,10 @@ else if (_equipment.Count > 0)
<td>@e.SAPID</td>
<td>@e.Manufacturer / @e.Model</td>
<td>@e.SerialNumber</td>
<td><button class="btn btn-sm btn-outline-danger" @onclick="() => DeleteAsync(e.EquipmentRowId)">Remove</button></td>
<td>
<button class="btn btn-sm btn-outline-secondary me-1" @onclick="() => StartEdit(e)">Edit</button>
<button class="btn btn-sm btn-outline-danger" @onclick="() => DeleteAsync(e.EquipmentRowId)">Remove</button>
</td>
</tr>
}
</tbody>
@@ -47,8 +54,8 @@ else if (_equipment.Count > 0)
{
<div class="card mt-3">
<div class="card-body">
<h5>New equipment</h5>
<EditForm Model="_draft" OnValidSubmit="SaveAsync" FormName="new-equipment">
<h5>@(_editMode ? "Edit equipment" : "New equipment")</h5>
<EditForm Model="_draft" OnValidSubmit="SaveAsync" FormName="equipment-form">
<DataAnnotationsValidator/>
<div class="row g-3">
<div class="col-md-4">
@@ -78,24 +85,13 @@ else if (_equipment.Count > 0)
</div>
</div>
<h6 class="mt-4">OPC 40010 Identification</h6>
<div class="row g-3">
<div class="col-md-4"><label class="form-label">Manufacturer</label><InputText @bind-Value="_draft.Manufacturer" class="form-control"/></div>
<div class="col-md-4"><label class="form-label">Model</label><InputText @bind-Value="_draft.Model" class="form-control"/></div>
<div class="col-md-4"><label class="form-label">Serial number</label><InputText @bind-Value="_draft.SerialNumber" class="form-control"/></div>
<div class="col-md-4"><label class="form-label">Hardware rev</label><InputText @bind-Value="_draft.HardwareRevision" class="form-control"/></div>
<div class="col-md-4"><label class="form-label">Software rev</label><InputText @bind-Value="_draft.SoftwareRevision" class="form-control"/></div>
<div class="col-md-4">
<label class="form-label">Year of construction</label>
<InputNumber @bind-Value="_draft.YearOfConstruction" class="form-control"/>
</div>
</div>
<IdentificationFields Equipment="_draft"/>
@if (_error is not null) { <div class="alert alert-danger mt-3">@_error</div> }
<div class="mt-3">
<button type="submit" class="btn btn-primary btn-sm">Save</button>
<button type="button" class="btn btn-secondary btn-sm ms-2" @onclick="() => _showForm = false">Cancel</button>
<button type="button" class="btn btn-secondary btn-sm ms-2" @onclick="Cancel">Cancel</button>
</div>
</EditForm>
</div>
@@ -104,8 +100,12 @@ else if (_equipment.Count > 0)
@code {
[Parameter] public long GenerationId { get; set; }
[Parameter] public string ClusterId { get; set; } = string.Empty;
private void GoImport() => Nav.NavigateTo($"/clusters/{ClusterId}/draft/{GenerationId}/import-equipment");
private List<Equipment>? _equipment;
private bool _showForm;
private bool _editMode;
private Equipment _draft = NewBlankDraft();
private string? _error;
@@ -125,20 +125,68 @@ else if (_equipment.Count > 0)
private void StartAdd()
{
_draft = NewBlankDraft();
_editMode = false;
_error = null;
_showForm = true;
}
private void StartEdit(Equipment row)
{
// Shallow-clone so Cancel doesn't mutate the list-displayed row with in-flight form edits.
_draft = new Equipment
{
EquipmentRowId = row.EquipmentRowId,
GenerationId = row.GenerationId,
EquipmentId = row.EquipmentId,
EquipmentUuid = row.EquipmentUuid,
DriverInstanceId = row.DriverInstanceId,
DeviceId = row.DeviceId,
UnsLineId = row.UnsLineId,
Name = row.Name,
MachineCode = row.MachineCode,
ZTag = row.ZTag,
SAPID = row.SAPID,
Manufacturer = row.Manufacturer,
Model = row.Model,
SerialNumber = row.SerialNumber,
HardwareRevision = row.HardwareRevision,
SoftwareRevision = row.SoftwareRevision,
YearOfConstruction = row.YearOfConstruction,
AssetLocation = row.AssetLocation,
ManufacturerUri = row.ManufacturerUri,
DeviceManualUri = row.DeviceManualUri,
EquipmentClassRef = row.EquipmentClassRef,
Enabled = row.Enabled,
};
_editMode = true;
_error = null;
_showForm = true;
}
private void Cancel()
{
_showForm = false;
_editMode = false;
}
private async Task SaveAsync()
{
_error = null;
_draft.EquipmentUuid = Guid.NewGuid();
_draft.EquipmentId = DraftValidator.DeriveEquipmentId(_draft.EquipmentUuid);
_draft.GenerationId = GenerationId;
try
{
await EquipmentSvc.CreateAsync(GenerationId, _draft, CancellationToken.None);
if (_editMode)
{
await EquipmentSvc.UpdateAsync(_draft, CancellationToken.None);
}
else
{
_draft.EquipmentUuid = Guid.NewGuid();
_draft.EquipmentId = DraftValidator.DeriveEquipmentId(_draft.EquipmentUuid);
_draft.GenerationId = GenerationId;
await EquipmentSvc.CreateAsync(GenerationId, _draft, CancellationToken.None);
}
_showForm = false;
_editMode = false;
await ReloadAsync();
}
catch (Exception ex) { _error = ex.Message; }

View File

@@ -0,0 +1,49 @@
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@* Reusable OPC 40010 Machinery Identification editor. Binds to an Equipment row and renders the
nine decision #139 fields in a consistent 3-column Bootstrap grid. Used by EquipmentTab's
create + edit forms so the same UI renders regardless of which flow opened it. *@
<h6 class="mt-4">OPC 40010 Identification</h6>
<div class="row g-3">
<div class="col-md-4">
<label class="form-label">Manufacturer</label>
<InputText @bind-Value="Equipment!.Manufacturer" class="form-control"/>
</div>
<div class="col-md-4">
<label class="form-label">Model</label>
<InputText @bind-Value="Equipment!.Model" class="form-control"/>
</div>
<div class="col-md-4">
<label class="form-label">Serial number</label>
<InputText @bind-Value="Equipment!.SerialNumber" class="form-control"/>
</div>
<div class="col-md-4">
<label class="form-label">Hardware rev</label>
<InputText @bind-Value="Equipment!.HardwareRevision" class="form-control"/>
</div>
<div class="col-md-4">
<label class="form-label">Software rev</label>
<InputText @bind-Value="Equipment!.SoftwareRevision" class="form-control"/>
</div>
<div class="col-md-4">
<label class="form-label">Year of construction</label>
<InputNumber @bind-Value="Equipment!.YearOfConstruction" class="form-control"/>
</div>
<div class="col-md-4">
<label class="form-label">Asset location</label>
<InputText @bind-Value="Equipment!.AssetLocation" class="form-control"/>
</div>
<div class="col-md-4">
<label class="form-label">Manufacturer URI</label>
<InputText @bind-Value="Equipment!.ManufacturerUri" class="form-control" placeholder="https://…"/>
</div>
<div class="col-md-4">
<label class="form-label">Device manual URI</label>
<InputText @bind-Value="Equipment!.DeviceManualUri" class="form-control" placeholder="https://…"/>
</div>
</div>
@code {
[Parameter, EditorRequired] public Equipment? Equipment { get; set; }
}

View File

@@ -0,0 +1,200 @@
@page "/clusters/{ClusterId}/draft/{GenerationId:long}/import-equipment"
@using Microsoft.AspNetCore.Components.Authorization
@using ZB.MOM.WW.OtOpcUa.Admin.Services
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@inject DriverInstanceService DriverSvc
@inject UnsService UnsSvc
@inject EquipmentImportBatchService BatchSvc
@inject NavigationManager Nav
@inject AuthenticationStateProvider AuthProvider
<div class="d-flex justify-content-between align-items-center mb-3">
<div>
<h1 class="mb-0">Equipment CSV import</h1>
<small class="text-muted">Cluster <code>@ClusterId</code> · draft generation @GenerationId</small>
</div>
<a class="btn btn-outline-secondary" href="/clusters/@ClusterId/draft/@GenerationId">Back to draft</a>
</div>
<div class="alert alert-info small mb-3">
Accepts <code>@EquipmentCsvImporter.VersionMarker</code>-headered CSV per Stream B.3.
Required columns: @string.Join(", ", EquipmentCsvImporter.RequiredColumns).
Optional columns cover the OPC 40010 Identification fields. Paste the file contents
or upload directly — the parser runs client-stream-side and shows a row-level preview
before anything lands in the draft. ZTag + SAPID uniqueness across the fleet is NOT
enforced here yet (see task #197); for now the finalise may fail at commit time if a
reservation conflict exists.
</div>
<div class="card mb-3">
<div class="card-body">
<div class="row g-3">
<div class="col-md-5">
<label class="form-label">Target driver instance (for every accepted row)</label>
<select class="form-select" @bind="_driverInstanceId">
<option value="">-- select driver --</option>
@if (_drivers is not null)
{
@foreach (var d in _drivers) { <option value="@d.DriverInstanceId">@d.DriverInstanceId</option> }
}
</select>
</div>
<div class="col-md-5">
<label class="form-label">Target UNS line (for every accepted row)</label>
<select class="form-select" @bind="_unsLineId">
<option value="">-- select line --</option>
@if (_unsLines is not null)
{
@foreach (var l in _unsLines) { <option value="@l.UnsLineId">@l.UnsLineId — @l.Name</option> }
}
</select>
</div>
<div class="col-md-2 pt-4">
<InputFile OnChange="HandleFileAsync" class="form-control form-control-sm" accept=".csv,.txt"/>
</div>
</div>
<div class="mt-3">
<label class="form-label">CSV content (paste or uploaded)</label>
<textarea class="form-control font-monospace" rows="8" @bind="_csvText"
placeholder="# OtOpcUaCsv v1&#10;ZTag,MachineCode,SAPID,EquipmentId,…"/>
</div>
<div class="mt-3">
<button class="btn btn-sm btn-outline-primary" @onclick="ParseAsync" disabled="@_busy">Parse</button>
<button class="btn btn-sm btn-primary ms-2" @onclick="StageAndFinaliseAsync"
disabled="@(_parseResult is null || _parseResult.AcceptedRows.Count == 0 || string.IsNullOrWhiteSpace(_driverInstanceId) || string.IsNullOrWhiteSpace(_unsLineId) || _busy)">
Stage + Finalise
</button>
@if (_parseError is not null) { <span class="alert alert-danger ms-3 py-1 px-2 small">@_parseError</span> }
@if (_result is not null) { <span class="alert alert-success ms-3 py-1 px-2 small">@_result</span> }
</div>
</div>
</div>
@if (_parseResult is not null)
{
<div class="row g-3">
<div class="col-md-6">
<div class="card">
<div class="card-header bg-success text-white">
Accepted (@_parseResult.AcceptedRows.Count)
</div>
<div class="card-body p-0" style="max-height: 400px; overflow-y: auto;">
@if (_parseResult.AcceptedRows.Count == 0)
{
<p class="text-muted p-3 mb-0">No accepted rows.</p>
}
else
{
<table class="table table-sm table-striped mb-0">
<thead>
<tr><th>ZTag</th><th>Machine</th><th>Name</th><th>Line</th></tr>
</thead>
<tbody>
@foreach (var r in _parseResult.AcceptedRows)
{
<tr>
<td><code>@r.ZTag</code></td>
<td>@r.MachineCode</td>
<td>@r.Name</td>
<td>@r.UnsLineName</td>
</tr>
}
</tbody>
</table>
}
</div>
</div>
</div>
<div class="col-md-6">
<div class="card">
<div class="card-header bg-danger text-white">
Rejected (@_parseResult.RejectedRows.Count)
</div>
<div class="card-body p-0" style="max-height: 400px; overflow-y: auto;">
@if (_parseResult.RejectedRows.Count == 0)
{
<p class="text-muted p-3 mb-0">No rejections.</p>
}
else
{
<table class="table table-sm table-striped mb-0">
<thead><tr><th>Line</th><th>Reason</th></tr></thead>
<tbody>
@foreach (var e in _parseResult.RejectedRows)
{
<tr>
<td>@e.LineNumber</td>
<td class="small">@e.Reason</td>
</tr>
}
</tbody>
</table>
}
</div>
</div>
</div>
</div>
}
@code {
[Parameter] public string ClusterId { get; set; } = string.Empty;
[Parameter] public long GenerationId { get; set; }
private List<DriverInstance>? _drivers;
private List<UnsLine>? _unsLines;
private string _driverInstanceId = string.Empty;
private string _unsLineId = string.Empty;
private string _csvText = string.Empty;
private EquipmentCsvParseResult? _parseResult;
private string? _parseError;
private string? _result;
private bool _busy;
protected override async Task OnInitializedAsync()
{
_drivers = await DriverSvc.ListAsync(GenerationId, CancellationToken.None);
_unsLines = await UnsSvc.ListLinesAsync(GenerationId, CancellationToken.None);
}
private async Task HandleFileAsync(InputFileChangeEventArgs e)
{
// 5 MiB cap — refuses pathological uploads that would OOM the server.
using var stream = e.File.OpenReadStream(maxAllowedSize: 5 * 1024 * 1024);
using var reader = new StreamReader(stream);
_csvText = await reader.ReadToEndAsync();
}
private void ParseAsync()
{
_parseError = null;
_parseResult = null;
_result = null;
try { _parseResult = EquipmentCsvImporter.Parse(_csvText); }
catch (InvalidCsvFormatException ex) { _parseError = ex.Message; }
catch (Exception ex) { _parseError = $"Parse failed: {ex.Message}"; }
}
private async Task StageAndFinaliseAsync()
{
if (_parseResult is null) return;
_busy = true;
_result = null;
_parseError = null;
try
{
var auth = await AuthProvider.GetAuthenticationStateAsync();
var createdBy = auth.User.Identity?.Name ?? "unknown";
var batch = await BatchSvc.CreateBatchAsync(ClusterId, createdBy, CancellationToken.None);
await BatchSvc.StageRowsAsync(batch.Id, _parseResult.AcceptedRows, _parseResult.RejectedRows, CancellationToken.None);
await BatchSvc.FinaliseBatchAsync(batch.Id, GenerationId, _driverInstanceId, _unsLineId, CancellationToken.None);
_result = $"Finalised batch {batch.Id:N} — {_parseResult.AcceptedRows.Count} rows added.";
// Pause 600 ms so the success banner is visible, then navigate back.
await Task.Delay(600);
Nav.NavigateTo($"/clusters/{ClusterId}/draft/{GenerationId}");
}
catch (Exception ex) { _parseError = $"Finalise failed: {ex.Message}"; }
finally { _busy = false; }
}
}

View File

@@ -0,0 +1,175 @@
@using Microsoft.AspNetCore.SignalR.Client
@using ZB.MOM.WW.OtOpcUa.Admin.Hubs
@using ZB.MOM.WW.OtOpcUa.Admin.Services
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@using ZB.MOM.WW.OtOpcUa.Configuration.Enums
@inject ClusterNodeService NodeSvc
@inject NavigationManager Nav
@implements IAsyncDisposable
<h4>Redundancy topology</h4>
@if (_roleChangedBanner is not null)
{
<div class="alert alert-info small mb-2">@_roleChangedBanner</div>
}
<p class="text-muted small">
One row per <code>ClusterNode</code> in this cluster. Role, <code>ApplicationUri</code>,
and <code>ServiceLevelBase</code> are authored separately; the Admin UI shows them read-only
here so operators can confirm the published topology without touching it. LastSeen older than
@((int)ClusterNodeService.StaleThreshold.TotalSeconds)s is flagged Stale — the node has
stopped heart-beating and is likely down. Role swap goes through the server-side
<code>RedundancyCoordinator</code> apply-lease flow, not direct DB edits.
</p>
@if (_nodes is null)
{
<p>Loading…</p>
}
else if (_nodes.Count == 0)
{
<div class="alert alert-warning">
No ClusterNode rows for this cluster. The server process needs at least one entry
(with a non-blank <code>ApplicationUri</code>) before it can start up per OPC UA spec.
</div>
}
else
{
var primaries = _nodes.Count(n => n.RedundancyRole == RedundancyRole.Primary);
var secondaries = _nodes.Count(n => n.RedundancyRole == RedundancyRole.Secondary);
var standalone = _nodes.Count(n => n.RedundancyRole == RedundancyRole.Standalone);
var staleCount = _nodes.Count(ClusterNodeService.IsStale);
<div class="row g-3 mb-4">
<div class="col-md-3"><div class="card"><div class="card-body">
<h6 class="text-muted mb-1">Nodes</h6>
<div class="fs-3">@_nodes.Count</div>
</div></div></div>
<div class="col-md-3"><div class="card border-success"><div class="card-body">
<h6 class="text-muted mb-1">Primary</h6>
<div class="fs-3 text-success">@primaries</div>
</div></div></div>
<div class="col-md-3"><div class="card border-info"><div class="card-body">
<h6 class="text-muted mb-1">Secondary</h6>
<div class="fs-3 text-info">@secondaries</div>
</div></div></div>
<div class="col-md-3"><div class="card @(staleCount > 0 ? "border-warning" : "")"><div class="card-body">
<h6 class="text-muted mb-1">Stale</h6>
<div class="fs-3 @(staleCount > 0 ? "text-warning" : "")">@staleCount</div>
</div></div></div>
</div>
@if (primaries == 0 && standalone == 0)
{
<div class="alert alert-danger small mb-3">
No Primary or Standalone node — the cluster has no authoritative write target. Secondaries
stay read-only until one of them gets promoted via <code>RedundancyCoordinator</code>.
</div>
}
else if (primaries > 1)
{
<div class="alert alert-danger small mb-3">
<strong>Split-brain:</strong> @primaries nodes claim the Primary role. Apply-lease
enforcement should have made this impossible at the coordinator level. Investigate
immediately — one of the rows was likely hand-edited.
</div>
}
<table class="table table-sm table-hover align-middle">
<thead>
<tr>
<th>Node</th>
<th>Role</th>
<th>Host</th>
<th class="text-end">OPC UA port</th>
<th class="text-end">ServiceLevel base</th>
<th>ApplicationUri</th>
<th>Enabled</th>
<th>Last seen</th>
</tr>
</thead>
<tbody>
@foreach (var n in _nodes)
{
<tr class="@RowClass(n)">
<td><code>@n.NodeId</code></td>
<td><span class="badge @RoleBadge(n.RedundancyRole)">@n.RedundancyRole</span></td>
<td>@n.Host</td>
<td class="text-end"><code>@n.OpcUaPort</code></td>
<td class="text-end">@n.ServiceLevelBase</td>
<td class="small text-break"><code>@n.ApplicationUri</code></td>
<td>
@if (n.Enabled) { <span class="badge bg-success">Enabled</span> }
else { <span class="badge bg-secondary">Disabled</span> }
</td>
<td class="small @(ClusterNodeService.IsStale(n) ? "text-warning fw-bold" : "")">
@(n.LastSeenAt is null ? "never" : FormatAge(n.LastSeenAt.Value))
@if (ClusterNodeService.IsStale(n)) { <span class="badge bg-warning text-dark ms-1">Stale</span> }
</td>
</tr>
}
</tbody>
</table>
}
@code {
[Parameter] public string ClusterId { get; set; } = string.Empty;
private List<ClusterNode>? _nodes;
private HubConnection? _hub;
private string? _roleChangedBanner;
protected override async Task OnParametersSetAsync()
{
_nodes = await NodeSvc.ListByClusterAsync(ClusterId, CancellationToken.None);
if (_hub is null) await ConnectHubAsync();
}
private async Task ConnectHubAsync()
{
_hub = new HubConnectionBuilder()
.WithUrl(Nav.ToAbsoluteUri("/hubs/fleet-status"))
.WithAutomaticReconnect()
.Build();
_hub.On<RoleChangedMessage>("RoleChanged", async msg =>
{
if (msg.ClusterId != ClusterId) return;
_roleChangedBanner = $"Role changed on {msg.NodeId}: {msg.FromRole} → {msg.ToRole} at {msg.ObservedAtUtc:HH:mm:ss 'UTC'}";
_nodes = await NodeSvc.ListByClusterAsync(ClusterId, CancellationToken.None);
await InvokeAsync(StateHasChanged);
});
await _hub.StartAsync();
await _hub.SendAsync("SubscribeCluster", ClusterId);
}
public async ValueTask DisposeAsync()
{
if (_hub is not null)
{
await _hub.DisposeAsync();
_hub = null;
}
}
private static string RowClass(ClusterNode n) =>
ClusterNodeService.IsStale(n) ? "table-warning" :
!n.Enabled ? "table-secondary" : "";
private static string RoleBadge(RedundancyRole r) => r switch
{
RedundancyRole.Primary => "bg-success",
RedundancyRole.Secondary => "bg-info",
RedundancyRole.Standalone => "bg-primary",
_ => "bg-secondary",
};
private static string FormatAge(DateTime t)
{
var age = DateTime.UtcNow - t;
if (age.TotalSeconds < 60) return $"{(int)age.TotalSeconds}s ago";
if (age.TotalMinutes < 60) return $"{(int)age.TotalMinutes}m ago";
if (age.TotalHours < 24) return $"{(int)age.TotalHours}h ago";
return t.ToString("yyyy-MM-dd HH:mm 'UTC'");
}
}

View File

@@ -2,6 +2,13 @@
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@inject UnsService UnsSvc
<div class="alert alert-info small mb-3">
Drag any line in the <strong>UNS Lines</strong> table onto an area row in <strong>UNS Areas</strong>
to re-parent it. A preview modal shows the impact (equipment re-home count) + lets you confirm
or cancel. If another operator modifies the draft while you're confirming, you'll see a 409
refresh-required modal instead of clobbering their work.
</div>
<div class="row">
<div class="col-md-6">
<div class="d-flex justify-content-between mb-2">
@@ -14,11 +21,20 @@
else
{
<table class="table table-sm">
<thead><tr><th>AreaId</th><th>Name</th></tr></thead>
<thead><tr><th>AreaId</th><th>Name</th><th class="small text-muted">(drop target)</th></tr></thead>
<tbody>
@foreach (var a in _areas)
{
<tr><td><code>@a.UnsAreaId</code></td><td>@a.Name</td></tr>
<tr class="@(_hoverAreaId == a.UnsAreaId ? "table-primary" : "")"
@ondragover="e => OnAreaDragOver(e, a.UnsAreaId)"
@ondragover:preventDefault
@ondragleave="() => _hoverAreaId = null"
@ondrop="() => OnLineDroppedAsync(a.UnsAreaId)"
@ondrop:preventDefault>
<td><code>@a.UnsAreaId</code></td>
<td>@a.Name</td>
<td class="small text-muted">drop here</td>
</tr>
}
</tbody>
</table>
@@ -35,6 +51,7 @@
</div>
}
</div>
<div class="col-md-6">
<div class="d-flex justify-content-between mb-2">
<h4>UNS Lines</h4>
@@ -50,7 +67,14 @@
<tbody>
@foreach (var l in _lines)
{
<tr><td><code>@l.UnsLineId</code></td><td><code>@l.UnsAreaId</code></td><td>@l.Name</td></tr>
<tr draggable="true"
@ondragstart="() => _dragLineId = l.UnsLineId"
@ondragend="() => { _dragLineId = null; _hoverAreaId = null; }"
style="cursor: grab;">
<td><code>@l.UnsLineId</code></td>
<td><code>@l.UnsAreaId</code></td>
<td>@l.Name</td>
</tr>
}
</tbody>
</table>
@@ -75,6 +99,64 @@
</div>
</div>
@* Preview / confirm modal for a pending drag-drop move *@
@if (_pendingPreview is not null)
{
<div class="modal show d-block" tabindex="-1" style="background-color: rgba(0,0,0,0.5);">
<div class="modal-dialog">
<div class="modal-content">
<div class="modal-header">
<h5 class="modal-title">Confirm UNS move</h5>
<button type="button" class="btn-close" @onclick="CancelMove"></button>
</div>
<div class="modal-body">
<p>@_pendingPreview.HumanReadableSummary</p>
<p class="text-muted small">
Equipment re-homed: <strong>@_pendingPreview.AffectedEquipmentCount</strong>.
Tags re-parented: <strong>@_pendingPreview.AffectedTagCount</strong>.
</p>
@if (_pendingPreview.CascadeWarnings.Count > 0)
{
<div class="alert alert-warning small mb-0">
<ul class="mb-0">
@foreach (var w in _pendingPreview.CascadeWarnings) { <li>@w</li> }
</ul>
</div>
}
</div>
<div class="modal-footer">
<button class="btn btn-secondary" @onclick="CancelMove">Cancel</button>
<button class="btn btn-primary" @onclick="ConfirmMoveAsync" disabled="@_committing">Confirm move</button>
</div>
</div>
</div>
</div>
}
@* 409 concurrent-edit modal — another operator changed the draft between preview + commit *@
@if (_conflictMessage is not null)
{
<div class="modal show d-block" tabindex="-1" style="background-color: rgba(0,0,0,0.5);">
<div class="modal-dialog">
<div class="modal-content border-danger">
<div class="modal-header bg-danger text-white">
<h5 class="modal-title">Draft changed — refresh required</h5>
</div>
<div class="modal-body">
<p>@_conflictMessage</p>
<p class="small text-muted">
Concurrency guard per DraftRevisionToken prevented overwriting the peer
operator's edit. Reload the tab + redo the move on the current draft state.
</p>
</div>
<div class="modal-footer">
<button class="btn btn-primary" @onclick="ReloadAfterConflict">Reload draft</button>
</div>
</div>
</div>
</div>
}
@code {
[Parameter] public long GenerationId { get; set; }
[Parameter] public string ClusterId { get; set; } = string.Empty;
@@ -87,6 +169,13 @@
private string _newLineName = string.Empty;
private string _newLineAreaId = string.Empty;
private string? _dragLineId;
private string? _hoverAreaId;
private UnsImpactPreview? _pendingPreview;
private UnsMoveOperation? _pendingMove;
private bool _committing;
private string? _conflictMessage;
protected override async Task OnParametersSetAsync() => await ReloadAsync();
private async Task ReloadAsync()
@@ -112,4 +201,72 @@
_showLineForm = false;
await ReloadAsync();
}
private void OnAreaDragOver(DragEventArgs _, string areaId) => _hoverAreaId = areaId;
private async Task OnLineDroppedAsync(string targetAreaId)
{
var lineId = _dragLineId;
_hoverAreaId = null;
_dragLineId = null;
if (string.IsNullOrWhiteSpace(lineId)) return;
var line = _lines?.FirstOrDefault(l => l.UnsLineId == lineId);
if (line is null || line.UnsAreaId == targetAreaId) return;
var snapshot = await UnsSvc.LoadSnapshotAsync(GenerationId, CancellationToken.None);
var move = new UnsMoveOperation(
Kind: UnsMoveKind.LineMove,
SourceClusterId: ClusterId,
TargetClusterId: ClusterId,
SourceLineId: lineId,
TargetAreaId: targetAreaId);
try
{
_pendingPreview = UnsImpactAnalyzer.Analyze(snapshot, move);
_pendingMove = move;
}
catch (Exception ex)
{
_conflictMessage = ex.Message; // CrossCluster or validation failure surfaces here
}
}
private void CancelMove()
{
_pendingPreview = null;
_pendingMove = null;
}
private async Task ConfirmMoveAsync()
{
if (_pendingPreview is null || _pendingMove is null) return;
_committing = true;
try
{
await UnsSvc.MoveLineAsync(
GenerationId,
_pendingPreview.RevisionToken,
_pendingMove.SourceLineId!,
_pendingMove.TargetAreaId!,
CancellationToken.None);
_pendingPreview = null;
_pendingMove = null;
await ReloadAsync();
}
catch (DraftRevisionConflictException ex)
{
_pendingPreview = null;
_pendingMove = null;
_conflictMessage = ex.Message;
}
finally { _committing = false; }
}
private async Task ReloadAfterConflict()
{
_conflictMessage = null;
await ReloadAsync();
}
}

View File

@@ -56,6 +56,16 @@ else
</div></div></div>
</div>
@if (_rows.Any(HostStatusService.IsFlagged))
{
var flaggedCount = _rows.Count(HostStatusService.IsFlagged);
<div class="alert alert-danger small mb-3">
<strong>@flaggedCount host@(flaggedCount == 1 ? "" : "s")</strong>
reporting ≥ @HostStatusService.FailureFlagThreshold consecutive failures — circuit breaker
may trip soon. Inspect the resilience columns below to locate.
</div>
}
@foreach (var cluster in _rows.GroupBy(r => r.ClusterId ?? "(unassigned)").OrderBy(g => g.Key))
{
<h2 class="h5 mt-4">Cluster: <code>@cluster.Key</code></h2>
@@ -66,6 +76,9 @@ else
<th>Driver</th>
<th>Host</th>
<th>State</th>
<th class="text-end" title="Consecutive failures — resets when a call succeeds or the breaker closes">Fail#</th>
<th class="text-end" title="In-flight capability calls (bulkhead-depth proxy)">In-flight</th>
<th>Breaker opened</th>
<th>Last transition</th>
<th>Last seen</th>
<th>Detail</th>
@@ -84,10 +97,21 @@ else
{
<span class="badge bg-warning text-dark ms-1">Stale</span>
}
@if (HostStatusService.IsFlagged(r))
{
<span class="badge bg-danger ms-1" title="≥ @HostStatusService.FailureFlagThreshold consecutive failures">Flagged</span>
}
</td>
<td class="text-end small @(HostStatusService.IsFlagged(r) ? "text-danger fw-bold" : "")">
@r.ConsecutiveFailures
</td>
<td class="text-end small">@r.CurrentBulkheadDepth</td>
<td class="small">
@(r.LastCircuitBreakerOpenUtc is null ? "—" : FormatAge(r.LastCircuitBreakerOpenUtc.Value))
</td>
<td class="small">@FormatAge(r.StateChangedUtc)</td>
<td class="small @(HostStatusService.IsStale(r) ? "text-warning" : "")">@FormatAge(r.LastSeenUtc)</td>
<td class="text-truncate small" style="max-width: 320px;" title="@r.Detail">@r.Detail</td>
<td class="text-truncate small" style="max-width: 240px;" title="@r.Detail">@r.Detail</td>
</tr>
}
</tbody>

View File

@@ -0,0 +1,192 @@
@page "/role-grants"
@using Microsoft.AspNetCore.Components.Web
@using Microsoft.AspNetCore.SignalR.Client
@using ZB.MOM.WW.OtOpcUa.Admin.Hubs
@using ZB.MOM.WW.OtOpcUa.Admin.Services
@using ZB.MOM.WW.OtOpcUa.Configuration.Entities
@using ZB.MOM.WW.OtOpcUa.Configuration.Enums
@using ZB.MOM.WW.OtOpcUa.Configuration.Services
@inject ILdapGroupRoleMappingService RoleSvc
@inject ClusterService ClusterSvc
@inject AclChangeNotifier Notifier
@inject NavigationManager Nav
@implements IAsyncDisposable
<h1 class="mb-4">LDAP group → Admin role grants</h1>
<div class="alert alert-info small mb-4">
Maps LDAP groups to Admin UI roles (ConfigViewer / ConfigEditor / FleetAdmin). Control-plane
only — OPC UA data-path authorization reads <code>NodeAcl</code> rows directly and is
unaffected by these mappings (see decision #150). A fleet-wide grant applies across every
cluster; a cluster-scoped grant only binds within the named cluster. The same LDAP group
may hold different roles on different clusters.
</div>
<div class="d-flex justify-content-end mb-3">
<button class="btn btn-primary btn-sm" @onclick="StartAdd">Add grant</button>
</div>
@if (_rows is null)
{
<p>Loading…</p>
}
else if (_rows.Count == 0)
{
<p class="text-muted">No role grants defined yet. Without at least one FleetAdmin grant,
only the bootstrap admin can publish drafts.</p>
}
else
{
<table class="table table-sm table-hover">
<thead>
<tr><th>LDAP group</th><th>Role</th><th>Scope</th><th>Created</th><th>Notes</th><th></th></tr>
</thead>
<tbody>
@foreach (var r in _rows)
{
<tr>
<td><code>@r.LdapGroup</code></td>
<td><span class="badge bg-secondary">@r.Role</span></td>
<td>@(r.IsSystemWide ? "Fleet-wide" : $"Cluster: {r.ClusterId}")</td>
<td class="small">@r.CreatedAtUtc.ToString("yyyy-MM-dd")</td>
<td class="small text-muted">@r.Notes</td>
<td><button class="btn btn-sm btn-outline-danger" @onclick="() => DeleteAsync(r.Id)">Revoke</button></td>
</tr>
}
</tbody>
</table>
}
@if (_showForm)
{
<div class="card mt-3">
<div class="card-body">
<h5>New role grant</h5>
<div class="row g-3">
<div class="col-md-4">
<label class="form-label">LDAP group (DN)</label>
<input class="form-control" @bind="_group" placeholder="cn=fleet-admin,ou=groups,dc=…"/>
</div>
<div class="col-md-3">
<label class="form-label">Role</label>
<select class="form-select" @bind="_role">
@foreach (var r in Enum.GetValues<AdminRole>())
{
<option value="@r">@r</option>
}
</select>
</div>
<div class="col-md-2 pt-4">
<div class="form-check">
<input class="form-check-input" type="checkbox" id="systemWide" @bind="_isSystemWide"/>
<label class="form-check-label" for="systemWide">Fleet-wide</label>
</div>
</div>
<div class="col-md-3">
<label class="form-label">Cluster @(_isSystemWide ? "(disabled)" : "")</label>
<select class="form-select" @bind="_clusterId" disabled="@_isSystemWide">
<option value="">-- select --</option>
@if (_clusters is not null)
{
@foreach (var c in _clusters)
{
<option value="@c.ClusterId">@c.ClusterId</option>
}
}
</select>
</div>
<div class="col-12">
<label class="form-label">Notes (optional)</label>
<input class="form-control" @bind="_notes"/>
</div>
</div>
@if (_error is not null) { <div class="alert alert-danger mt-3">@_error</div> }
<div class="mt-3">
<button class="btn btn-sm btn-primary" @onclick="SaveAsync">Save</button>
<button class="btn btn-sm btn-secondary ms-2" @onclick="() => _showForm = false">Cancel</button>
</div>
</div>
</div>
}
@code {
private IReadOnlyList<LdapGroupRoleMapping>? _rows;
private List<ServerCluster>? _clusters;
private bool _showForm;
private string _group = string.Empty;
private AdminRole _role = AdminRole.ConfigViewer;
private bool _isSystemWide;
private string _clusterId = string.Empty;
private string? _notes;
private string? _error;
protected override async Task OnInitializedAsync() => await ReloadAsync();
private async Task ReloadAsync()
{
_rows = await RoleSvc.ListAllAsync(CancellationToken.None);
_clusters = await ClusterSvc.ListAsync(CancellationToken.None);
}
private void StartAdd()
{
_group = string.Empty;
_role = AdminRole.ConfigViewer;
_isSystemWide = false;
_clusterId = string.Empty;
_notes = null;
_error = null;
_showForm = true;
}
private async Task SaveAsync()
{
_error = null;
try
{
var row = new LdapGroupRoleMapping
{
LdapGroup = _group.Trim(),
Role = _role,
IsSystemWide = _isSystemWide,
ClusterId = _isSystemWide ? null : (string.IsNullOrWhiteSpace(_clusterId) ? null : _clusterId),
Notes = string.IsNullOrWhiteSpace(_notes) ? null : _notes,
};
await RoleSvc.CreateAsync(row, CancellationToken.None);
await Notifier.NotifyRoleGrantsChangedAsync(CancellationToken.None);
_showForm = false;
await ReloadAsync();
}
catch (Exception ex) { _error = ex.Message; }
}
private async Task DeleteAsync(Guid id)
{
await RoleSvc.DeleteAsync(id, CancellationToken.None);
await Notifier.NotifyRoleGrantsChangedAsync(CancellationToken.None);
await ReloadAsync();
}
private HubConnection? _hub;
protected override async Task OnAfterRenderAsync(bool firstRender)
{
if (!firstRender || _hub is not null) return;
_hub = new HubConnectionBuilder()
.WithUrl(Nav.ToAbsoluteUri("/hubs/fleet-status"))
.WithAutomaticReconnect()
.Build();
_hub.On<RoleGrantsChangedMessage>("RoleGrantsChanged", async _ =>
{
await ReloadAsync();
await InvokeAsync(StateHasChanged);
});
await _hub.StartAsync();
await _hub.SendAsync("SubscribeFleet");
}
public async ValueTask DisposeAsync()
{
if (_hub is not null) { await _hub.DisposeAsync(); _hub = null; }
}
}

View File

@@ -1,7 +1,9 @@
using Microsoft.AspNetCore.SignalR;
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Admin.Services;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Admin.Hubs;
@@ -14,11 +16,13 @@ public sealed class FleetStatusPoller(
IServiceScopeFactory scopeFactory,
IHubContext<FleetStatusHub> fleetHub,
IHubContext<AlertHub> alertHub,
ILogger<FleetStatusPoller> logger) : BackgroundService
ILogger<FleetStatusPoller> logger,
RedundancyMetrics redundancyMetrics) : BackgroundService
{
public TimeSpan PollInterval { get; init; } = TimeSpan.FromSeconds(5);
private readonly Dictionary<string, NodeStateSnapshot> _last = new();
private readonly Dictionary<string, RedundancyRole> _lastRole = new(StringComparer.Ordinal);
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
@@ -42,6 +46,10 @@ public sealed class FleetStatusPoller(
using var scope = scopeFactory.CreateScope();
var db = scope.ServiceProvider.GetRequiredService<OtOpcUaConfigDbContext>();
var nodes = await db.ClusterNodes.AsNoTracking().ToListAsync(ct);
await PollRolesAsync(nodes, ct);
UpdateClusterGauges(nodes);
var rows = await db.ClusterNodeGenerationStates.AsNoTracking()
.Join(db.ClusterNodes.AsNoTracking(), s => s.NodeId, n => n.NodeId, (s, n) => new { s, n.ClusterId })
.ToListAsync(ct);
@@ -85,9 +93,63 @@ public sealed class FleetStatusPoller(
}
/// <summary>Exposed for tests — forces a snapshot reset so stub data re-seeds.</summary>
internal void ResetCache() => _last.Clear();
internal void ResetCache()
{
_last.Clear();
_lastRole.Clear();
}
private async Task PollRolesAsync(IReadOnlyList<ClusterNode> nodes, CancellationToken ct)
{
foreach (var n in nodes)
{
var hadPrior = _lastRole.TryGetValue(n.NodeId, out var priorRole);
if (hadPrior && priorRole == n.RedundancyRole) continue;
_lastRole[n.NodeId] = n.RedundancyRole;
if (!hadPrior) continue; // first-observation bootstrap — not a transition
redundancyMetrics.RecordRoleTransition(
clusterId: n.ClusterId, nodeId: n.NodeId,
fromRole: priorRole.ToString(), toRole: n.RedundancyRole.ToString());
var msg = new RoleChangedMessage(
ClusterId: n.ClusterId, NodeId: n.NodeId,
FromRole: priorRole.ToString(), ToRole: n.RedundancyRole.ToString(),
ObservedAtUtc: DateTime.UtcNow);
await fleetHub.Clients.Group(FleetStatusHub.GroupName(n.ClusterId))
.SendAsync("RoleChanged", msg, ct);
await fleetHub.Clients.Group(FleetStatusHub.FleetGroup)
.SendAsync("RoleChanged", msg, ct);
}
}
private void UpdateClusterGauges(IReadOnlyList<ClusterNode> nodes)
{
var staleCutoff = DateTime.UtcNow - Services.ClusterNodeService.StaleThreshold;
foreach (var group in nodes.GroupBy(n => n.ClusterId))
{
var primary = group.Count(n => n.RedundancyRole == RedundancyRole.Primary);
var secondary = group.Count(n => n.RedundancyRole == RedundancyRole.Secondary);
var stale = group.Count(n => n.LastSeenAt is null || n.LastSeenAt.Value < staleCutoff);
redundancyMetrics.SetClusterCounts(group.Key, primary, secondary, stale);
}
}
private readonly record struct NodeStateSnapshot(
string NodeId, string ClusterId, long? GenerationId,
string? Status, string? Error, DateTime? AppliedAt, DateTime? SeenAt);
}
/// <summary>
/// Pushed by <see cref="FleetStatusPoller"/> when it observes a change in
/// <see cref="ClusterNode.RedundancyRole"/>. Consumed by the Admin RedundancyTab to trigger
/// an instant reload instead of waiting for the next on-parameter-set poll.
/// </summary>
public sealed record RoleChangedMessage(
string ClusterId,
string NodeId,
string FromRole,
string ToRole,
DateTime ObservedAtUtc);

View File

@@ -1,6 +1,7 @@
using Microsoft.AspNetCore.Authentication;
using Microsoft.AspNetCore.Authentication.Cookies;
using Microsoft.EntityFrameworkCore;
using OpenTelemetry.Metrics;
using Serilog;
using ZB.MOM.WW.OtOpcUa.Admin.Components;
using ZB.MOM.WW.OtOpcUa.Admin.Hubs;
@@ -44,10 +45,17 @@ builder.Services.AddScoped<UnsService>();
builder.Services.AddScoped<NamespaceService>();
builder.Services.AddScoped<DriverInstanceService>();
builder.Services.AddScoped<NodeAclService>();
builder.Services.AddScoped<PermissionProbeService>();
builder.Services.AddScoped<AclChangeNotifier>();
builder.Services.AddScoped<ReservationService>();
builder.Services.AddScoped<DraftValidationService>();
builder.Services.AddScoped<AuditLogService>();
builder.Services.AddScoped<HostStatusService>();
builder.Services.AddScoped<ClusterNodeService>();
builder.Services.AddSingleton<RedundancyMetrics>();
builder.Services.AddScoped<EquipmentImportBatchService>();
builder.Services.AddScoped<ZB.MOM.WW.OtOpcUa.Configuration.Services.ILdapGroupRoleMappingService,
ZB.MOM.WW.OtOpcUa.Configuration.Services.LdapGroupRoleMappingService>();
// Cert-trust management — reads the OPC UA server's PKI store root so rejected client certs
// can be promoted to trusted via the Admin UI. Singleton: no per-request state, just
@@ -63,6 +71,19 @@ builder.Services.AddScoped<ILdapAuthService, LdapAuthService>();
// SignalR real-time fleet status + alerts (admin-ui.md §"Real-Time Updates").
builder.Services.AddHostedService<FleetStatusPoller>();
// OpenTelemetry Prometheus exporter — Meter stream from RedundancyMetrics + any future
// Admin-side instrumentation lands on the /metrics endpoint Prometheus scrapes. Pull-based
// means no OTel Collector deployment required for the common deploy-in-a-K8s case; appsettings
// Metrics:Prometheus:Enabled=false disables the endpoint entirely for locked-down deployments.
var metricsEnabled = builder.Configuration.GetValue("Metrics:Prometheus:Enabled", true);
if (metricsEnabled)
{
builder.Services.AddOpenTelemetry()
.WithMetrics(m => m
.AddMeter(RedundancyMetrics.MeterName)
.AddPrometheusExporter());
}
var app = builder.Build();
app.UseSerilogRequestLogging();
@@ -80,6 +101,15 @@ app.MapPost("/auth/logout", async (HttpContext ctx) =>
app.MapHub<FleetStatusHub>("/hubs/fleet");
app.MapHub<AlertHub>("/hubs/alerts");
if (metricsEnabled)
{
// Prometheus scrape endpoint — expose instrumentation registered in the OTel MeterProvider
// above. Emits text-format metrics at /metrics; auth is intentionally NOT required (Prometheus
// scrape jobs typically run on a trusted network). Operators who need auth put the endpoint
// behind a reverse-proxy basic-auth gate per fleet-ops convention.
app.MapPrometheusScrapingEndpoint();
}
app.MapRazorComponents<App>().AddInteractiveServerRenderMode();
await app.RunAsync();

View File

@@ -0,0 +1,49 @@
using Microsoft.AspNetCore.SignalR;
using ZB.MOM.WW.OtOpcUa.Admin.Hubs;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// Thin SignalR push helper for ACL + role-grant invalidation — slice 2 of task #196.
/// Lets the Admin services + razor pages invalidate connected peers' views without each
/// one having to know the hub wiring. Two message kinds: <c>NodeAclChanged</c> (cluster-scoped)
/// and <c>RoleGrantsChanged</c> (fleet-wide — role mappings cross cluster boundaries).
/// </summary>
/// <remarks>
/// Intentionally fire-and-forget — a failed hub send doesn't rollback the DB write that
/// triggered it. Worst-case an operator sees stale data until their next poll or manual
/// refresh; better than a transient hub blip blocking the authoritative write path.
/// </remarks>
public sealed class AclChangeNotifier(IHubContext<FleetStatusHub> fleetHub, ILogger<AclChangeNotifier> logger)
{
public async Task NotifyNodeAclChangedAsync(string clusterId, long generationId, CancellationToken ct)
{
try
{
var msg = new NodeAclChangedMessage(ClusterId: clusterId, GenerationId: generationId, ObservedAtUtc: DateTime.UtcNow);
await fleetHub.Clients.Group(FleetStatusHub.GroupName(clusterId))
.SendAsync("NodeAclChanged", msg, ct).ConfigureAwait(false);
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
logger.LogWarning(ex, "NodeAclChanged push failed for cluster {ClusterId} gen {GenerationId}", clusterId, generationId);
}
}
public async Task NotifyRoleGrantsChangedAsync(CancellationToken ct)
{
try
{
var msg = new RoleGrantsChangedMessage(ObservedAtUtc: DateTime.UtcNow);
await fleetHub.Clients.Group(FleetStatusHub.FleetGroup)
.SendAsync("RoleGrantsChanged", msg, ct).ConfigureAwait(false);
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
logger.LogWarning(ex, "RoleGrantsChanged push failed");
}
}
}
public sealed record NodeAclChangedMessage(string ClusterId, long GenerationId, DateTime ObservedAtUtc);
public sealed record RoleGrantsChangedMessage(DateTime ObservedAtUtc);

View File

@@ -0,0 +1,28 @@
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// Read-side service for ClusterNode rows + their cluster-scoped redundancy view. Consumed
/// by the RedundancyTab on the cluster detail page. Writes (role swap, node enable/disable)
/// are not supported here — role swap happens through the RedundancyCoordinator apply-lease
/// flow on the server side and would conflict with any direct DB mutation from Admin.
/// </summary>
public sealed class ClusterNodeService(OtOpcUaConfigDbContext db)
{
/// <summary>Stale-threshold matching <c>HostStatusService.StaleThreshold</c> — 30s of clock
/// tolerance covers a missed heartbeat plus publisher GC pauses.</summary>
public static readonly TimeSpan StaleThreshold = TimeSpan.FromSeconds(30);
public Task<List<ClusterNode>> ListByClusterAsync(string clusterId, CancellationToken ct) =>
db.ClusterNodes.AsNoTracking()
.Where(n => n.ClusterId == clusterId)
.OrderByDescending(n => n.ServiceLevelBase)
.ThenBy(n => n.NodeId)
.ToListAsync(ct);
public static bool IsStale(ClusterNode node) =>
node.LastSeenAt is null || DateTime.UtcNow - node.LastSeenAt.Value > StaleThreshold;
}

View File

@@ -0,0 +1,259 @@
using System.Globalization;
using System.Text;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// RFC 4180 CSV parser for equipment import per decision #95 and Phase 6.4 Stream B.1.
/// Produces a validated <see cref="EquipmentCsvParseResult"/> the caller (CSV import
/// modal + staging tables) consumes. Pure-parser concern — no DB access, no staging
/// writes; those live in the follow-up Stream B.2 work.
/// </summary>
/// <remarks>
/// <para><b>Header contract</b>: line 1 must be exactly <c># OtOpcUaCsv v1</c> (version
/// marker). Line 2 is the column header row. Unknown columns are rejected; required
/// columns must all be present. The version bump handshake lets future shapes parse
/// without ambiguity — v2 files go through a different parser variant.</para>
///
/// <para><b>Required columns</b> per decision #117: ZTag, MachineCode, SAPID,
/// EquipmentId, EquipmentUuid, Name, UnsAreaName, UnsLineName.</para>
///
/// <para><b>Optional columns</b> per decision #139: Manufacturer, Model, SerialNumber,
/// HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation,
/// ManufacturerUri, DeviceManualUri.</para>
///
/// <para><b>Row validation</b>: blank required field → rejected; duplicate ZTag within
/// the same file → rejected. Duplicate against the DB isn't detected here — the
/// staged-import finalize step (Stream B.4) catches that.</para>
/// </remarks>
public static class EquipmentCsvImporter
{
public const string VersionMarker = "# OtOpcUaCsv v1";
public static IReadOnlyList<string> RequiredColumns { get; } = new[]
{
"ZTag", "MachineCode", "SAPID", "EquipmentId", "EquipmentUuid",
"Name", "UnsAreaName", "UnsLineName",
};
public static IReadOnlyList<string> OptionalColumns { get; } = new[]
{
"Manufacturer", "Model", "SerialNumber", "HardwareRevision", "SoftwareRevision",
"YearOfConstruction", "AssetLocation", "ManufacturerUri", "DeviceManualUri",
};
public static EquipmentCsvParseResult Parse(string csvText)
{
ArgumentNullException.ThrowIfNull(csvText);
var rows = SplitLines(csvText);
if (rows.Count == 0)
throw new InvalidCsvFormatException("CSV is empty.");
if (!string.Equals(rows[0].Trim(), VersionMarker, StringComparison.Ordinal))
throw new InvalidCsvFormatException(
$"CSV header line 1 must be exactly '{VersionMarker}' — got '{rows[0]}'. " +
"Files without the version marker are rejected so future-format files don't parse ambiguously.");
if (rows.Count < 2)
throw new InvalidCsvFormatException("CSV has no column header row (line 2) or data rows.");
var headerCells = SplitCsvRow(rows[1]);
ValidateHeader(headerCells);
var accepted = new List<EquipmentCsvRow>();
var rejected = new List<EquipmentCsvRowError>();
var ztagsSeen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
var colIndex = headerCells
.Select((name, idx) => (name, idx))
.ToDictionary(t => t.name, t => t.idx, StringComparer.OrdinalIgnoreCase);
for (var i = 2; i < rows.Count; i++)
{
if (string.IsNullOrWhiteSpace(rows[i])) continue;
try
{
var cells = SplitCsvRow(rows[i]);
if (cells.Length != headerCells.Length)
{
rejected.Add(new EquipmentCsvRowError(
LineNumber: i + 1,
Reason: $"Column count {cells.Length} != header count {headerCells.Length}."));
continue;
}
var row = BuildRow(cells, colIndex);
var missing = RequiredColumns.Where(c => string.IsNullOrWhiteSpace(GetCell(row, c))).ToList();
if (missing.Count > 0)
{
rejected.Add(new EquipmentCsvRowError(i + 1, $"Blank required column(s): {string.Join(", ", missing)}"));
continue;
}
if (!ztagsSeen.Add(row.ZTag))
{
rejected.Add(new EquipmentCsvRowError(i + 1, $"Duplicate ZTag '{row.ZTag}' within file."));
continue;
}
accepted.Add(row);
}
catch (InvalidCsvFormatException ex)
{
rejected.Add(new EquipmentCsvRowError(i + 1, ex.Message));
}
}
return new EquipmentCsvParseResult(accepted, rejected);
}
private static void ValidateHeader(string[] headerCells)
{
var seen = new HashSet<string>(headerCells, StringComparer.OrdinalIgnoreCase);
// Missing required
var missingRequired = RequiredColumns.Where(r => !seen.Contains(r)).ToList();
if (missingRequired.Count > 0)
throw new InvalidCsvFormatException($"Header is missing required column(s): {string.Join(", ", missingRequired)}");
// Unknown columns (not in required optional)
var known = new HashSet<string>(RequiredColumns.Concat(OptionalColumns), StringComparer.OrdinalIgnoreCase);
var unknown = headerCells.Where(c => !known.Contains(c)).ToList();
if (unknown.Count > 0)
throw new InvalidCsvFormatException(
$"Header has unknown column(s): {string.Join(", ", unknown)}. " +
"Bump the version marker to define a new shape before adding columns.");
// Duplicates
var dupe = headerCells.GroupBy(c => c, StringComparer.OrdinalIgnoreCase).FirstOrDefault(g => g.Count() > 1);
if (dupe is not null)
throw new InvalidCsvFormatException($"Header has duplicate column '{dupe.Key}'.");
}
private static EquipmentCsvRow BuildRow(string[] cells, Dictionary<string, int> colIndex) => new()
{
ZTag = cells[colIndex["ZTag"]],
MachineCode = cells[colIndex["MachineCode"]],
SAPID = cells[colIndex["SAPID"]],
EquipmentId = cells[colIndex["EquipmentId"]],
EquipmentUuid = cells[colIndex["EquipmentUuid"]],
Name = cells[colIndex["Name"]],
UnsAreaName = cells[colIndex["UnsAreaName"]],
UnsLineName = cells[colIndex["UnsLineName"]],
Manufacturer = colIndex.TryGetValue("Manufacturer", out var mi) ? cells[mi] : null,
Model = colIndex.TryGetValue("Model", out var moi) ? cells[moi] : null,
SerialNumber = colIndex.TryGetValue("SerialNumber", out var si) ? cells[si] : null,
HardwareRevision = colIndex.TryGetValue("HardwareRevision", out var hi) ? cells[hi] : null,
SoftwareRevision = colIndex.TryGetValue("SoftwareRevision", out var swi) ? cells[swi] : null,
YearOfConstruction = colIndex.TryGetValue("YearOfConstruction", out var yi) ? cells[yi] : null,
AssetLocation = colIndex.TryGetValue("AssetLocation", out var ai) ? cells[ai] : null,
ManufacturerUri = colIndex.TryGetValue("ManufacturerUri", out var mui) ? cells[mui] : null,
DeviceManualUri = colIndex.TryGetValue("DeviceManualUri", out var dui) ? cells[dui] : null,
};
private static string GetCell(EquipmentCsvRow row, string colName) => colName switch
{
"ZTag" => row.ZTag,
"MachineCode" => row.MachineCode,
"SAPID" => row.SAPID,
"EquipmentId" => row.EquipmentId,
"EquipmentUuid" => row.EquipmentUuid,
"Name" => row.Name,
"UnsAreaName" => row.UnsAreaName,
"UnsLineName" => row.UnsLineName,
_ => string.Empty,
};
/// <summary>Split the raw text on line boundaries. Handles \r\n + \n + \r.</summary>
private static List<string> SplitLines(string csv) =>
csv.Split(["\r\n", "\n", "\r"], StringSplitOptions.None).ToList();
/// <summary>Split one CSV row with RFC 4180 quoted-field handling.</summary>
private static string[] SplitCsvRow(string row)
{
var cells = new List<string>();
var sb = new StringBuilder();
var inQuotes = false;
for (var i = 0; i < row.Length; i++)
{
var ch = row[i];
if (inQuotes)
{
if (ch == '"')
{
// Escaped quote "" inside quoted field.
if (i + 1 < row.Length && row[i + 1] == '"')
{
sb.Append('"');
i++;
}
else
{
inQuotes = false;
}
}
else
{
sb.Append(ch);
}
}
else
{
if (ch == ',')
{
cells.Add(sb.ToString());
sb.Clear();
}
else if (ch == '"' && sb.Length == 0)
{
inQuotes = true;
}
else
{
sb.Append(ch);
}
}
}
cells.Add(sb.ToString());
return cells.ToArray();
}
}
/// <summary>One parsed equipment row with required + optional fields.</summary>
public sealed class EquipmentCsvRow
{
// Required (decision #117)
public required string ZTag { get; init; }
public required string MachineCode { get; init; }
public required string SAPID { get; init; }
public required string EquipmentId { get; init; }
public required string EquipmentUuid { get; init; }
public required string Name { get; init; }
public required string UnsAreaName { get; init; }
public required string UnsLineName { get; init; }
// Optional (decision #139 — OPC 40010 Identification fields)
public string? Manufacturer { get; init; }
public string? Model { get; init; }
public string? SerialNumber { get; init; }
public string? HardwareRevision { get; init; }
public string? SoftwareRevision { get; init; }
public string? YearOfConstruction { get; init; }
public string? AssetLocation { get; init; }
public string? ManufacturerUri { get; init; }
public string? DeviceManualUri { get; init; }
}
/// <summary>One row-level rejection captured by the parser. Line-number is 1-based in the source file.</summary>
public sealed record EquipmentCsvRowError(int LineNumber, string Reason);
/// <summary>Parser output — accepted rows land in staging; rejected rows surface in the preview modal.</summary>
public sealed record EquipmentCsvParseResult(
IReadOnlyList<EquipmentCsvRow> AcceptedRows,
IReadOnlyList<EquipmentCsvRowError> RejectedRows);
/// <summary>Thrown for file-level format problems (missing version marker, bad header, etc.).</summary>
public sealed class InvalidCsvFormatException(string message) : Exception(message);

View File

@@ -0,0 +1,324 @@
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Admin.Services;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// Staged-import orchestrator per Phase 6.4 Stream B.2-B.4. Covers the four operator
/// actions: CreateBatch → StageRows (chunked) → FinaliseBatch (atomic apply into
/// <see cref="Equipment"/>) → DropBatch (rollback of pre-finalise state).
/// </summary>
/// <remarks>
/// <para>FinaliseBatch runs inside one EF transaction + bulk-inserts accepted rows into
/// <see cref="Equipment"/>. Rejected rows stay behind as audit evidence; the batch row
/// gains <see cref="EquipmentImportBatch.FinalisedAtUtc"/> so future writes know it's
/// archived. DropBatch removes the batch + its cascaded rows.</para>
///
/// <para>Idempotence: calling FinaliseBatch twice throws <see cref="ImportBatchAlreadyFinalisedException"/>
/// rather than double-inserting. Operator refreshes the admin page to see the first
/// finalise completed.</para>
///
/// <para>ExternalIdReservation merging (ZTag + SAPID uniqueness) is NOT done here — a
/// narrower follow-up wires it once the concurrent-insert test matrix is green.</para>
/// </remarks>
public sealed class EquipmentImportBatchService(OtOpcUaConfigDbContext db)
{
/// <summary>Create a new empty batch header. Returns the row with Id populated.</summary>
public async Task<EquipmentImportBatch> CreateBatchAsync(string clusterId, string createdBy, CancellationToken ct)
{
ArgumentException.ThrowIfNullOrWhiteSpace(clusterId);
ArgumentException.ThrowIfNullOrWhiteSpace(createdBy);
var batch = new EquipmentImportBatch
{
Id = Guid.NewGuid(),
ClusterId = clusterId,
CreatedBy = createdBy,
CreatedAtUtc = DateTime.UtcNow,
};
db.EquipmentImportBatches.Add(batch);
await db.SaveChangesAsync(ct).ConfigureAwait(false);
return batch;
}
/// <summary>
/// Stage one chunk of rows into the batch. Caller usually feeds
/// <see cref="EquipmentCsvImporter.Parse"/> output here — each
/// <see cref="EquipmentCsvRow"/> becomes one accepted <see cref="EquipmentImportRow"/>,
/// each rejected parser error becomes one row with <see cref="EquipmentImportRow.IsAccepted"/> false.
/// </summary>
public async Task StageRowsAsync(
Guid batchId,
IReadOnlyList<EquipmentCsvRow> acceptedRows,
IReadOnlyList<EquipmentCsvRowError> rejectedRows,
CancellationToken ct)
{
var batch = await db.EquipmentImportBatches.FirstOrDefaultAsync(b => b.Id == batchId, ct).ConfigureAwait(false)
?? throw new ImportBatchNotFoundException($"Batch {batchId} not found.");
if (batch.FinalisedAtUtc is not null)
throw new ImportBatchAlreadyFinalisedException(
$"Batch {batchId} finalised at {batch.FinalisedAtUtc:o}; no more rows can be staged.");
foreach (var row in acceptedRows)
{
db.EquipmentImportRows.Add(new EquipmentImportRow
{
Id = Guid.NewGuid(),
BatchId = batchId,
IsAccepted = true,
ZTag = row.ZTag,
MachineCode = row.MachineCode,
SAPID = row.SAPID,
EquipmentId = row.EquipmentId,
EquipmentUuid = row.EquipmentUuid,
Name = row.Name,
UnsAreaName = row.UnsAreaName,
UnsLineName = row.UnsLineName,
Manufacturer = row.Manufacturer,
Model = row.Model,
SerialNumber = row.SerialNumber,
HardwareRevision = row.HardwareRevision,
SoftwareRevision = row.SoftwareRevision,
YearOfConstruction = row.YearOfConstruction,
AssetLocation = row.AssetLocation,
ManufacturerUri = row.ManufacturerUri,
DeviceManualUri = row.DeviceManualUri,
});
}
foreach (var error in rejectedRows)
{
db.EquipmentImportRows.Add(new EquipmentImportRow
{
Id = Guid.NewGuid(),
BatchId = batchId,
IsAccepted = false,
RejectReason = error.Reason,
LineNumberInFile = error.LineNumber,
// Required columns need values for EF; reject rows use sentinel placeholders.
ZTag = "", MachineCode = "", SAPID = "", EquipmentId = "", EquipmentUuid = "",
Name = "", UnsAreaName = "", UnsLineName = "",
});
}
batch.RowsStaged += acceptedRows.Count + rejectedRows.Count;
batch.RowsAccepted += acceptedRows.Count;
batch.RowsRejected += rejectedRows.Count;
await db.SaveChangesAsync(ct).ConfigureAwait(false);
}
/// <summary>Drop the batch (pre-finalise rollback). Cascaded row delete removes staged rows.</summary>
public async Task DropBatchAsync(Guid batchId, CancellationToken ct)
{
var batch = await db.EquipmentImportBatches.FirstOrDefaultAsync(b => b.Id == batchId, ct).ConfigureAwait(false);
if (batch is null) return;
if (batch.FinalisedAtUtc is not null)
throw new ImportBatchAlreadyFinalisedException(
$"Batch {batchId} already finalised at {batch.FinalisedAtUtc:o}; cannot drop.");
db.EquipmentImportBatches.Remove(batch);
await db.SaveChangesAsync(ct).ConfigureAwait(false);
}
/// <summary>
/// Atomic finalise. Inserts every accepted row into the live
/// <see cref="Equipment"/> table under the target generation + stamps
/// <see cref="EquipmentImportBatch.FinalisedAtUtc"/>. Failure rolls the whole tx
/// back — <see cref="Equipment"/> never partially mutates.
/// </summary>
public async Task FinaliseBatchAsync(
Guid batchId, long generationId, string driverInstanceIdForRows, string unsLineIdForRows, CancellationToken ct)
{
var batch = await db.EquipmentImportBatches
.Include(b => b.Rows)
.FirstOrDefaultAsync(b => b.Id == batchId, ct)
.ConfigureAwait(false)
?? throw new ImportBatchNotFoundException($"Batch {batchId} not found.");
if (batch.FinalisedAtUtc is not null)
throw new ImportBatchAlreadyFinalisedException(
$"Batch {batchId} already finalised at {batch.FinalisedAtUtc:o}.");
// EF InMemory provider doesn't honour BeginTransaction; SQL Server provider does.
// Tests run the happy path under in-memory; production SQL Server runs the atomic tx.
var supportsTx = db.Database.IsRelational();
Microsoft.EntityFrameworkCore.Storage.IDbContextTransaction? tx = null;
if (supportsTx)
tx = await db.Database.BeginTransactionAsync(ct).ConfigureAwait(false);
try
{
// Snapshot active reservations that overlap this batch's ZTag + SAPID set — one
// round-trip instead of N. Released rows (ReleasedAt IS NOT NULL) are ignored so
// an explicitly-released value can be reused.
var accepted = batch.Rows.Where(r => r.IsAccepted).ToList();
var zTags = accepted.Where(r => !string.IsNullOrWhiteSpace(r.ZTag))
.Select(r => r.ZTag).Distinct(StringComparer.OrdinalIgnoreCase).ToList();
var sapIds = accepted.Where(r => !string.IsNullOrWhiteSpace(r.SAPID))
.Select(r => r.SAPID).Distinct(StringComparer.OrdinalIgnoreCase).ToList();
var existingReservations = await db.ExternalIdReservations
.Where(r => r.ReleasedAt == null &&
((r.Kind == ReservationKind.ZTag && zTags.Contains(r.Value)) ||
(r.Kind == ReservationKind.SAPID && sapIds.Contains(r.Value))))
.ToListAsync(ct).ConfigureAwait(false);
var resByKey = existingReservations.ToDictionary(
r => (r.Kind, r.Value.ToLowerInvariant()),
r => r);
var nowUtc = DateTime.UtcNow;
var firstPublishedBy = batch.CreatedBy;
foreach (var row in accepted)
{
var equipmentUuid = Guid.TryParse(row.EquipmentUuid, out var u) ? u : Guid.NewGuid();
db.Equipment.Add(new Equipment
{
EquipmentRowId = Guid.NewGuid(),
GenerationId = generationId,
EquipmentId = row.EquipmentId,
EquipmentUuid = equipmentUuid,
DriverInstanceId = driverInstanceIdForRows,
UnsLineId = unsLineIdForRows,
Name = row.Name,
MachineCode = row.MachineCode,
ZTag = row.ZTag,
SAPID = row.SAPID,
Manufacturer = row.Manufacturer,
Model = row.Model,
SerialNumber = row.SerialNumber,
HardwareRevision = row.HardwareRevision,
SoftwareRevision = row.SoftwareRevision,
YearOfConstruction = short.TryParse(row.YearOfConstruction, out var y) ? y : null,
AssetLocation = row.AssetLocation,
ManufacturerUri = row.ManufacturerUri,
DeviceManualUri = row.DeviceManualUri,
});
MergeReservation(row.ZTag, ReservationKind.ZTag, equipmentUuid, batch.ClusterId,
firstPublishedBy, nowUtc, resByKey);
MergeReservation(row.SAPID, ReservationKind.SAPID, equipmentUuid, batch.ClusterId,
firstPublishedBy, nowUtc, resByKey);
}
batch.FinalisedAtUtc = nowUtc;
try
{
await db.SaveChangesAsync(ct).ConfigureAwait(false);
}
catch (DbUpdateException ex) when (IsReservationUniquenessViolation(ex))
{
throw new ExternalIdReservationConflictException(
"Finalise rejected: one or more ZTag/SAPID values were reserved by another operator " +
"between batch preview and commit. Inspect active reservations + retry after resolving the conflict.",
ex);
}
if (tx is not null) await tx.CommitAsync(ct).ConfigureAwait(false);
}
catch
{
if (tx is not null) await tx.RollbackAsync(ct).ConfigureAwait(false);
throw;
}
finally
{
if (tx is not null) await tx.DisposeAsync().ConfigureAwait(false);
}
}
/// <summary>
/// Merge one external-ID reservation for an equipment row. Three outcomes:
/// (1) value is empty → skip; (2) reservation exists for same <paramref name="equipmentUuid"/>
/// → bump <c>LastPublishedAt</c>; (3) reservation exists for a different EquipmentUuid
/// → throw <see cref="ExternalIdReservationConflictException"/> with the conflicting UUID
/// so the caller sees which equipment already owns the value; (4) no reservation → create new.
/// </summary>
private void MergeReservation(
string? value,
ReservationKind kind,
Guid equipmentUuid,
string clusterId,
string firstPublishedBy,
DateTime nowUtc,
Dictionary<(ReservationKind, string), ExternalIdReservation> cache)
{
if (string.IsNullOrWhiteSpace(value)) return;
var key = (kind, value.ToLowerInvariant());
if (cache.TryGetValue(key, out var existing))
{
if (existing.EquipmentUuid != equipmentUuid)
throw new ExternalIdReservationConflictException(
$"{kind} '{value}' is already reserved by EquipmentUuid {existing.EquipmentUuid} " +
$"(first published {existing.FirstPublishedAt:u} on cluster '{existing.ClusterId}'). " +
$"Refusing to re-assign to {equipmentUuid}.");
existing.LastPublishedAt = nowUtc;
return;
}
var fresh = new ExternalIdReservation
{
ReservationId = Guid.NewGuid(),
Kind = kind,
Value = value,
EquipmentUuid = equipmentUuid,
ClusterId = clusterId,
FirstPublishedAt = nowUtc,
FirstPublishedBy = firstPublishedBy,
LastPublishedAt = nowUtc,
};
db.ExternalIdReservations.Add(fresh);
cache[key] = fresh;
}
/// <summary>
/// True when the <see cref="DbUpdateException"/> root-cause was the filtered-unique
/// index <c>UX_ExternalIdReservation_KindValue_Active</c> — i.e. another transaction
/// won the race between our cache-load + commit. SQL Server surfaces this as 2601 / 2627.
/// </summary>
private static bool IsReservationUniquenessViolation(DbUpdateException ex)
{
for (Exception? inner = ex; inner is not null; inner = inner.InnerException)
{
if (inner is Microsoft.Data.SqlClient.SqlException sql &&
(sql.Number == 2601 || sql.Number == 2627) &&
sql.Message.Contains("UX_ExternalIdReservation_KindValue_Active", StringComparison.OrdinalIgnoreCase))
{
return true;
}
}
return false;
}
/// <summary>List batches created by the given user. Finalised batches are archived; include them on demand.</summary>
public async Task<IReadOnlyList<EquipmentImportBatch>> ListByUserAsync(string createdBy, bool includeFinalised, CancellationToken ct)
{
var query = db.EquipmentImportBatches.AsNoTracking().Where(b => b.CreatedBy == createdBy);
if (!includeFinalised)
query = query.Where(b => b.FinalisedAtUtc == null);
return await query.OrderByDescending(b => b.CreatedAtUtc).ToListAsync(ct).ConfigureAwait(false);
}
}
public sealed class ImportBatchNotFoundException(string message) : Exception(message);
public sealed class ImportBatchAlreadyFinalisedException(string message) : Exception(message);
/// <summary>
/// Thrown when a <c>FinaliseBatchAsync</c> call detects that one of its ZTag/SAPID values is
/// already reserved by a different EquipmentUuid — either from a prior published generation
/// or a concurrent finalise that won the race. The operator sees the message + the conflicting
/// equipment ownership so they can resolve the conflict (pick a new ZTag, release the existing
/// reservation via <c>sp_ReleaseExternalIdReservation</c>, etc.) and retry the finalise.
/// </summary>
public sealed class ExternalIdReservationConflictException : Exception
{
public ExternalIdReservationConflictException(string message) : base(message) { }
public ExternalIdReservationConflictException(string message, Exception inner) : base(message, inner) { }
}

View File

@@ -7,8 +7,9 @@ namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// One row per <see cref="DriverHostStatus"/> record, enriched with the owning
/// <c>ClusterNode.ClusterId</c> when available (left-join). The Admin <c>/hosts</c> page
/// groups by cluster and renders a per-node → per-driver → per-host tree.
/// <c>ClusterNode.ClusterId</c> (left-join) + the per-<c>(DriverInstanceId, HostName)</c>
/// <see cref="DriverInstanceResilienceStatus"/> counters (also left-join) so the Admin
/// <c>/hosts</c> page renders the resilience surface inline with host state.
/// </summary>
public sealed record HostStatusRow(
string NodeId,
@@ -18,7 +19,11 @@ public sealed record HostStatusRow(
DriverHostState State,
DateTime StateChangedUtc,
DateTime LastSeenUtc,
string? Detail);
string? Detail,
int ConsecutiveFailures,
DateTime? LastCircuitBreakerOpenUtc,
int CurrentBulkheadDepth,
DateTime? LastRecycleUtc);
/// <summary>
/// Read-side service for the Admin UI's per-host drill-down. Loads
@@ -36,15 +41,26 @@ public sealed class HostStatusService(OtOpcUaConfigDbContext db)
{
public static readonly TimeSpan StaleThreshold = TimeSpan.FromSeconds(30);
/// <summary>Consecutive-failure threshold at which <see cref="IsFlagged"/> returns <c>true</c>
/// so the Admin UI can paint a red badge. Matches Phase 6.1 decision #143's conservative
/// half-of-breaker-threshold convention — flags before the breaker actually opens.</summary>
public const int FailureFlagThreshold = 3;
public async Task<IReadOnlyList<HostStatusRow>> ListAsync(CancellationToken ct = default)
{
// LEFT JOIN on NodeId so a row persists even when its owning ClusterNode row hasn't
// been created yet (first-boot bootstrap case — keeps the UI from losing sight of
// the reporting server).
// Two LEFT JOINs:
// 1. ClusterNodes on NodeId — row persists even when its owning ClusterNode row
// hasn't been created yet (first-boot bootstrap case).
// 2. DriverInstanceResilienceStatuses on (DriverInstanceId, HostName) — resilience
// counters haven't been sampled yet for brand-new hosts, so a missing row means
// zero failures + never-opened breaker.
var rows = await (from s in db.DriverHostStatuses.AsNoTracking()
join n in db.ClusterNodes.AsNoTracking()
on s.NodeId equals n.NodeId into nodeJoin
from n in nodeJoin.DefaultIfEmpty()
join r in db.DriverInstanceResilienceStatuses.AsNoTracking()
on new { s.DriverInstanceId, s.HostName } equals new { r.DriverInstanceId, r.HostName } into resilJoin
from r in resilJoin.DefaultIfEmpty()
orderby s.NodeId, s.DriverInstanceId, s.HostName
select new HostStatusRow(
s.NodeId,
@@ -54,10 +70,21 @@ public sealed class HostStatusService(OtOpcUaConfigDbContext db)
s.State,
s.StateChangedUtc,
s.LastSeenUtc,
s.Detail)).ToListAsync(ct);
s.Detail,
r != null ? r.ConsecutiveFailures : 0,
r != null ? r.LastCircuitBreakerOpenUtc : null,
r != null ? r.CurrentBulkheadDepth : 0,
r != null ? r.LastRecycleUtc : null)).ToListAsync(ct);
return rows;
}
public static bool IsStale(HostStatusRow row) =>
DateTime.UtcNow - row.LastSeenUtc > StaleThreshold;
/// <summary>
/// Red-badge predicate — <c>true</c> when the host has accumulated enough consecutive
/// failures that an operator should take notice before the breaker trips.
/// </summary>
public static bool IsFlagged(HostStatusRow row) =>
row.ConsecutiveFailures >= FailureFlagThreshold;
}

View File

@@ -5,7 +5,7 @@ using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
public sealed class NodeAclService(OtOpcUaConfigDbContext db)
public sealed class NodeAclService(OtOpcUaConfigDbContext db, AclChangeNotifier? notifier = null)
{
public Task<List<NodeAcl>> ListAsync(long generationId, CancellationToken ct) =>
db.NodeAcls.AsNoTracking()
@@ -31,6 +31,10 @@ public sealed class NodeAclService(OtOpcUaConfigDbContext db)
};
db.NodeAcls.Add(acl);
await db.SaveChangesAsync(ct);
if (notifier is not null)
await notifier.NotifyNodeAclChangedAsync(clusterId, draftId, ct);
return acl;
}
@@ -40,5 +44,8 @@ public sealed class NodeAclService(OtOpcUaConfigDbContext db)
if (row is null) return;
db.NodeAcls.Remove(row);
await db.SaveChangesAsync(ct);
if (notifier is not null)
await notifier.NotifyNodeAclChangedAsync(row.ClusterId, row.GenerationId, ct);
}
}

View File

@@ -0,0 +1,63 @@
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
using ZB.MOM.WW.OtOpcUa.Core.Authorization;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// Runs an ad-hoc permission probe against a draft or published generation's NodeAcl rows —
/// "if LDAP group X asks for permission Y on node Z, would the trie grant it, and which
/// rows contributed?" Powers the AclsTab "Probe this permission" form per the #196 sub-slice.
/// </summary>
/// <remarks>
/// Thin wrapper over <see cref="PermissionTrieBuilder"/> + <see cref="PermissionTrie.CollectMatches"/> —
/// the same code path the Server's dispatch layer uses at request time, so a probe result
/// is guaranteed to match what the live server would decide. The probe is read-only + has
/// no side effects; failing probes do NOT generate audit log rows.
/// </remarks>
public sealed class PermissionProbeService(OtOpcUaConfigDbContext db)
{
/// <summary>
/// Evaluate <paramref name="required"/> against the NodeAcl rows of
/// <paramref name="generationId"/> for a request by <paramref name="ldapGroup"/> at
/// <paramref name="scope"/>. Returns whether the permission would be granted + the list
/// of matching grants so the UI can show *why*.
/// </summary>
public async Task<PermissionProbeResult> ProbeAsync(
long generationId,
string ldapGroup,
NodeScope scope,
NodePermissions required,
CancellationToken ct)
{
ArgumentException.ThrowIfNullOrWhiteSpace(ldapGroup);
ArgumentNullException.ThrowIfNull(scope);
var rows = await db.NodeAcls.AsNoTracking()
.Where(a => a.GenerationId == generationId && a.ClusterId == scope.ClusterId)
.ToListAsync(ct).ConfigureAwait(false);
var trie = PermissionTrieBuilder.Build(scope.ClusterId, generationId, rows);
var matches = trie.CollectMatches(scope, [ldapGroup]);
var effective = NodePermissions.None;
foreach (var m in matches)
effective |= m.PermissionFlags;
var granted = (effective & required) == required;
return new PermissionProbeResult(
Granted: granted,
Required: required,
Effective: effective,
Matches: matches);
}
}
/// <summary>Outcome of a <see cref="PermissionProbeService.ProbeAsync"/> call.</summary>
public sealed record PermissionProbeResult(
bool Granted,
NodePermissions Required,
NodePermissions Effective,
IReadOnlyList<MatchedGrant> Matches);

View File

@@ -0,0 +1,102 @@
using System.Diagnostics.Metrics;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// OpenTelemetry-compatible instrumentation for the redundancy surface. Uses in-box
/// <see cref="System.Diagnostics.Metrics"/> so no NuGet dependency is required to emit —
/// any MeterListener (dotnet-counters, OpenTelemetry.Extensions.Hosting OTLP exporter,
/// Prometheus exporter, etc.) picks up the instruments by the <see cref="MeterName"/>.
/// </summary>
/// <remarks>
/// Exporter configuration (OTLP, Prometheus, etc.) is intentionally NOT wired here —
/// that's a deployment-ops decision that belongs in <c>Program.cs</c> behind an
/// <c>appsettings</c> toggle. This class owns only the Meter + instruments so the
/// production data stream exists regardless of exporter availability.
///
/// Counter + gauge names follow the otel-semantic-conventions pattern:
/// <c>otopcua.redundancy.*</c> with tags for ClusterId + (for transitions) FromRole/ToRole/NodeId.
/// </remarks>
public sealed class RedundancyMetrics : IDisposable
{
public const string MeterName = "ZB.MOM.WW.OtOpcUa.Redundancy";
private readonly Meter _meter;
private readonly Counter<long> _roleTransitions;
private readonly object _gaugeLock = new();
private readonly Dictionary<string, ClusterGaugeState> _gaugeState = new();
public RedundancyMetrics()
{
_meter = new Meter(MeterName, version: "1.0.0");
_roleTransitions = _meter.CreateCounter<long>(
"otopcua.redundancy.role_transition",
unit: "{transition}",
description: "Observed RedundancyRole changes per node — tagged FromRole, ToRole, NodeId, ClusterId.");
// Observable gauges — the callback reports whatever the last Observe*Count call stashed.
_meter.CreateObservableGauge(
"otopcua.redundancy.primary_count",
ObservePrimaryCounts,
unit: "{node}",
description: "Count of Primary-role nodes per cluster (should be 1 for N+1 redundant clusters, 0 during failover).");
_meter.CreateObservableGauge(
"otopcua.redundancy.secondary_count",
ObserveSecondaryCounts,
unit: "{node}",
description: "Count of Secondary-role nodes per cluster.");
_meter.CreateObservableGauge(
"otopcua.redundancy.stale_count",
ObserveStaleCounts,
unit: "{node}",
description: "Count of cluster nodes whose LastSeenAt is older than StaleThreshold.");
}
/// <summary>
/// Update the per-cluster snapshot consumed by the ObservableGauges. Poller calls this
/// at the end of every tick so the collectors see fresh numbers on the next observation
/// window (by default 1s for dotnet-counters, configurable per exporter).
/// </summary>
public void SetClusterCounts(string clusterId, int primary, int secondary, int stale)
{
lock (_gaugeLock)
{
_gaugeState[clusterId] = new ClusterGaugeState(primary, secondary, stale);
}
}
/// <summary>
/// Increment the role_transition counter when a node's RedundancyRole changes. Tags
/// allow breakdowns by from/to roles (e.g. Primary → Secondary for planned failover vs
/// Primary → Standalone for emergency recovery) + by cluster for multi-site fleets.
/// </summary>
public void RecordRoleTransition(string clusterId, string nodeId, string fromRole, string toRole)
{
_roleTransitions.Add(1,
new KeyValuePair<string, object?>("cluster.id", clusterId),
new KeyValuePair<string, object?>("node.id", nodeId),
new KeyValuePair<string, object?>("from_role", fromRole),
new KeyValuePair<string, object?>("to_role", toRole));
}
public void Dispose() => _meter.Dispose();
private IEnumerable<Measurement<long>> ObservePrimaryCounts() => SnapshotGauge(s => s.Primary);
private IEnumerable<Measurement<long>> ObserveSecondaryCounts() => SnapshotGauge(s => s.Secondary);
private IEnumerable<Measurement<long>> ObserveStaleCounts() => SnapshotGauge(s => s.Stale);
private IEnumerable<Measurement<long>> SnapshotGauge(Func<ClusterGaugeState, int> selector)
{
List<Measurement<long>> results;
lock (_gaugeLock)
{
results = new List<Measurement<long>>(_gaugeState.Count);
foreach (var (cluster, state) in _gaugeState)
results.Add(new Measurement<long>(selector(state),
new KeyValuePair<string, object?>("cluster.id", cluster)));
}
return results;
}
private readonly record struct ClusterGaugeState(int Primary, int Secondary, int Stale);
}

View File

@@ -0,0 +1,213 @@
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// Pure-function impact preview for UNS structural moves per Phase 6.4 Stream A.2. Given
/// a <see cref="UnsMoveOperation"/> plus a snapshot of the draft's UNS tree and its
/// equipment + tag counts, returns an <see cref="UnsImpactPreview"/> the Admin UI shows
/// in a confirmation modal before committing the move.
/// </summary>
/// <remarks>
/// <para>Stateless + deterministic — testable without EF or a live draft. The caller
/// (Razor page) loads the draft's snapshot via the normal Configuration services, passes
/// it in, and the analyzer counts + categorises the impact. The returned
/// <see cref="UnsImpactPreview.RevisionToken"/> is the token the caller must re-check at
/// confirm time; a mismatch means another operator mutated the draft between preview +
/// confirm and the operation needs to be refreshed (decision on concurrent-edit safety
/// in Phase 6.4 Scope).</para>
///
/// <para>Cross-cluster moves are rejected here (decision #82) — equipment is
/// cluster-scoped; the UI disables the drop target and surfaces an Export/Import workflow
/// toast instead.</para>
/// </remarks>
public static class UnsImpactAnalyzer
{
/// <summary>Run the analyzer. Returns a populated preview or throws for invalid operations.</summary>
public static UnsImpactPreview Analyze(UnsTreeSnapshot snapshot, UnsMoveOperation move)
{
ArgumentNullException.ThrowIfNull(snapshot);
ArgumentNullException.ThrowIfNull(move);
// Cross-cluster guard — the analyzer refuses rather than silently re-homing.
if (!string.Equals(move.SourceClusterId, move.TargetClusterId, StringComparison.OrdinalIgnoreCase))
throw new CrossClusterMoveRejectedException(
"Equipment is cluster-scoped (decision #82). Use Export → Import to migrate equipment " +
"across clusters; drag/drop rejected.");
return move.Kind switch
{
UnsMoveKind.LineMove => AnalyzeLineMove(snapshot, move),
UnsMoveKind.AreaRename => AnalyzeAreaRename(snapshot, move),
UnsMoveKind.LineMerge => AnalyzeLineMerge(snapshot, move),
_ => throw new ArgumentOutOfRangeException(nameof(move), move.Kind, $"Unsupported move kind {move.Kind}"),
};
}
private static UnsImpactPreview AnalyzeLineMove(UnsTreeSnapshot snapshot, UnsMoveOperation move)
{
var line = snapshot.FindLine(move.SourceLineId!)
?? throw new UnsMoveValidationException($"Source line '{move.SourceLineId}' not found in draft {snapshot.DraftGenerationId}.");
var targetArea = snapshot.FindArea(move.TargetAreaId!)
?? throw new UnsMoveValidationException($"Target area '{move.TargetAreaId}' not found in draft {snapshot.DraftGenerationId}.");
var warnings = new List<string>();
if (targetArea.LineIds.Contains(line.LineId, StringComparer.OrdinalIgnoreCase))
warnings.Add($"Target area '{targetArea.Name}' already contains line '{line.Name}' — dropping a no-op move.");
// If the target area has a line with the same display name as the mover, warn about
// visual ambiguity even though the IDs differ (operators frequently reuse line names).
if (targetArea.LineIds.Any(lid =>
snapshot.FindLine(lid) is { } sibling &&
string.Equals(sibling.Name, line.Name, StringComparison.OrdinalIgnoreCase) &&
!string.Equals(sibling.LineId, line.LineId, StringComparison.OrdinalIgnoreCase)))
{
warnings.Add($"Target area '{targetArea.Name}' already has a line named '{line.Name}'. Consider renaming before the move.");
}
return new UnsImpactPreview
{
AffectedEquipmentCount = line.EquipmentCount,
AffectedTagCount = line.TagCount,
CascadeWarnings = warnings,
RevisionToken = snapshot.RevisionToken,
HumanReadableSummary =
$"Moving line '{line.Name}' from area '{snapshot.FindAreaByLineId(line.LineId)?.Name ?? "?"}' " +
$"to '{targetArea.Name}' will re-home {line.EquipmentCount} equipment + re-parent {line.TagCount} tags.",
};
}
private static UnsImpactPreview AnalyzeAreaRename(UnsTreeSnapshot snapshot, UnsMoveOperation move)
{
var area = snapshot.FindArea(move.SourceAreaId!)
?? throw new UnsMoveValidationException($"Source area '{move.SourceAreaId}' not found in draft {snapshot.DraftGenerationId}.");
var affectedEquipment = area.LineIds
.Select(lid => snapshot.FindLine(lid)?.EquipmentCount ?? 0)
.Sum();
var affectedTags = area.LineIds
.Select(lid => snapshot.FindLine(lid)?.TagCount ?? 0)
.Sum();
return new UnsImpactPreview
{
AffectedEquipmentCount = affectedEquipment,
AffectedTagCount = affectedTags,
CascadeWarnings = [],
RevisionToken = snapshot.RevisionToken,
HumanReadableSummary =
$"Renaming area '{area.Name}' → '{move.NewName}' cascades to {area.LineIds.Count} lines / " +
$"{affectedEquipment} equipment / {affectedTags} tags.",
};
}
private static UnsImpactPreview AnalyzeLineMerge(UnsTreeSnapshot snapshot, UnsMoveOperation move)
{
var src = snapshot.FindLine(move.SourceLineId!)
?? throw new UnsMoveValidationException($"Source line '{move.SourceLineId}' not found.");
var dst = snapshot.FindLine(move.TargetLineId!)
?? throw new UnsMoveValidationException($"Target line '{move.TargetLineId}' not found.");
var warnings = new List<string>();
if (!string.Equals(snapshot.FindAreaByLineId(src.LineId)?.AreaId,
snapshot.FindAreaByLineId(dst.LineId)?.AreaId,
StringComparison.OrdinalIgnoreCase))
{
warnings.Add($"Lines '{src.Name}' and '{dst.Name}' are in different areas. The merge will re-parent equipment + tags into '{dst.Name}'s area.");
}
return new UnsImpactPreview
{
AffectedEquipmentCount = src.EquipmentCount,
AffectedTagCount = src.TagCount,
CascadeWarnings = warnings,
RevisionToken = snapshot.RevisionToken,
HumanReadableSummary =
$"Merging line '{src.Name}' into '{dst.Name}': {src.EquipmentCount} equipment + {src.TagCount} tags re-parent. " +
$"The source line is deleted at commit.",
};
}
}
/// <summary>Kind of UNS structural move the analyzer understands.</summary>
public enum UnsMoveKind
{
/// <summary>Drag a whole line from one area to another.</summary>
LineMove,
/// <summary>Rename an area (cascades to the UNS paths of every equipment + tag below it).</summary>
AreaRename,
/// <summary>Merge two lines into one; source line's equipment + tags are re-parented.</summary>
LineMerge,
}
/// <summary>One UNS structural move request.</summary>
/// <param name="Kind">Move variant — selects which source + target fields are required.</param>
/// <param name="SourceClusterId">Cluster of the source node. Must match <see cref="TargetClusterId"/> (decision #82).</param>
/// <param name="TargetClusterId">Cluster of the target node.</param>
/// <param name="SourceAreaId">Source area id for <see cref="UnsMoveKind.AreaRename"/>.</param>
/// <param name="SourceLineId">Source line id for <see cref="UnsMoveKind.LineMove"/> / <see cref="UnsMoveKind.LineMerge"/>.</param>
/// <param name="TargetAreaId">Target area id for <see cref="UnsMoveKind.LineMove"/>.</param>
/// <param name="TargetLineId">Target line id for <see cref="UnsMoveKind.LineMerge"/>.</param>
/// <param name="NewName">New display name for <see cref="UnsMoveKind.AreaRename"/>.</param>
public sealed record UnsMoveOperation(
UnsMoveKind Kind,
string SourceClusterId,
string TargetClusterId,
string? SourceAreaId = null,
string? SourceLineId = null,
string? TargetAreaId = null,
string? TargetLineId = null,
string? NewName = null);
/// <summary>Snapshot of the UNS tree + counts the analyzer walks.</summary>
public sealed class UnsTreeSnapshot
{
public required long DraftGenerationId { get; init; }
public required DraftRevisionToken RevisionToken { get; init; }
public required IReadOnlyList<UnsAreaSummary> Areas { get; init; }
public required IReadOnlyList<UnsLineSummary> Lines { get; init; }
public UnsAreaSummary? FindArea(string areaId) =>
Areas.FirstOrDefault(a => string.Equals(a.AreaId, areaId, StringComparison.OrdinalIgnoreCase));
public UnsLineSummary? FindLine(string lineId) =>
Lines.FirstOrDefault(l => string.Equals(l.LineId, lineId, StringComparison.OrdinalIgnoreCase));
public UnsAreaSummary? FindAreaByLineId(string lineId) =>
Areas.FirstOrDefault(a => a.LineIds.Contains(lineId, StringComparer.OrdinalIgnoreCase));
}
public sealed record UnsAreaSummary(string AreaId, string Name, IReadOnlyList<string> LineIds);
public sealed record UnsLineSummary(string LineId, string Name, int EquipmentCount, int TagCount);
/// <summary>
/// Opaque per-draft revision fingerprint. Preview fetches the current token + stores it
/// in the <see cref="UnsImpactPreview.RevisionToken"/>. Confirm compares the token against
/// the draft's live value; mismatch means another operator mutated the draft between
/// preview + commit — raise <c>409 Conflict / refresh-required</c> in the UI.
/// </summary>
public sealed record DraftRevisionToken(string Value)
{
/// <summary>Compare two tokens for equality; null-safe.</summary>
public bool Matches(DraftRevisionToken? other) =>
other is not null &&
string.Equals(Value, other.Value, StringComparison.Ordinal);
}
/// <summary>Output of <see cref="UnsImpactAnalyzer.Analyze"/>.</summary>
public sealed class UnsImpactPreview
{
public required int AffectedEquipmentCount { get; init; }
public required int AffectedTagCount { get; init; }
public required IReadOnlyList<string> CascadeWarnings { get; init; }
public required DraftRevisionToken RevisionToken { get; init; }
public required string HumanReadableSummary { get; init; }
}
/// <summary>Thrown when a move targets a different cluster than the source (decision #82).</summary>
public sealed class CrossClusterMoveRejectedException(string message) : Exception(message);
/// <summary>Thrown when the move operation references a source / target that doesn't exist in the draft.</summary>
public sealed class UnsMoveValidationException(string message) : Exception(message);

View File

@@ -1,3 +1,5 @@
using System.Security.Cryptography;
using System.Text;
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
@@ -47,4 +49,132 @@ public sealed class UnsService(OtOpcUaConfigDbContext db)
await db.SaveChangesAsync(ct);
return line;
}
/// <summary>
/// Build the full UNS tree snapshot for the analyzer. Walks areas + lines in the draft
/// and counts equipment + tags per line. Returns the snapshot plus a deterministic
/// revision token computed by SHA-256'ing the sorted (kind, id, parent, name) tuples —
/// stable across processes + changes whenever any row is added / modified / deleted.
/// </summary>
public async Task<UnsTreeSnapshot> LoadSnapshotAsync(long generationId, CancellationToken ct)
{
var areas = await db.UnsAreas.AsNoTracking()
.Where(a => a.GenerationId == generationId)
.OrderBy(a => a.UnsAreaId)
.ToListAsync(ct);
var lines = await db.UnsLines.AsNoTracking()
.Where(l => l.GenerationId == generationId)
.OrderBy(l => l.UnsLineId)
.ToListAsync(ct);
var equipmentCounts = await db.Equipment.AsNoTracking()
.Where(e => e.GenerationId == generationId)
.GroupBy(e => e.UnsLineId)
.Select(g => new { LineId = g.Key, Count = g.Count() })
.ToListAsync(ct);
var equipmentByLine = equipmentCounts.ToDictionary(x => x.LineId, x => x.Count, StringComparer.OrdinalIgnoreCase);
var lineSummaries = lines.Select(l =>
new UnsLineSummary(
LineId: l.UnsLineId,
Name: l.Name,
EquipmentCount: equipmentByLine.GetValueOrDefault(l.UnsLineId),
TagCount: 0)).ToList();
var areaSummaries = areas.Select(a =>
new UnsAreaSummary(
AreaId: a.UnsAreaId,
Name: a.Name,
LineIds: lines.Where(l => string.Equals(l.UnsAreaId, a.UnsAreaId, StringComparison.OrdinalIgnoreCase))
.Select(l => l.UnsLineId).ToList())).ToList();
return new UnsTreeSnapshot
{
DraftGenerationId = generationId,
RevisionToken = ComputeRevisionToken(areas, lines),
Areas = areaSummaries,
Lines = lineSummaries,
};
}
/// <summary>
/// Atomic re-parent of a line to a new area inside the same draft. The caller must pass
/// the revision token it observed at preview time — a mismatch raises
/// <see cref="DraftRevisionConflictException"/> so the UI can show the 409 concurrent-edit
/// modal instead of silently overwriting a peer's work.
/// </summary>
public async Task MoveLineAsync(
long generationId,
DraftRevisionToken expected,
string lineId,
string targetAreaId,
CancellationToken ct)
{
ArgumentNullException.ThrowIfNull(expected);
ArgumentException.ThrowIfNullOrWhiteSpace(lineId);
ArgumentException.ThrowIfNullOrWhiteSpace(targetAreaId);
var supportsTx = db.Database.IsRelational();
Microsoft.EntityFrameworkCore.Storage.IDbContextTransaction? tx = null;
if (supportsTx) tx = await db.Database.BeginTransactionAsync(ct).ConfigureAwait(false);
try
{
var areas = await db.UnsAreas
.Where(a => a.GenerationId == generationId)
.OrderBy(a => a.UnsAreaId)
.ToListAsync(ct);
var lines = await db.UnsLines
.Where(l => l.GenerationId == generationId)
.OrderBy(l => l.UnsLineId)
.ToListAsync(ct);
var current = ComputeRevisionToken(areas, lines);
if (!current.Matches(expected))
throw new DraftRevisionConflictException(
$"Draft {generationId} changed since preview. Expected revision {expected.Value}, saw {current.Value}. " +
"Refresh + redo the move.");
var line = lines.FirstOrDefault(l => string.Equals(l.UnsLineId, lineId, StringComparison.OrdinalIgnoreCase))
?? throw new InvalidOperationException($"Line '{lineId}' not found in draft {generationId}.");
if (!areas.Any(a => string.Equals(a.UnsAreaId, targetAreaId, StringComparison.OrdinalIgnoreCase)))
throw new InvalidOperationException($"Target area '{targetAreaId}' not found in draft {generationId}.");
if (string.Equals(line.UnsAreaId, targetAreaId, StringComparison.OrdinalIgnoreCase))
return; // no-op drop — same area
line.UnsAreaId = targetAreaId;
await db.SaveChangesAsync(ct);
if (tx is not null) await tx.CommitAsync(ct).ConfigureAwait(false);
}
catch
{
if (tx is not null) await tx.RollbackAsync(ct).ConfigureAwait(false);
throw;
}
finally
{
if (tx is not null) await tx.DisposeAsync().ConfigureAwait(false);
}
}
private static DraftRevisionToken ComputeRevisionToken(IReadOnlyList<UnsArea> areas, IReadOnlyList<UnsLine> lines)
{
var sb = new StringBuilder(capacity: 256 + (areas.Count + lines.Count) * 80);
foreach (var a in areas.OrderBy(a => a.UnsAreaId, StringComparer.Ordinal))
sb.Append("A:").Append(a.UnsAreaId).Append('|').Append(a.Name).Append('|').Append(a.Notes ?? "").Append(';');
foreach (var l in lines.OrderBy(l => l.UnsLineId, StringComparer.Ordinal))
sb.Append("L:").Append(l.UnsLineId).Append('|').Append(l.UnsAreaId).Append('|').Append(l.Name).Append('|').Append(l.Notes ?? "").Append(';');
var hash = SHA256.HashData(Encoding.UTF8.GetBytes(sb.ToString()));
return new DraftRevisionToken(Convert.ToHexStringLower(hash)[..16]);
}
}
/// <summary>Thrown when a UNS move's expected revision token no longer matches the live draft
/// — another operator mutated the draft between preview + commit. Caller surfaces a 409-style
/// "refresh required" modal in the Admin UI.</summary>
public sealed class DraftRevisionConflictException(string message) : Exception(message);

View File

@@ -0,0 +1,117 @@
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Admin.Services;
/// <summary>
/// Draft-aware write surface over <see cref="NodeAcl"/>. Replaces direct
/// <see cref="NodeAclService"/> CRUD for Admin UI grant authoring; the raw service stays
/// as the read / delete surface. Enforces the invariants listed in Phase 6.2 Stream D.2:
/// scope-uniqueness per (LdapGroup, ScopeKind, ScopeId, GenerationId), grant shape
/// consistency, and no empty permission masks.
/// </summary>
/// <remarks>
/// <para>Per decision #129 grants are additive — <see cref="NodePermissions.None"/> is
/// rejected at write time. Explicit Deny is v2.1 and is not representable in the current
/// <c>NodeAcl</c> row; attempts to express it (e.g. empty permission set) surface as
/// <see cref="InvalidNodeAclGrantException"/>.</para>
///
/// <para>Draft scope: writes always target an unpublished (Draft-state) generation id.
/// Once a generation publishes, its rows are frozen.</para>
/// </remarks>
public sealed class ValidatedNodeAclAuthoringService(OtOpcUaConfigDbContext db)
{
/// <summary>Add a new grant row to the given draft generation.</summary>
public async Task<NodeAcl> GrantAsync(
long draftGenerationId,
string clusterId,
string ldapGroup,
NodeAclScopeKind scopeKind,
string? scopeId,
NodePermissions permissions,
string? notes,
CancellationToken cancellationToken)
{
ArgumentException.ThrowIfNullOrWhiteSpace(clusterId);
ArgumentException.ThrowIfNullOrWhiteSpace(ldapGroup);
ValidateGrantShape(scopeKind, scopeId, permissions);
await EnsureNoDuplicate(draftGenerationId, clusterId, ldapGroup, scopeKind, scopeId, cancellationToken).ConfigureAwait(false);
var row = new NodeAcl
{
GenerationId = draftGenerationId,
NodeAclId = $"acl-{Guid.NewGuid():N}"[..20],
ClusterId = clusterId,
LdapGroup = ldapGroup,
ScopeKind = scopeKind,
ScopeId = scopeId,
PermissionFlags = permissions,
Notes = notes,
};
db.NodeAcls.Add(row);
await db.SaveChangesAsync(cancellationToken).ConfigureAwait(false);
return row;
}
/// <summary>
/// Replace an existing grant's permission set in place. Validates the new shape;
/// rejects attempts to blank-out to None (that's a Revoke via <see cref="NodeAclService"/>).
/// </summary>
public async Task<NodeAcl> UpdatePermissionsAsync(
Guid nodeAclRowId,
NodePermissions newPermissions,
string? notes,
CancellationToken cancellationToken)
{
if (newPermissions == NodePermissions.None)
throw new InvalidNodeAclGrantException(
"Permission set cannot be None — revoke the row instead of writing an empty grant.");
var row = await db.NodeAcls.FirstOrDefaultAsync(a => a.NodeAclRowId == nodeAclRowId, cancellationToken).ConfigureAwait(false)
?? throw new InvalidNodeAclGrantException($"NodeAcl row {nodeAclRowId} not found.");
row.PermissionFlags = newPermissions;
if (notes is not null) row.Notes = notes;
await db.SaveChangesAsync(cancellationToken).ConfigureAwait(false);
return row;
}
private static void ValidateGrantShape(NodeAclScopeKind scopeKind, string? scopeId, NodePermissions permissions)
{
if (permissions == NodePermissions.None)
throw new InvalidNodeAclGrantException(
"Permission set cannot be None — grants must carry at least one flag (decision #129, additive only).");
if (scopeKind == NodeAclScopeKind.Cluster && !string.IsNullOrEmpty(scopeId))
throw new InvalidNodeAclGrantException(
"Cluster-scope grants must have null ScopeId. ScopeId only applies to sub-cluster scopes.");
if (scopeKind != NodeAclScopeKind.Cluster && string.IsNullOrEmpty(scopeId))
throw new InvalidNodeAclGrantException(
$"ScopeKind={scopeKind} requires a populated ScopeId.");
}
private async Task EnsureNoDuplicate(
long generationId, string clusterId, string ldapGroup, NodeAclScopeKind scopeKind, string? scopeId,
CancellationToken cancellationToken)
{
var exists = await db.NodeAcls.AsNoTracking()
.AnyAsync(a => a.GenerationId == generationId
&& a.ClusterId == clusterId
&& a.LdapGroup == ldapGroup
&& a.ScopeKind == scopeKind
&& a.ScopeId == scopeId,
cancellationToken).ConfigureAwait(false);
if (exists)
throw new InvalidNodeAclGrantException(
$"A grant for (LdapGroup={ldapGroup}, ScopeKind={scopeKind}, ScopeId={scopeId}) already exists in generation {generationId}. " +
"Update the existing row's permissions instead of inserting a duplicate.");
}
}
/// <summary>Thrown when a <see cref="NodeAcl"/> grant authoring request violates an invariant.</summary>
public sealed class InvalidNodeAclGrantException(string message) : Exception(message);

View File

@@ -16,10 +16,13 @@
<PackageReference Include="Novell.Directory.Ldap.NETStandard" Version="3.6.0"/>
<PackageReference Include="Microsoft.AspNetCore.SignalR.Client" Version="10.0.0"/>
<PackageReference Include="Serilog.AspNetCore" Version="9.0.0"/>
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.15.2"/>
<PackageReference Include="OpenTelemetry.Exporter.Prometheus.AspNetCore" Version="1.15.2-beta.1"/>
</ItemGroup>
<ItemGroup>
<ProjectReference Include="..\ZB.MOM.WW.OtOpcUa.Configuration\ZB.MOM.WW.OtOpcUa.Configuration.csproj"/>
<ProjectReference Include="..\ZB.MOM.WW.OtOpcUa.Core\ZB.MOM.WW.OtOpcUa.Core.csproj"/>
</ItemGroup>
<ItemGroup>

View File

@@ -23,5 +23,10 @@
},
"Serilog": {
"MinimumLevel": "Information"
},
"Metrics": {
"Prometheus": {
"Enabled": true
}
}
}

View File

@@ -0,0 +1,10 @@
; Shipped analyzer releases.
; See https://github.com/dotnet/roslyn-analyzers/blob/main/src/Microsoft.CodeAnalysis.Analyzers/ReleaseTrackingAnalyzers.Help.md
## Release 1.0
### New Rules
Rule ID | Category | Severity | Notes
--------|----------|----------|-------
OTOPCUA0001 | OtOpcUa.Resilience | Warning | Direct driver-capability call bypasses CapabilityInvoker

View File

@@ -0,0 +1,2 @@
; Unshipped analyzer release.
; See https://github.com/dotnet/roslyn-analyzers/blob/main/src/Microsoft.CodeAnalysis.Analyzers/ReleaseTrackingAnalyzers.Help.md

View File

@@ -0,0 +1,143 @@
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Diagnostics;
using Microsoft.CodeAnalysis.Operations;
namespace ZB.MOM.WW.OtOpcUa.Analyzers;
/// <summary>
/// Diagnostic analyzer that flags direct invocations of Phase 6.1-wrapped driver-capability
/// methods when the call is NOT already running inside a <c>CapabilityInvoker.ExecuteAsync</c>,
/// <c>CapabilityInvoker.ExecuteWriteAsync</c>, or <c>AlarmSurfaceInvoker.*Async</c> lambda.
/// The wrapping is what gives us per-host breaker isolation, retry semantics, bulkhead-depth
/// accounting, and alarm-ack idempotence guards — raw calls bypass all of that.
/// </summary>
/// <remarks>
/// The analyzer matches by receiver-interface identity using Roslyn's semantic model, not by
/// method name, so a driver with an unusually-named method implementing <c>IReadable.ReadAsync</c>
/// still trips the rule. Lambda-context detection walks up the syntax tree from the call site
/// + checks whether any enclosing <c>InvocationExpressionSyntax</c> targets a member whose
/// containing type is <c>CapabilityInvoker</c> or <c>AlarmSurfaceInvoker</c>. The rule is
/// intentionally narrow: it does NOT try to enforce the capability argument matches the
/// method (e.g. ReadAsync wrapped in <c>ExecuteAsync(DriverCapability.Write, ...)</c> still
/// passes) — that'd require flow analysis beyond single-expression scope.
/// </remarks>
[DiagnosticAnalyzer(Microsoft.CodeAnalysis.LanguageNames.CSharp)]
public sealed class UnwrappedCapabilityCallAnalyzer : DiagnosticAnalyzer
{
public const string DiagnosticId = "OTOPCUA0001";
/// <summary>Interfaces whose methods must be called through the capability invoker.</summary>
private static readonly string[] GuardedInterfaces =
[
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.IReadable",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.IWritable",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.ITagDiscovery",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.ISubscribable",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.IHostConnectivityProbe",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.IAlarmSource",
"ZB.MOM.WW.OtOpcUa.Core.Abstractions.IHistoryProvider",
];
/// <summary>Wrapper types whose lambda arguments are the allowed home for guarded calls.</summary>
private static readonly string[] WrapperTypes =
[
"ZB.MOM.WW.OtOpcUa.Core.Resilience.CapabilityInvoker",
"ZB.MOM.WW.OtOpcUa.Core.Resilience.AlarmSurfaceInvoker",
];
private static readonly DiagnosticDescriptor Rule = new(
id: DiagnosticId,
title: "Driver capability call must be wrapped in CapabilityInvoker",
messageFormat: "Call to '{0}' is not wrapped in CapabilityInvoker.ExecuteAsync / ExecuteWriteAsync / AlarmSurfaceInvoker.*. Without the wrapping, Phase 6.1 resilience (retry, breaker, bulkhead, tracker telemetry) is bypassed for this call.",
category: "OtOpcUa.Resilience",
defaultSeverity: DiagnosticSeverity.Warning,
isEnabledByDefault: true,
description: "Phase 6.1 Stream A requires every IReadable/IWritable/ITagDiscovery/ISubscribable/IHostConnectivityProbe/IAlarmSource/IHistoryProvider call to route through the shared Polly pipeline. Direct calls skip the pipeline + lose per-host isolation, retry semantics, and telemetry. If the caller is Core/Server/Driver dispatch code, wrap the call in CapabilityInvoker.ExecuteAsync. If the caller is a unit test invoking the driver directly to test its wire-level behavior, either suppress with a pragma or move the suppression into a NoWarn for the test project.");
public override ImmutableArray<DiagnosticDescriptor> SupportedDiagnostics { get; } = ImmutableArray.Create(Rule);
public override void Initialize(AnalysisContext context)
{
context.ConfigureGeneratedCodeAnalysis(GeneratedCodeAnalysisFlags.None);
context.EnableConcurrentExecution();
context.RegisterOperationAction(AnalyzeInvocation, OperationKind.Invocation);
}
private static void AnalyzeInvocation(OperationAnalysisContext context)
{
var invocation = (Microsoft.CodeAnalysis.Operations.IInvocationOperation)context.Operation;
var method = invocation.TargetMethod;
// Narrow the rule to async wire calls. Synchronous accessors like
// IHostConnectivityProbe.GetHostStatuses() are pure in-memory snapshots + would never
// benefit from the Polly pipeline; flagging them just creates false-positives.
if (!IsAsyncReturningType(method.ReturnType)) return;
if (!ImplementsGuardedInterface(method)) return;
if (IsInsideWrapperLambda(invocation.Syntax, context.Operation.SemanticModel, context.CancellationToken)) return;
var diag = Diagnostic.Create(Rule, invocation.Syntax.GetLocation(), $"{method.ContainingType.Name}.{method.Name}");
context.ReportDiagnostic(diag);
}
private static bool IsAsyncReturningType(ITypeSymbol type)
{
var name = type.OriginalDefinition.ToDisplayString(SymbolDisplayFormat.FullyQualifiedFormat);
return name is "global::System.Threading.Tasks.Task"
or "global::System.Threading.Tasks.Task<TResult>"
or "global::System.Threading.Tasks.ValueTask"
or "global::System.Threading.Tasks.ValueTask<TResult>";
}
private static bool ImplementsGuardedInterface(IMethodSymbol method)
{
foreach (var iface in method.ContainingType.AllInterfaces.Concat(new[] { method.ContainingType }))
{
var ifaceFqn = iface.OriginalDefinition.ToDisplayString(SymbolDisplayFormat.FullyQualifiedFormat)
.Replace("global::", string.Empty);
if (!GuardedInterfaces.Contains(ifaceFqn)) continue;
foreach (var member in iface.GetMembers().OfType<IMethodSymbol>())
{
var impl = method.ContainingType.FindImplementationForInterfaceMember(member);
if (SymbolEqualityComparer.Default.Equals(impl, method) ||
SymbolEqualityComparer.Default.Equals(method.OriginalDefinition, member))
return true;
}
}
return false;
}
private static bool IsInsideWrapperLambda(SyntaxNode startNode, SemanticModel? semanticModel, System.Threading.CancellationToken ct)
{
if (semanticModel is null) return false;
for (var node = startNode.Parent; node is not null; node = node.Parent)
{
// We only care about an enclosing invocation — the call we're auditing must literally
// live inside a lambda (ParenthesizedLambda / SimpleLambda / AnonymousMethod) that is
// an argument of a CapabilityInvoker.Execute* / AlarmSurfaceInvoker.* call.
if (node is not InvocationExpressionSyntax outer) continue;
var sym = semanticModel.GetSymbolInfo(outer, ct).Symbol as IMethodSymbol;
if (sym is null) continue;
var outerTypeFqn = sym.ContainingType.OriginalDefinition.ToDisplayString(SymbolDisplayFormat.FullyQualifiedFormat)
.Replace("global::", string.Empty);
if (!WrapperTypes.Contains(outerTypeFqn)) continue;
// The call is wrapped IFF our startNode is transitively inside one of the outer
// call's argument lambdas. Walk the outer invocation's argument list + check whether
// any lambda body contains the startNode's position.
foreach (var arg in outer.ArgumentList.Arguments)
{
if (arg.Expression is not AnonymousFunctionExpressionSyntax lambda) continue;
if (lambda.Span.Contains(startNode.Span)) return true;
}
}
return false;
}
}

View File

@@ -0,0 +1,24 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<!-- Roslyn analyzers ship as netstandard2.0 so they load into the MSBuild compiler host
(which on .NET Framework 4.7.2 and .NET 6+ equally resolves netstandard2.0). -->
<TargetFramework>netstandard2.0</TargetFramework>
<Nullable>enable</Nullable>
<LangVersion>latest</LangVersion>
<IsPackable>false</IsPackable>
<IsRoslynComponent>true</IsRoslynComponent>
<EnforceExtendedAnalyzerRules>true</EnforceExtendedAnalyzerRules>
<RootNamespace>ZB.MOM.WW.OtOpcUa.Analyzers</RootNamespace>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="Microsoft.CodeAnalysis.CSharp" Version="5.3.0" PrivateAssets="all"/>
</ItemGroup>
<ItemGroup>
<AdditionalFiles Include="AnalyzerReleases.Shipped.md"/>
<AdditionalFiles Include="AnalyzerReleases.Unshipped.md"/>
</ItemGroup>
</Project>

View File

@@ -27,6 +27,24 @@ public sealed class DriverInstance
/// <summary>Schemaless per-driver-type JSON config. Validated against registered JSON schema at draft-publish time (decision #91).</summary>
public required string DriverConfig { get; set; }
/// <summary>
/// Optional per-instance overrides for the Phase 6.1 shared Polly resilience pipeline.
/// Null = use the driver's tier defaults (decision #143). When populated, expected shape:
/// <code>
/// {
/// "bulkheadMaxConcurrent": 16,
/// "bulkheadMaxQueue": 64,
/// "capabilityPolicies": {
/// "Read": { "timeoutSeconds": 5, "retryCount": 5, "breakerFailureThreshold": 3 },
/// "Write": { "timeoutSeconds": 5, "retryCount": 0, "breakerFailureThreshold": 5 }
/// }
/// }
/// </code>
/// Parsed at startup by <c>DriverResilienceOptionsParser</c>; every key is optional +
/// unrecognised keys are ignored so future shapes land without a migration.
/// </summary>
public string? ResilienceConfig { get; set; }
public ConfigGeneration? Generation { get; set; }
public ServerCluster? Cluster { get; set; }
}

View File

@@ -0,0 +1,44 @@
namespace ZB.MOM.WW.OtOpcUa.Configuration.Entities;
/// <summary>
/// Runtime resilience counters the CapabilityInvoker + MemoryTracking + MemoryRecycle
/// surfaces for each <c>(DriverInstanceId, HostName)</c> pair. Separate from
/// <see cref="DriverHostStatus"/> (which owns per-host <i>connectivity</i> state) so a
/// host that's Running but has tripped its breaker or is approaching its memory ceiling
/// shows up distinctly on Admin <c>/hosts</c>.
/// </summary>
/// <remarks>
/// Per <c>docs/v2/implementation/phase-6-1-resilience-and-observability.md</c> §Stream E.1.
/// The Admin UI left-joins this table on DriverHostStatus for display; rows are written
/// by the runtime via a HostedService that samples the tracker at a configurable
/// interval (default 5 s) — writes are non-critical, a missed sample is tolerated.
/// </remarks>
public sealed class DriverInstanceResilienceStatus
{
public required string DriverInstanceId { get; set; }
public required string HostName { get; set; }
/// <summary>Most recent time the circuit breaker for this (instance, host) opened; null if never.</summary>
public DateTime? LastCircuitBreakerOpenUtc { get; set; }
/// <summary>Rolling count of consecutive Polly pipeline failures for this (instance, host).</summary>
public int ConsecutiveFailures { get; set; }
/// <summary>Current Polly bulkhead depth (in-flight calls) for this (instance, host).</summary>
public int CurrentBulkheadDepth { get; set; }
/// <summary>Most recent process recycle time (Tier C only; null for in-process tiers).</summary>
public DateTime? LastRecycleUtc { get; set; }
/// <summary>
/// Post-init memory baseline captured by <c>MemoryTracking</c> (median of first
/// BaselineWindow samples). Zero while still warming up.
/// </summary>
public long BaselineFootprintBytes { get; set; }
/// <summary>Most recent footprint sample the tracker saw (steady-state read).</summary>
public long CurrentFootprintBytes { get; set; }
/// <summary>Row last-write timestamp — advances on every sampling tick.</summary>
public DateTime LastSampledUtc { get; set; }
}

View File

@@ -0,0 +1,68 @@
namespace ZB.MOM.WW.OtOpcUa.Configuration.Entities;
/// <summary>
/// Staged equipment-import batch per Phase 6.4 Stream B.2. Rows land in the child
/// <see cref="EquipmentImportRow"/> table under a batch header; operator reviews + either
/// drops (via <c>DropImportBatch</c>) or finalises (via <c>FinaliseImportBatch</c>) in one
/// bounded transaction. The live <c>Equipment</c> table never sees partial state.
/// </summary>
/// <remarks>
/// <para>User-scoped visibility: the preview modal only shows batches where
/// <see cref="CreatedBy"/> equals the current operator. Prevents accidental
/// cross-operator finalise during concurrent imports. An admin finalise / drop surface
/// can override this — tracked alongside the UI follow-up.</para>
///
/// <para><see cref="FinalisedAtUtc"/> stamps the moment the batch promoted from staging
/// into <c>Equipment</c>. Null = still in staging; non-null = archived / finalised.</para>
/// </remarks>
public sealed class EquipmentImportBatch
{
public Guid Id { get; set; }
public required string ClusterId { get; set; }
public required string CreatedBy { get; set; }
public DateTime CreatedAtUtc { get; set; }
public int RowsStaged { get; set; }
public int RowsAccepted { get; set; }
public int RowsRejected { get; set; }
public DateTime? FinalisedAtUtc { get; set; }
public ICollection<EquipmentImportRow> Rows { get; set; } = [];
}
/// <summary>
/// One staged row under an <see cref="EquipmentImportBatch"/>. Mirrors the decision #117
/// + decision #139 columns from the CSV importer's output + an
/// <see cref="IsAccepted"/> flag + a <see cref="RejectReason"/> string the preview modal
/// renders.
/// </summary>
public sealed class EquipmentImportRow
{
public Guid Id { get; set; }
public Guid BatchId { get; set; }
public int LineNumberInFile { get; set; }
public bool IsAccepted { get; set; }
public string? RejectReason { get; set; }
// Required (decision #117)
public required string ZTag { get; set; }
public required string MachineCode { get; set; }
public required string SAPID { get; set; }
public required string EquipmentId { get; set; }
public required string EquipmentUuid { get; set; }
public required string Name { get; set; }
public required string UnsAreaName { get; set; }
public required string UnsLineName { get; set; }
// Optional (decision #139 — OPC 40010 Identification)
public string? Manufacturer { get; set; }
public string? Model { get; set; }
public string? SerialNumber { get; set; }
public string? HardwareRevision { get; set; }
public string? SoftwareRevision { get; set; }
public string? YearOfConstruction { get; set; }
public string? AssetLocation { get; set; }
public string? ManufacturerUri { get; set; }
public string? DeviceManualUri { get; set; }
public EquipmentImportBatch? Batch { get; set; }
}

View File

@@ -0,0 +1,56 @@
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Configuration.Entities;
/// <summary>
/// Maps an LDAP group to an <see cref="AdminRole"/> for Admin UI access. Optionally scoped
/// to one <see cref="ClusterId"/>; when <see cref="IsSystemWide"/> is true, the grant
/// applies fleet-wide.
/// </summary>
/// <remarks>
/// <para>Per <c>docs/v2/plan.md</c> decisions #105 and #150 — this entity is <b>control-plane
/// only</b>. The OPC UA data-path evaluator does not read these rows; it reads
/// <see cref="NodeAcl"/> joined directly against the session's resolved LDAP group
/// memberships. Collapsing the two would let a user inherit tag permissions via an
/// admin-role claim path never intended as a data-path grant.</para>
///
/// <para>Uniqueness: <c>(LdapGroup, ClusterId)</c> — the same LDAP group may hold
/// different roles on different clusters, but only one row per cluster. A system-wide row
/// (<c>IsSystemWide = true</c>, <c>ClusterId = null</c>) stacks additively with any
/// cluster-scoped rows for the same group.</para>
/// </remarks>
public sealed class LdapGroupRoleMapping
{
/// <summary>Surrogate primary key.</summary>
public Guid Id { get; set; }
/// <summary>
/// LDAP group DN the membership query returns (e.g. <c>cn=fleet-admin,ou=groups,dc=corp,dc=example</c>).
/// Comparison is case-insensitive per LDAP conventions.
/// </summary>
public required string LdapGroup { get; set; }
/// <summary>Admin role this group grants.</summary>
public required AdminRole Role { get; set; }
/// <summary>
/// Cluster the grant applies to; <c>null</c> when <see cref="IsSystemWide"/> is true.
/// Foreign key to <see cref="ServerCluster.ClusterId"/>.
/// </summary>
public string? ClusterId { get; set; }
/// <summary>
/// <c>true</c> = grant applies across every cluster in the fleet; <c>ClusterId</c> must be null.
/// <c>false</c> = grant is cluster-scoped; <c>ClusterId</c> must be populated.
/// </summary>
public required bool IsSystemWide { get; set; }
/// <summary>Row creation timestamp (UTC).</summary>
public DateTime CreatedAtUtc { get; set; }
/// <summary>Optional human-readable note (e.g. "added 2026-04-19 for Warsaw fleet admin handoff").</summary>
public string? Notes { get; set; }
/// <summary>Navigation for EF core when the row is cluster-scoped.</summary>
public ServerCluster? Cluster { get; set; }
}

View File

@@ -0,0 +1,26 @@
namespace ZB.MOM.WW.OtOpcUa.Configuration.Enums;
/// <summary>
/// Admin UI roles per <c>admin-ui.md</c> §"Admin Roles" and Phase 6.2 Stream A.
/// These govern Admin UI capabilities (cluster CRUD, draft → publish, fleet-wide admin
/// actions) — they do NOT govern OPC UA data-path authorization, which reads
/// <see cref="Entities.NodeAcl"/> joined against LDAP group memberships directly.
/// </summary>
/// <remarks>
/// Per <c>docs/v2/plan.md</c> decision #150 the two concerns share zero runtime code path:
/// the control plane (Admin UI) consumes <see cref="Entities.LdapGroupRoleMapping"/>; the
/// data plane consumes <see cref="Entities.NodeAcl"/> rows directly. Having them in one
/// table would collapse the distinction + let a user inherit tag permissions via their
/// admin-role claim path.
/// </remarks>
public enum AdminRole
{
/// <summary>Read-only Admin UI access — can view cluster state, drafts, publish history.</summary>
ConfigViewer,
/// <summary>Can author drafts + submit for publish.</summary>
ConfigEditor,
/// <summary>Full Admin UI privileges including publish + fleet-admin actions.</summary>
FleetAdmin,
}

View File

@@ -0,0 +1,170 @@
using LiteDB;
namespace ZB.MOM.WW.OtOpcUa.Configuration.LocalCache;
/// <summary>
/// Generation-sealed LiteDB cache per <c>docs/v2/plan.md</c> decision #148 and Phase 6.1
/// Stream D.1. Each published generation writes one <b>read-only</b> LiteDB file under
/// <c>&lt;cache-root&gt;/&lt;clusterId&gt;/&lt;generationId&gt;.db</c>. A per-cluster
/// <c>CURRENT</c> text file holds the currently-active generation id; it is updated
/// atomically (temp file + <see cref="File.Replace(string, string, string?)"/>) only after
/// the sealed file is fully written.
/// </summary>
/// <remarks>
/// <para>Mixed-generation reads are impossible: any read opens the single file pointed to
/// by <c>CURRENT</c>, which is a coherent snapshot. Corruption of the CURRENT file or the
/// sealed file surfaces as <see cref="GenerationCacheUnavailableException"/> — the reader
/// fails closed rather than silently falling back to an older generation. Recovery path
/// is to re-fetch from the central DB (and the Phase 6.1 Stream C <c>UsingStaleConfig</c>
/// flag goes true until that succeeds).</para>
///
/// <para>This cache is the read-path fallback when the central DB is unreachable. The
/// write path (draft edits, publish) bypasses the cache and fails hard on DB outage per
/// Stream D.2 — inconsistent writes are worse than a temporary inability to edit.</para>
/// </remarks>
public sealed class GenerationSealedCache
{
private const string CollectionName = "generation";
private const string CurrentPointerFileName = "CURRENT";
private readonly string _cacheRoot;
/// <summary>Root directory for all clusters' sealed caches.</summary>
public string CacheRoot => _cacheRoot;
public GenerationSealedCache(string cacheRoot)
{
ArgumentException.ThrowIfNullOrWhiteSpace(cacheRoot);
_cacheRoot = cacheRoot;
Directory.CreateDirectory(_cacheRoot);
}
/// <summary>
/// Seal a generation: write the snapshot to <c>&lt;cluster&gt;/&lt;generationId&gt;.db</c>,
/// mark the file read-only, then atomically publish the <c>CURRENT</c> pointer. Existing
/// sealed files for prior generations are preserved (prune separately).
/// </summary>
public async Task SealAsync(GenerationSnapshot snapshot, CancellationToken ct = default)
{
ArgumentNullException.ThrowIfNull(snapshot);
ct.ThrowIfCancellationRequested();
var clusterDir = Path.Combine(_cacheRoot, snapshot.ClusterId);
Directory.CreateDirectory(clusterDir);
var sealedPath = Path.Combine(clusterDir, $"{snapshot.GenerationId}.db");
if (File.Exists(sealedPath))
{
// Already sealed — idempotent. Treat as no-op + update pointer in case an earlier
// seal succeeded but the pointer update failed (crash recovery).
WritePointerAtomically(clusterDir, snapshot.GenerationId);
return;
}
var tmpPath = sealedPath + ".tmp";
try
{
using (var db = new LiteDatabase(new ConnectionString { Filename = tmpPath, Upgrade = false }))
{
var col = db.GetCollection<GenerationSnapshot>(CollectionName);
col.Insert(snapshot);
}
File.Move(tmpPath, sealedPath);
File.SetAttributes(sealedPath, File.GetAttributes(sealedPath) | FileAttributes.ReadOnly);
WritePointerAtomically(clusterDir, snapshot.GenerationId);
}
catch
{
try { if (File.Exists(tmpPath)) File.Delete(tmpPath); } catch { /* best-effort */ }
throw;
}
await Task.CompletedTask;
}
/// <summary>
/// Read the current sealed snapshot for <paramref name="clusterId"/>. Throws
/// <see cref="GenerationCacheUnavailableException"/> when the pointer is missing
/// (first-boot-no-snapshot case) or when the sealed file is corrupt. Never silently
/// falls back to a prior generation.
/// </summary>
public Task<GenerationSnapshot> ReadCurrentAsync(string clusterId, CancellationToken ct = default)
{
ArgumentException.ThrowIfNullOrWhiteSpace(clusterId);
ct.ThrowIfCancellationRequested();
var clusterDir = Path.Combine(_cacheRoot, clusterId);
var pointerPath = Path.Combine(clusterDir, CurrentPointerFileName);
if (!File.Exists(pointerPath))
throw new GenerationCacheUnavailableException(
$"No sealed generation for cluster '{clusterId}' at '{clusterDir}'. First-boot case: the central DB must be reachable at least once before cache fallback is possible.");
long generationId;
try
{
var text = File.ReadAllText(pointerPath).Trim();
generationId = long.Parse(text, System.Globalization.CultureInfo.InvariantCulture);
}
catch (Exception ex)
{
throw new GenerationCacheUnavailableException(
$"CURRENT pointer at '{pointerPath}' is corrupt or unreadable.", ex);
}
var sealedPath = Path.Combine(clusterDir, $"{generationId}.db");
if (!File.Exists(sealedPath))
throw new GenerationCacheUnavailableException(
$"CURRENT points at generation {generationId} but '{sealedPath}' is missing — fails closed rather than serving an older generation.");
try
{
using var db = new LiteDatabase(new ConnectionString { Filename = sealedPath, ReadOnly = true });
var col = db.GetCollection<GenerationSnapshot>(CollectionName);
var snapshot = col.FindAll().FirstOrDefault()
?? throw new GenerationCacheUnavailableException(
$"Sealed file '{sealedPath}' contains no snapshot row — file is corrupt.");
return Task.FromResult(snapshot);
}
catch (GenerationCacheUnavailableException) { throw; }
catch (Exception ex) when (ex is LiteException or InvalidDataException or IOException
or NotSupportedException or FormatException)
{
throw new GenerationCacheUnavailableException(
$"Sealed file '{sealedPath}' is corrupt or unreadable — fails closed rather than falling back to an older generation.", ex);
}
}
/// <summary>Return the generation id the <c>CURRENT</c> pointer points at, or null if no pointer exists.</summary>
public long? TryGetCurrentGenerationId(string clusterId)
{
ArgumentException.ThrowIfNullOrWhiteSpace(clusterId);
var pointerPath = Path.Combine(_cacheRoot, clusterId, CurrentPointerFileName);
if (!File.Exists(pointerPath)) return null;
try
{
return long.Parse(File.ReadAllText(pointerPath).Trim(), System.Globalization.CultureInfo.InvariantCulture);
}
catch
{
return null;
}
}
private static void WritePointerAtomically(string clusterDir, long generationId)
{
var pointerPath = Path.Combine(clusterDir, CurrentPointerFileName);
var tmpPath = pointerPath + ".tmp";
File.WriteAllText(tmpPath, generationId.ToString(System.Globalization.CultureInfo.InvariantCulture));
if (File.Exists(pointerPath))
File.Replace(tmpPath, pointerPath, destinationBackupFileName: null);
else
File.Move(tmpPath, pointerPath);
}
}
/// <summary>Sealed cache is unreachable — caller must fail closed.</summary>
public sealed class GenerationCacheUnavailableException : Exception
{
public GenerationCacheUnavailableException(string message) : base(message) { }
public GenerationCacheUnavailableException(string message, Exception inner) : base(message, inner) { }
}

View File

@@ -0,0 +1,90 @@
using Microsoft.Extensions.Logging;
using Polly;
using Polly.Retry;
using Polly.Timeout;
namespace ZB.MOM.WW.OtOpcUa.Configuration.LocalCache;
/// <summary>
/// Wraps a central-DB fetch function with Phase 6.1 Stream D.2 resilience:
/// <b>timeout 2 s → retry 3× jittered → fallback to sealed cache</b>. Maintains the
/// <see cref="StaleConfigFlag"/> — fresh on central-DB success, stale on cache fallback.
/// </summary>
/// <remarks>
/// <para>Read-path only per plan. The write path (draft save, publish) bypasses this
/// wrapper entirely and fails hard on DB outage so inconsistent writes never land.</para>
///
/// <para>Fallback is triggered by <b>any exception</b> the fetch raises (central-DB
/// unreachable, SqlException, timeout). If the sealed cache also fails (no pointer,
/// corrupt file, etc.), <see cref="GenerationCacheUnavailableException"/> surfaces — caller
/// must fail the current request (InitializeAsync for a driver, etc.).</para>
/// </remarks>
public sealed class ResilientConfigReader
{
private readonly GenerationSealedCache _cache;
private readonly StaleConfigFlag _staleFlag;
private readonly ResiliencePipeline _pipeline;
private readonly ILogger<ResilientConfigReader> _logger;
public ResilientConfigReader(
GenerationSealedCache cache,
StaleConfigFlag staleFlag,
ILogger<ResilientConfigReader> logger,
TimeSpan? timeout = null,
int retryCount = 3)
{
_cache = cache;
_staleFlag = staleFlag;
_logger = logger;
var builder = new ResiliencePipelineBuilder()
.AddTimeout(new TimeoutStrategyOptions { Timeout = timeout ?? TimeSpan.FromSeconds(2) });
if (retryCount > 0)
{
builder.AddRetry(new RetryStrategyOptions
{
MaxRetryAttempts = retryCount,
BackoffType = DelayBackoffType.Exponential,
UseJitter = true,
Delay = TimeSpan.FromMilliseconds(100),
MaxDelay = TimeSpan.FromSeconds(1),
ShouldHandle = new PredicateBuilder().Handle<Exception>(ex => ex is not OperationCanceledException),
});
}
_pipeline = builder.Build();
}
/// <summary>
/// Execute <paramref name="centralFetch"/> through the resilience pipeline. On full failure
/// (post-retry), reads the sealed cache for <paramref name="clusterId"/> and passes the
/// snapshot to <paramref name="fromSnapshot"/> to extract the requested shape.
/// </summary>
public async ValueTask<T> ReadAsync<T>(
string clusterId,
Func<CancellationToken, ValueTask<T>> centralFetch,
Func<GenerationSnapshot, T> fromSnapshot,
CancellationToken cancellationToken)
{
ArgumentException.ThrowIfNullOrWhiteSpace(clusterId);
ArgumentNullException.ThrowIfNull(centralFetch);
ArgumentNullException.ThrowIfNull(fromSnapshot);
try
{
var result = await _pipeline.ExecuteAsync(centralFetch, cancellationToken).ConfigureAwait(false);
_staleFlag.MarkFresh();
return result;
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
_logger.LogWarning(ex, "Central-DB read failed after retries; falling back to sealed cache for cluster {ClusterId}", clusterId);
// GenerationCacheUnavailableException surfaces intentionally — fails the caller's
// operation. StaleConfigFlag stays unchanged; the flag only flips when we actually
// served a cache snapshot.
var snapshot = await _cache.ReadCurrentAsync(clusterId, cancellationToken).ConfigureAwait(false);
_staleFlag.MarkStale();
return fromSnapshot(snapshot);
}
}
}

View File

@@ -0,0 +1,20 @@
namespace ZB.MOM.WW.OtOpcUa.Configuration.LocalCache;
/// <summary>
/// Thread-safe <c>UsingStaleConfig</c> signal per Phase 6.1 Stream D.3. Flips true whenever
/// a read falls back to a sealed cache snapshot; flips false on the next successful central-DB
/// round-trip. Surfaced on <c>/healthz</c> body and on the Admin <c>/hosts</c> page.
/// </summary>
public sealed class StaleConfigFlag
{
private int _stale;
/// <summary>True when the last config read was served from the sealed cache, not the central DB.</summary>
public bool IsStale => Volatile.Read(ref _stale) != 0;
/// <summary>Mark the current config as stale (a read fell back to the cache).</summary>
public void MarkStale() => Volatile.Write(ref _stale, 1);
/// <summary>Mark the current config as fresh (a central-DB read succeeded).</summary>
public void MarkFresh() => Volatile.Write(ref _stale, 0);
}

View File

@@ -0,0 +1,46 @@
using System;
using Microsoft.EntityFrameworkCore.Migrations;
#nullable disable
namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
{
/// <inheritdoc />
public partial class AddDriverInstanceResilienceStatus : Migration
{
/// <inheritdoc />
protected override void Up(MigrationBuilder migrationBuilder)
{
migrationBuilder.CreateTable(
name: "DriverInstanceResilienceStatus",
columns: table => new
{
DriverInstanceId = table.Column<string>(type: "nvarchar(64)", maxLength: 64, nullable: false),
HostName = table.Column<string>(type: "nvarchar(256)", maxLength: 256, nullable: false),
LastCircuitBreakerOpenUtc = table.Column<DateTime>(type: "datetime2(3)", nullable: true),
ConsecutiveFailures = table.Column<int>(type: "int", nullable: false),
CurrentBulkheadDepth = table.Column<int>(type: "int", nullable: false),
LastRecycleUtc = table.Column<DateTime>(type: "datetime2(3)", nullable: true),
BaselineFootprintBytes = table.Column<long>(type: "bigint", nullable: false),
CurrentFootprintBytes = table.Column<long>(type: "bigint", nullable: false),
LastSampledUtc = table.Column<DateTime>(type: "datetime2(3)", nullable: false)
},
constraints: table =>
{
table.PrimaryKey("PK_DriverInstanceResilienceStatus", x => new { x.DriverInstanceId, x.HostName });
});
migrationBuilder.CreateIndex(
name: "IX_DriverResilience_LastSampled",
table: "DriverInstanceResilienceStatus",
column: "LastSampledUtc");
}
/// <inheritdoc />
protected override void Down(MigrationBuilder migrationBuilder)
{
migrationBuilder.DropTable(
name: "DriverInstanceResilienceStatus");
}
}
}

View File

@@ -0,0 +1,62 @@
using System;
using Microsoft.EntityFrameworkCore.Migrations;
#nullable disable
namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
{
/// <inheritdoc />
public partial class AddLdapGroupRoleMapping : Migration
{
/// <inheritdoc />
protected override void Up(MigrationBuilder migrationBuilder)
{
migrationBuilder.CreateTable(
name: "LdapGroupRoleMapping",
columns: table => new
{
Id = table.Column<Guid>(type: "uniqueidentifier", nullable: false),
LdapGroup = table.Column<string>(type: "nvarchar(512)", maxLength: 512, nullable: false),
Role = table.Column<string>(type: "nvarchar(32)", maxLength: 32, nullable: false),
ClusterId = table.Column<string>(type: "nvarchar(64)", maxLength: 64, nullable: true),
IsSystemWide = table.Column<bool>(type: "bit", nullable: false),
CreatedAtUtc = table.Column<DateTime>(type: "datetime2(3)", nullable: false),
Notes = table.Column<string>(type: "nvarchar(512)", maxLength: 512, nullable: true)
},
constraints: table =>
{
table.PrimaryKey("PK_LdapGroupRoleMapping", x => x.Id);
table.ForeignKey(
name: "FK_LdapGroupRoleMapping_ServerCluster_ClusterId",
column: x => x.ClusterId,
principalTable: "ServerCluster",
principalColumn: "ClusterId",
onDelete: ReferentialAction.Cascade);
});
migrationBuilder.CreateIndex(
name: "IX_LdapGroupRoleMapping_ClusterId",
table: "LdapGroupRoleMapping",
column: "ClusterId");
migrationBuilder.CreateIndex(
name: "IX_LdapGroupRoleMapping_Group",
table: "LdapGroupRoleMapping",
column: "LdapGroup");
migrationBuilder.CreateIndex(
name: "UX_LdapGroupRoleMapping_Group_Cluster",
table: "LdapGroupRoleMapping",
columns: new[] { "LdapGroup", "ClusterId" },
unique: true,
filter: "[ClusterId] IS NOT NULL");
}
/// <inheritdoc />
protected override void Down(MigrationBuilder migrationBuilder)
{
migrationBuilder.DropTable(
name: "LdapGroupRoleMapping");
}
}
}

View File

@@ -0,0 +1,37 @@
using Microsoft.EntityFrameworkCore.Migrations;
#nullable disable
namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
{
/// <inheritdoc />
public partial class AddDriverInstanceResilienceConfig : Migration
{
/// <inheritdoc />
protected override void Up(MigrationBuilder migrationBuilder)
{
migrationBuilder.AddColumn<string>(
name: "ResilienceConfig",
table: "DriverInstance",
type: "nvarchar(max)",
nullable: true);
migrationBuilder.AddCheckConstraint(
name: "CK_DriverInstance_ResilienceConfig_IsJson",
table: "DriverInstance",
sql: "ResilienceConfig IS NULL OR ISJSON(ResilienceConfig) = 1");
}
/// <inheritdoc />
protected override void Down(MigrationBuilder migrationBuilder)
{
migrationBuilder.DropCheckConstraint(
name: "CK_DriverInstance_ResilienceConfig_IsJson",
table: "DriverInstance");
migrationBuilder.DropColumn(
name: "ResilienceConfig",
table: "DriverInstance");
}
}
}

View File

@@ -0,0 +1,91 @@
using System;
using Microsoft.EntityFrameworkCore.Migrations;
#nullable disable
namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
{
/// <inheritdoc />
public partial class AddEquipmentImportBatch : Migration
{
/// <inheritdoc />
protected override void Up(MigrationBuilder migrationBuilder)
{
migrationBuilder.CreateTable(
name: "EquipmentImportBatch",
columns: table => new
{
Id = table.Column<Guid>(type: "uniqueidentifier", nullable: false),
ClusterId = table.Column<string>(type: "nvarchar(64)", maxLength: 64, nullable: false),
CreatedBy = table.Column<string>(type: "nvarchar(128)", maxLength: 128, nullable: false),
CreatedAtUtc = table.Column<DateTime>(type: "datetime2(3)", nullable: false),
RowsStaged = table.Column<int>(type: "int", nullable: false),
RowsAccepted = table.Column<int>(type: "int", nullable: false),
RowsRejected = table.Column<int>(type: "int", nullable: false),
FinalisedAtUtc = table.Column<DateTime>(type: "datetime2(3)", nullable: true)
},
constraints: table =>
{
table.PrimaryKey("PK_EquipmentImportBatch", x => x.Id);
});
migrationBuilder.CreateTable(
name: "EquipmentImportRow",
columns: table => new
{
Id = table.Column<Guid>(type: "uniqueidentifier", nullable: false),
BatchId = table.Column<Guid>(type: "uniqueidentifier", nullable: false),
LineNumberInFile = table.Column<int>(type: "int", nullable: false),
IsAccepted = table.Column<bool>(type: "bit", nullable: false),
RejectReason = table.Column<string>(type: "nvarchar(512)", maxLength: 512, nullable: true),
ZTag = table.Column<string>(type: "nvarchar(128)", maxLength: 128, nullable: false),
MachineCode = table.Column<string>(type: "nvarchar(128)", maxLength: 128, nullable: false),
SAPID = table.Column<string>(type: "nvarchar(128)", maxLength: 128, nullable: false),
EquipmentId = table.Column<string>(type: "nvarchar(64)", maxLength: 64, nullable: false),
EquipmentUuid = table.Column<string>(type: "nvarchar(64)", maxLength: 64, nullable: false),
Name = table.Column<string>(type: "nvarchar(128)", maxLength: 128, nullable: false),
UnsAreaName = table.Column<string>(type: "nvarchar(64)", maxLength: 64, nullable: false),
UnsLineName = table.Column<string>(type: "nvarchar(64)", maxLength: 64, nullable: false),
Manufacturer = table.Column<string>(type: "nvarchar(256)", maxLength: 256, nullable: true),
Model = table.Column<string>(type: "nvarchar(256)", maxLength: 256, nullable: true),
SerialNumber = table.Column<string>(type: "nvarchar(256)", maxLength: 256, nullable: true),
HardwareRevision = table.Column<string>(type: "nvarchar(64)", maxLength: 64, nullable: true),
SoftwareRevision = table.Column<string>(type: "nvarchar(64)", maxLength: 64, nullable: true),
YearOfConstruction = table.Column<string>(type: "nvarchar(8)", maxLength: 8, nullable: true),
AssetLocation = table.Column<string>(type: "nvarchar(512)", maxLength: 512, nullable: true),
ManufacturerUri = table.Column<string>(type: "nvarchar(512)", maxLength: 512, nullable: true),
DeviceManualUri = table.Column<string>(type: "nvarchar(512)", maxLength: 512, nullable: true)
},
constraints: table =>
{
table.PrimaryKey("PK_EquipmentImportRow", x => x.Id);
table.ForeignKey(
name: "FK_EquipmentImportRow_EquipmentImportBatch_BatchId",
column: x => x.BatchId,
principalTable: "EquipmentImportBatch",
principalColumn: "Id",
onDelete: ReferentialAction.Cascade);
});
migrationBuilder.CreateIndex(
name: "IX_EquipmentImportBatch_Creator_Finalised",
table: "EquipmentImportBatch",
columns: new[] { "CreatedBy", "FinalisedAtUtc" });
migrationBuilder.CreateIndex(
name: "IX_EquipmentImportRow_Batch",
table: "EquipmentImportRow",
column: "BatchId");
}
/// <inheritdoc />
protected override void Down(MigrationBuilder migrationBuilder)
{
migrationBuilder.DropTable(
name: "EquipmentImportRow");
migrationBuilder.DropTable(
name: "EquipmentImportBatch");
}
}
}

View File

@@ -0,0 +1,172 @@
using Microsoft.EntityFrameworkCore.Migrations;
#nullable disable
namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
{
/// <summary>
/// Extends <c>dbo.sp_ComputeGenerationDiff</c> to emit <c>NodeAcl</c> rows alongside the
/// existing Namespace/DriverInstance/Equipment/Tag output — closes the final slice of
/// task #196 (DiffViewer ACL section). Logical id for NodeAcl is a composite
/// <c>LdapGroup|ScopeKind|ScopeId</c> triple so a Change row surfaces whether the grant
/// shifted permissions, moved scope, or was added/removed outright.
/// </summary>
/// <inheritdoc />
public partial class ExtendComputeGenerationDiffWithNodeAcl : Migration
{
/// <inheritdoc />
protected override void Up(MigrationBuilder migrationBuilder)
{
migrationBuilder.Sql(Procs.ComputeGenerationDiffV2);
}
/// <inheritdoc />
protected override void Down(MigrationBuilder migrationBuilder)
{
migrationBuilder.Sql(Procs.ComputeGenerationDiffV1);
}
private static class Procs
{
/// <summary>V2 — adds the NodeAcl section to the diff output.</summary>
public const string ComputeGenerationDiffV2 = @"
CREATE OR ALTER PROCEDURE dbo.sp_ComputeGenerationDiff
@FromGenerationId bigint,
@ToGenerationId bigint
AS
BEGIN
SET NOCOUNT ON;
CREATE TABLE #diff (TableName nvarchar(32), LogicalId nvarchar(128), ChangeKind nvarchar(16));
WITH f AS (SELECT NamespaceId AS LogicalId, CHECKSUM(NamespaceUri, Kind, Enabled, Notes) AS Sig FROM dbo.Namespace WHERE GenerationId = @FromGenerationId),
t AS (SELECT NamespaceId AS LogicalId, CHECKSUM(NamespaceUri, Kind, Enabled, Notes) AS Sig FROM dbo.Namespace WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Namespace', CONVERT(nvarchar(128), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT DriverInstanceId AS LogicalId, CHECKSUM(ClusterId, NamespaceId, Name, DriverType, Enabled, CONVERT(varchar(max), DriverConfig)) AS Sig FROM dbo.DriverInstance WHERE GenerationId = @FromGenerationId),
t AS (SELECT DriverInstanceId AS LogicalId, CHECKSUM(ClusterId, NamespaceId, Name, DriverType, Enabled, CONVERT(varchar(max), DriverConfig)) AS Sig FROM dbo.DriverInstance WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'DriverInstance', CONVERT(nvarchar(128), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT EquipmentId AS LogicalId, CHECKSUM(EquipmentUuid, DriverInstanceId, UnsLineId, Name, MachineCode, ZTag, SAPID, EquipmentClassRef, Manufacturer, Model, SerialNumber) AS Sig FROM dbo.Equipment WHERE GenerationId = @FromGenerationId),
t AS (SELECT EquipmentId AS LogicalId, CHECKSUM(EquipmentUuid, DriverInstanceId, UnsLineId, Name, MachineCode, ZTag, SAPID, EquipmentClassRef, Manufacturer, Model, SerialNumber) AS Sig FROM dbo.Equipment WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Equipment', CONVERT(nvarchar(128), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT TagId AS LogicalId, CHECKSUM(DriverInstanceId, DeviceId, EquipmentId, PollGroupId, FolderPath, Name, DataType, AccessLevel, WriteIdempotent, CONVERT(varchar(max), TagConfig)) AS Sig FROM dbo.Tag WHERE GenerationId = @FromGenerationId),
t AS (SELECT TagId AS LogicalId, CHECKSUM(DriverInstanceId, DeviceId, EquipmentId, PollGroupId, FolderPath, Name, DataType, AccessLevel, WriteIdempotent, CONVERT(varchar(max), TagConfig)) AS Sig FROM dbo.Tag WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Tag', CONVERT(nvarchar(128), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
-- NodeAcl section. Logical id is the (LdapGroup, ScopeKind, ScopeId) triple so the diff
-- distinguishes same row with new permissions (Modified via CHECKSUM on PermissionFlags + Notes)
-- from a scope move (which surfaces as Added + Removed of different logical ids).
WITH f AS (
SELECT CONVERT(nvarchar(128), LdapGroup + '|' + CONVERT(nvarchar(16), ScopeKind) + '|' + ISNULL(ScopeId, '(cluster)')) AS LogicalId,
CHECKSUM(ClusterId, PermissionFlags, Notes) AS Sig
FROM dbo.NodeAcl WHERE GenerationId = @FromGenerationId),
t AS (
SELECT CONVERT(nvarchar(128), LdapGroup + '|' + CONVERT(nvarchar(16), ScopeKind) + '|' + ISNULL(ScopeId, '(cluster)')) AS LogicalId,
CHECKSUM(ClusterId, PermissionFlags, Notes) AS Sig
FROM dbo.NodeAcl WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'NodeAcl', COALESCE(f.LogicalId, t.LogicalId),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
SELECT TableName, LogicalId, ChangeKind FROM #diff;
DROP TABLE #diff;
END
";
/// <summary>V1 — exact proc shipped in migration 20260417215224_StoredProcedures. Restored on Down().</summary>
public const string ComputeGenerationDiffV1 = @"
CREATE OR ALTER PROCEDURE dbo.sp_ComputeGenerationDiff
@FromGenerationId bigint,
@ToGenerationId bigint
AS
BEGIN
SET NOCOUNT ON;
CREATE TABLE #diff (TableName nvarchar(32), LogicalId nvarchar(64), ChangeKind nvarchar(16));
WITH f AS (SELECT NamespaceId AS LogicalId, CHECKSUM(NamespaceUri, Kind, Enabled, Notes) AS Sig FROM dbo.Namespace WHERE GenerationId = @FromGenerationId),
t AS (SELECT NamespaceId AS LogicalId, CHECKSUM(NamespaceUri, Kind, Enabled, Notes) AS Sig FROM dbo.Namespace WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Namespace', CONVERT(nvarchar(64), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT DriverInstanceId AS LogicalId, CHECKSUM(ClusterId, NamespaceId, Name, DriverType, Enabled, CONVERT(varchar(max), DriverConfig)) AS Sig FROM dbo.DriverInstance WHERE GenerationId = @FromGenerationId),
t AS (SELECT DriverInstanceId AS LogicalId, CHECKSUM(ClusterId, NamespaceId, Name, DriverType, Enabled, CONVERT(varchar(max), DriverConfig)) AS Sig FROM dbo.DriverInstance WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'DriverInstance', CONVERT(nvarchar(64), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT EquipmentId AS LogicalId, CHECKSUM(EquipmentUuid, DriverInstanceId, UnsLineId, Name, MachineCode, ZTag, SAPID, EquipmentClassRef, Manufacturer, Model, SerialNumber) AS Sig FROM dbo.Equipment WHERE GenerationId = @FromGenerationId),
t AS (SELECT EquipmentId AS LogicalId, CHECKSUM(EquipmentUuid, DriverInstanceId, UnsLineId, Name, MachineCode, ZTag, SAPID, EquipmentClassRef, Manufacturer, Model, SerialNumber) AS Sig FROM dbo.Equipment WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Equipment', CONVERT(nvarchar(64), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
WITH f AS (SELECT TagId AS LogicalId, CHECKSUM(DriverInstanceId, DeviceId, EquipmentId, PollGroupId, FolderPath, Name, DataType, AccessLevel, WriteIdempotent, CONVERT(varchar(max), TagConfig)) AS Sig FROM dbo.Tag WHERE GenerationId = @FromGenerationId),
t AS (SELECT TagId AS LogicalId, CHECKSUM(DriverInstanceId, DeviceId, EquipmentId, PollGroupId, FolderPath, Name, DataType, AccessLevel, WriteIdempotent, CONVERT(varchar(max), TagConfig)) AS Sig FROM dbo.Tag WHERE GenerationId = @ToGenerationId)
INSERT #diff
SELECT 'Tag', CONVERT(nvarchar(64), COALESCE(f.LogicalId, t.LogicalId)),
CASE WHEN f.LogicalId IS NULL THEN 'Added'
WHEN t.LogicalId IS NULL THEN 'Removed'
WHEN f.Sig <> t.Sig THEN 'Modified'
ELSE 'Unchanged' END
FROM f FULL OUTER JOIN t ON f.LogicalId = t.LogicalId
WHERE f.LogicalId IS NULL OR t.LogicalId IS NULL OR f.Sig <> t.Sig;
SELECT TableName, LogicalId, ChangeKind FROM #diff;
DROP TABLE #diff;
END
";
}
}
}

View File

@@ -413,6 +413,9 @@ namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
.HasMaxLength(64)
.HasColumnType("nvarchar(64)");
b.Property<string>("ResilienceConfig")
.HasColumnType("nvarchar(max)");
b.HasKey("DriverInstanceRowId");
b.HasIndex("ClusterId");
@@ -431,9 +434,50 @@ namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
b.ToTable("DriverInstance", null, t =>
{
t.HasCheckConstraint("CK_DriverInstance_DriverConfig_IsJson", "ISJSON(DriverConfig) = 1");
t.HasCheckConstraint("CK_DriverInstance_ResilienceConfig_IsJson", "ResilienceConfig IS NULL OR ISJSON(ResilienceConfig) = 1");
});
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.DriverInstanceResilienceStatus", b =>
{
b.Property<string>("DriverInstanceId")
.HasMaxLength(64)
.HasColumnType("nvarchar(64)");
b.Property<string>("HostName")
.HasMaxLength(256)
.HasColumnType("nvarchar(256)");
b.Property<long>("BaselineFootprintBytes")
.HasColumnType("bigint");
b.Property<int>("ConsecutiveFailures")
.HasColumnType("int");
b.Property<int>("CurrentBulkheadDepth")
.HasColumnType("int");
b.Property<long>("CurrentFootprintBytes")
.HasColumnType("bigint");
b.Property<DateTime?>("LastCircuitBreakerOpenUtc")
.HasColumnType("datetime2(3)");
b.Property<DateTime?>("LastRecycleUtc")
.HasColumnType("datetime2(3)");
b.Property<DateTime>("LastSampledUtc")
.HasColumnType("datetime2(3)");
b.HasKey("DriverInstanceId", "HostName");
b.HasIndex("LastSampledUtc")
.HasDatabaseName("IX_DriverResilience_LastSampled");
b.ToTable("DriverInstanceResilienceStatus", (string)null);
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.Equipment", b =>
{
b.Property<Guid>("EquipmentRowId")
@@ -560,6 +604,148 @@ namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
b.ToTable("Equipment", (string)null);
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.EquipmentImportBatch", b =>
{
b.Property<Guid>("Id")
.ValueGeneratedOnAdd()
.HasColumnType("uniqueidentifier");
b.Property<string>("ClusterId")
.IsRequired()
.HasMaxLength(64)
.HasColumnType("nvarchar(64)");
b.Property<DateTime>("CreatedAtUtc")
.HasColumnType("datetime2(3)");
b.Property<string>("CreatedBy")
.IsRequired()
.HasMaxLength(128)
.HasColumnType("nvarchar(128)");
b.Property<DateTime?>("FinalisedAtUtc")
.HasColumnType("datetime2(3)");
b.Property<int>("RowsAccepted")
.HasColumnType("int");
b.Property<int>("RowsRejected")
.HasColumnType("int");
b.Property<int>("RowsStaged")
.HasColumnType("int");
b.HasKey("Id");
b.HasIndex("CreatedBy", "FinalisedAtUtc")
.HasDatabaseName("IX_EquipmentImportBatch_Creator_Finalised");
b.ToTable("EquipmentImportBatch", (string)null);
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.EquipmentImportRow", b =>
{
b.Property<Guid>("Id")
.ValueGeneratedOnAdd()
.HasColumnType("uniqueidentifier");
b.Property<string>("AssetLocation")
.HasMaxLength(512)
.HasColumnType("nvarchar(512)");
b.Property<Guid>("BatchId")
.HasColumnType("uniqueidentifier");
b.Property<string>("DeviceManualUri")
.HasMaxLength(512)
.HasColumnType("nvarchar(512)");
b.Property<string>("EquipmentId")
.IsRequired()
.HasMaxLength(64)
.HasColumnType("nvarchar(64)");
b.Property<string>("EquipmentUuid")
.IsRequired()
.HasMaxLength(64)
.HasColumnType("nvarchar(64)");
b.Property<string>("HardwareRevision")
.HasMaxLength(64)
.HasColumnType("nvarchar(64)");
b.Property<bool>("IsAccepted")
.HasColumnType("bit");
b.Property<int>("LineNumberInFile")
.HasColumnType("int");
b.Property<string>("MachineCode")
.IsRequired()
.HasMaxLength(128)
.HasColumnType("nvarchar(128)");
b.Property<string>("Manufacturer")
.HasMaxLength(256)
.HasColumnType("nvarchar(256)");
b.Property<string>("ManufacturerUri")
.HasMaxLength(512)
.HasColumnType("nvarchar(512)");
b.Property<string>("Model")
.HasMaxLength(256)
.HasColumnType("nvarchar(256)");
b.Property<string>("Name")
.IsRequired()
.HasMaxLength(128)
.HasColumnType("nvarchar(128)");
b.Property<string>("RejectReason")
.HasMaxLength(512)
.HasColumnType("nvarchar(512)");
b.Property<string>("SAPID")
.IsRequired()
.HasMaxLength(128)
.HasColumnType("nvarchar(128)");
b.Property<string>("SerialNumber")
.HasMaxLength(256)
.HasColumnType("nvarchar(256)");
b.Property<string>("SoftwareRevision")
.HasMaxLength(64)
.HasColumnType("nvarchar(64)");
b.Property<string>("UnsAreaName")
.IsRequired()
.HasMaxLength(64)
.HasColumnType("nvarchar(64)");
b.Property<string>("UnsLineName")
.IsRequired()
.HasMaxLength(64)
.HasColumnType("nvarchar(64)");
b.Property<string>("YearOfConstruction")
.HasMaxLength(8)
.HasColumnType("nvarchar(8)");
b.Property<string>("ZTag")
.IsRequired()
.HasMaxLength(128)
.HasColumnType("nvarchar(128)");
b.HasKey("Id");
b.HasIndex("BatchId")
.HasDatabaseName("IX_EquipmentImportRow_Batch");
b.ToTable("EquipmentImportRow", (string)null);
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.ExternalIdReservation", b =>
{
b.Property<Guid>("ReservationId")
@@ -624,6 +810,51 @@ namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
b.ToTable("ExternalIdReservation", (string)null);
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.LdapGroupRoleMapping", b =>
{
b.Property<Guid>("Id")
.ValueGeneratedOnAdd()
.HasColumnType("uniqueidentifier");
b.Property<string>("ClusterId")
.HasMaxLength(64)
.HasColumnType("nvarchar(64)");
b.Property<DateTime>("CreatedAtUtc")
.HasColumnType("datetime2(3)");
b.Property<bool>("IsSystemWide")
.HasColumnType("bit");
b.Property<string>("LdapGroup")
.IsRequired()
.HasMaxLength(512)
.HasColumnType("nvarchar(512)");
b.Property<string>("Notes")
.HasMaxLength(512)
.HasColumnType("nvarchar(512)");
b.Property<string>("Role")
.IsRequired()
.HasMaxLength(32)
.HasColumnType("nvarchar(32)");
b.HasKey("Id");
b.HasIndex("ClusterId");
b.HasIndex("LdapGroup")
.HasDatabaseName("IX_LdapGroupRoleMapping_Group");
b.HasIndex("LdapGroup", "ClusterId")
.IsUnique()
.HasDatabaseName("UX_LdapGroupRoleMapping_Group_Cluster")
.HasFilter("[ClusterId] IS NOT NULL");
b.ToTable("LdapGroupRoleMapping", (string)null);
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.Namespace", b =>
{
b.Property<Guid>("NamespaceRowId")
@@ -1142,6 +1373,27 @@ namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
b.Navigation("Generation");
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.EquipmentImportRow", b =>
{
b.HasOne("ZB.MOM.WW.OtOpcUa.Configuration.Entities.EquipmentImportBatch", "Batch")
.WithMany("Rows")
.HasForeignKey("BatchId")
.OnDelete(DeleteBehavior.Cascade)
.IsRequired();
b.Navigation("Batch");
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.LdapGroupRoleMapping", b =>
{
b.HasOne("ZB.MOM.WW.OtOpcUa.Configuration.Entities.ServerCluster", "Cluster")
.WithMany()
.HasForeignKey("ClusterId")
.OnDelete(DeleteBehavior.Cascade);
b.Navigation("Cluster");
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.Namespace", b =>
{
b.HasOne("ZB.MOM.WW.OtOpcUa.Configuration.Entities.ServerCluster", "Cluster")
@@ -1231,6 +1483,11 @@ namespace ZB.MOM.WW.OtOpcUa.Configuration.Migrations
b.Navigation("GenerationState");
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.EquipmentImportBatch", b =>
{
b.Navigation("Rows");
});
modelBuilder.Entity("ZB.MOM.WW.OtOpcUa.Configuration.Entities.ServerCluster", b =>
{
b.Navigation("Generations");

View File

@@ -28,6 +28,10 @@ public sealed class OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbConte
public DbSet<ConfigAuditLog> ConfigAuditLogs => Set<ConfigAuditLog>();
public DbSet<ExternalIdReservation> ExternalIdReservations => Set<ExternalIdReservation>();
public DbSet<DriverHostStatus> DriverHostStatuses => Set<DriverHostStatus>();
public DbSet<DriverInstanceResilienceStatus> DriverInstanceResilienceStatuses => Set<DriverInstanceResilienceStatus>();
public DbSet<LdapGroupRoleMapping> LdapGroupRoleMappings => Set<LdapGroupRoleMapping>();
public DbSet<EquipmentImportBatch> EquipmentImportBatches => Set<EquipmentImportBatch>();
public DbSet<EquipmentImportRow> EquipmentImportRows => Set<EquipmentImportRow>();
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
@@ -49,6 +53,9 @@ public sealed class OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbConte
ConfigureConfigAuditLog(modelBuilder);
ConfigureExternalIdReservation(modelBuilder);
ConfigureDriverHostStatus(modelBuilder);
ConfigureDriverInstanceResilienceStatus(modelBuilder);
ConfigureLdapGroupRoleMapping(modelBuilder);
ConfigureEquipmentImportBatch(modelBuilder);
}
private static void ConfigureServerCluster(ModelBuilder modelBuilder)
@@ -247,6 +254,8 @@ public sealed class OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbConte
{
t.HasCheckConstraint("CK_DriverInstance_DriverConfig_IsJson",
"ISJSON(DriverConfig) = 1");
t.HasCheckConstraint("CK_DriverInstance_ResilienceConfig_IsJson",
"ResilienceConfig IS NULL OR ISJSON(ResilienceConfig) = 1");
});
e.HasKey(x => x.DriverInstanceRowId);
e.Property(x => x.DriverInstanceRowId).HasDefaultValueSql("NEWSEQUENTIALID()");
@@ -256,6 +265,7 @@ public sealed class OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbConte
e.Property(x => x.Name).HasMaxLength(128);
e.Property(x => x.DriverType).HasMaxLength(32);
e.Property(x => x.DriverConfig).HasColumnType("nvarchar(max)");
e.Property(x => x.ResilienceConfig).HasColumnType("nvarchar(max)");
e.HasOne(x => x.Generation).WithMany().HasForeignKey(x => x.GenerationId).OnDelete(DeleteBehavior.Restrict);
e.HasOne(x => x.Cluster).WithMany().HasForeignKey(x => x.ClusterId).OnDelete(DeleteBehavior.Restrict);
@@ -512,4 +522,101 @@ public sealed class OtOpcUaConfigDbContext(DbContextOptions<OtOpcUaConfigDbConte
e.HasIndex(x => x.LastSeenUtc).HasDatabaseName("IX_DriverHostStatus_LastSeen");
});
}
private static void ConfigureDriverInstanceResilienceStatus(ModelBuilder modelBuilder)
{
modelBuilder.Entity<DriverInstanceResilienceStatus>(e =>
{
e.ToTable("DriverInstanceResilienceStatus");
e.HasKey(x => new { x.DriverInstanceId, x.HostName });
e.Property(x => x.DriverInstanceId).HasMaxLength(64);
e.Property(x => x.HostName).HasMaxLength(256);
e.Property(x => x.LastCircuitBreakerOpenUtc).HasColumnType("datetime2(3)");
e.Property(x => x.LastRecycleUtc).HasColumnType("datetime2(3)");
e.Property(x => x.LastSampledUtc).HasColumnType("datetime2(3)");
// LastSampledUtc drives the Admin UI's stale-sample filter same way DriverHostStatus's
// LastSeenUtc index does for connectivity rows.
e.HasIndex(x => x.LastSampledUtc).HasDatabaseName("IX_DriverResilience_LastSampled");
});
}
private static void ConfigureLdapGroupRoleMapping(ModelBuilder modelBuilder)
{
modelBuilder.Entity<LdapGroupRoleMapping>(e =>
{
e.ToTable("LdapGroupRoleMapping");
e.HasKey(x => x.Id);
e.Property(x => x.LdapGroup).HasMaxLength(512).IsRequired();
e.Property(x => x.Role).HasConversion<string>().HasMaxLength(32);
e.Property(x => x.ClusterId).HasMaxLength(64);
e.Property(x => x.CreatedAtUtc).HasColumnType("datetime2(3)");
e.Property(x => x.Notes).HasMaxLength(512);
// FK to ServerCluster when cluster-scoped; null for system-wide grants.
e.HasOne(x => x.Cluster)
.WithMany()
.HasForeignKey(x => x.ClusterId)
.OnDelete(DeleteBehavior.Cascade);
// Uniqueness: one row per (LdapGroup, ClusterId). Null ClusterId is treated as its own
// "bucket" so a system-wide row coexists with cluster-scoped rows for the same group.
// SQL Server treats NULL as a distinct value in unique-index comparisons by default
// since 2008 SP1 onwards under the session setting we use — tested in SchemaCompliance.
e.HasIndex(x => new { x.LdapGroup, x.ClusterId })
.IsUnique()
.HasDatabaseName("UX_LdapGroupRoleMapping_Group_Cluster");
// Hot-path lookup during cookie auth: "what grants does this user's set of LDAP
// groups carry?". Fires on every sign-in so the index earns its keep.
e.HasIndex(x => x.LdapGroup).HasDatabaseName("IX_LdapGroupRoleMapping_Group");
});
}
private static void ConfigureEquipmentImportBatch(ModelBuilder modelBuilder)
{
modelBuilder.Entity<EquipmentImportBatch>(e =>
{
e.ToTable("EquipmentImportBatch");
e.HasKey(x => x.Id);
e.Property(x => x.ClusterId).HasMaxLength(64);
e.Property(x => x.CreatedBy).HasMaxLength(128);
e.Property(x => x.CreatedAtUtc).HasColumnType("datetime2(3)");
e.Property(x => x.FinalisedAtUtc).HasColumnType("datetime2(3)");
// Admin preview modal filters by user; finalise / drop both hit this index.
e.HasIndex(x => new { x.CreatedBy, x.FinalisedAtUtc })
.HasDatabaseName("IX_EquipmentImportBatch_Creator_Finalised");
});
modelBuilder.Entity<EquipmentImportRow>(e =>
{
e.ToTable("EquipmentImportRow");
e.HasKey(x => x.Id);
e.Property(x => x.ZTag).HasMaxLength(128);
e.Property(x => x.MachineCode).HasMaxLength(128);
e.Property(x => x.SAPID).HasMaxLength(128);
e.Property(x => x.EquipmentId).HasMaxLength(64);
e.Property(x => x.EquipmentUuid).HasMaxLength(64);
e.Property(x => x.Name).HasMaxLength(128);
e.Property(x => x.UnsAreaName).HasMaxLength(64);
e.Property(x => x.UnsLineName).HasMaxLength(64);
e.Property(x => x.Manufacturer).HasMaxLength(256);
e.Property(x => x.Model).HasMaxLength(256);
e.Property(x => x.SerialNumber).HasMaxLength(256);
e.Property(x => x.HardwareRevision).HasMaxLength(64);
e.Property(x => x.SoftwareRevision).HasMaxLength(64);
e.Property(x => x.YearOfConstruction).HasMaxLength(8);
e.Property(x => x.AssetLocation).HasMaxLength(512);
e.Property(x => x.ManufacturerUri).HasMaxLength(512);
e.Property(x => x.DeviceManualUri).HasMaxLength(512);
e.Property(x => x.RejectReason).HasMaxLength(512);
e.HasOne(x => x.Batch)
.WithMany(b => b.Rows)
.HasForeignKey(x => x.BatchId)
.OnDelete(DeleteBehavior.Cascade);
e.HasIndex(x => x.BatchId).HasDatabaseName("IX_EquipmentImportRow_Batch");
});
}
}

View File

@@ -0,0 +1,47 @@
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
namespace ZB.MOM.WW.OtOpcUa.Configuration.Services;
/// <summary>
/// CRUD surface for <see cref="LdapGroupRoleMapping"/> — the control-plane mapping from
/// LDAP groups to Admin UI roles. Consumed only by Admin UI code paths; the OPC UA
/// data-path evaluator MUST NOT depend on this interface (see decision #150 and the
/// Phase 6.2 compliance check on control/data-plane separation).
/// </summary>
/// <remarks>
/// Per Phase 6.2 Stream A.2 this service is expected to run behind the Phase 6.1
/// <c>ResilientConfigReader</c> pipeline (timeout → retry → fallback-to-cache) so a
/// transient DB outage during sign-in falls back to the sealed snapshot rather than
/// denying every login.
/// </remarks>
public interface ILdapGroupRoleMappingService
{
/// <summary>List every mapping whose LDAP group matches one of <paramref name="ldapGroups"/>.</summary>
/// <remarks>
/// Hot path — fires on every sign-in. The default EF implementation relies on the
/// <c>IX_LdapGroupRoleMapping_Group</c> index. Case-insensitive per LDAP conventions.
/// </remarks>
Task<IReadOnlyList<LdapGroupRoleMapping>> GetByGroupsAsync(
IEnumerable<string> ldapGroups, CancellationToken cancellationToken);
/// <summary>Enumerate every mapping; Admin UI listing only.</summary>
Task<IReadOnlyList<LdapGroupRoleMapping>> ListAllAsync(CancellationToken cancellationToken);
/// <summary>Create a new grant.</summary>
/// <exception cref="InvalidLdapGroupRoleMappingException">
/// Thrown when the proposed row violates an invariant (IsSystemWide inconsistent with
/// ClusterId, duplicate (group, cluster) pair, etc.) — ValidatedLdapGroupRoleMappingService
/// is the write surface that enforces these; the raw service here surfaces DB-level violations.
/// </exception>
Task<LdapGroupRoleMapping> CreateAsync(LdapGroupRoleMapping row, CancellationToken cancellationToken);
/// <summary>Delete a mapping by its surrogate key.</summary>
Task DeleteAsync(Guid id, CancellationToken cancellationToken);
}
/// <summary>Thrown when <see cref="LdapGroupRoleMapping"/> authoring violates an invariant.</summary>
public sealed class InvalidLdapGroupRoleMappingException : Exception
{
public InvalidLdapGroupRoleMappingException(string message) : base(message) { }
}

View File

@@ -0,0 +1,69 @@
using Microsoft.EntityFrameworkCore;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
namespace ZB.MOM.WW.OtOpcUa.Configuration.Services;
/// <summary>
/// EF Core implementation of <see cref="ILdapGroupRoleMappingService"/>. Enforces the
/// "exactly one of (ClusterId, IsSystemWide)" invariant at the write surface so a
/// malformed row can't land in the DB.
/// </summary>
public sealed class LdapGroupRoleMappingService(OtOpcUaConfigDbContext db) : ILdapGroupRoleMappingService
{
public async Task<IReadOnlyList<LdapGroupRoleMapping>> GetByGroupsAsync(
IEnumerable<string> ldapGroups, CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(ldapGroups);
var groupSet = ldapGroups.ToList();
if (groupSet.Count == 0) return [];
return await db.LdapGroupRoleMappings
.AsNoTracking()
.Where(m => groupSet.Contains(m.LdapGroup))
.ToListAsync(cancellationToken)
.ConfigureAwait(false);
}
public async Task<IReadOnlyList<LdapGroupRoleMapping>> ListAllAsync(CancellationToken cancellationToken)
=> await db.LdapGroupRoleMappings
.AsNoTracking()
.OrderBy(m => m.LdapGroup)
.ThenBy(m => m.ClusterId)
.ToListAsync(cancellationToken)
.ConfigureAwait(false);
public async Task<LdapGroupRoleMapping> CreateAsync(LdapGroupRoleMapping row, CancellationToken cancellationToken)
{
ArgumentNullException.ThrowIfNull(row);
ValidateInvariants(row);
if (row.Id == Guid.Empty) row.Id = Guid.NewGuid();
if (row.CreatedAtUtc == default) row.CreatedAtUtc = DateTime.UtcNow;
db.LdapGroupRoleMappings.Add(row);
await db.SaveChangesAsync(cancellationToken).ConfigureAwait(false);
return row;
}
public async Task DeleteAsync(Guid id, CancellationToken cancellationToken)
{
var existing = await db.LdapGroupRoleMappings.FindAsync([id], cancellationToken).ConfigureAwait(false);
if (existing is null) return;
db.LdapGroupRoleMappings.Remove(existing);
await db.SaveChangesAsync(cancellationToken).ConfigureAwait(false);
}
private static void ValidateInvariants(LdapGroupRoleMapping row)
{
if (string.IsNullOrWhiteSpace(row.LdapGroup))
throw new InvalidLdapGroupRoleMappingException("LdapGroup must not be empty.");
if (row.IsSystemWide && !string.IsNullOrEmpty(row.ClusterId))
throw new InvalidLdapGroupRoleMappingException(
"IsSystemWide=true requires ClusterId to be null. A fleet-wide grant cannot also be cluster-scoped.");
if (!row.IsSystemWide && string.IsNullOrEmpty(row.ClusterId))
throw new InvalidLdapGroupRoleMappingException(
"IsSystemWide=false requires a populated ClusterId. A cluster-scoped grant needs its target cluster.");
}
}

Some files were not shown because too many files have changed in this diff Show More