From 5506b43ddcb24075c1228259448684539daa3cca Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Mon, 20 Apr 2026 01:30:56 -0400 Subject: [PATCH] =?UTF-8?q?Doc=20refresh=20(task=20#204)=20=E2=80=94=20ope?= =?UTF-8?q?rational=20docs=20for=20multi-process=20multi-driver=20OtOpcUa?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Five operational docs rewritten for v2 (multi-process, multi-driver, Config-DB authoritative): - docs/Configuration.md — replaced appsettings-only story with the two-layer model. appsettings.json is bootstrap only (Node identity, Config DB connection string, transport security, LDAP bind, logging). Authoritative config (clusters, namespaces, UNS, equipment, tags, driver instances, ACLs, role grants, poll groups) lives in the Config DB accessed via OtOpcUaConfigDbContext and edited through the Admin UI draft/publish workflow. Added v1-to-v2 migration index so operators can locate where each old section moved. Cross-links to docs/v2/config-db-schema.md + docs/v2/admin-ui.md. - docs/Redundancy.md — Phase 6.3 rewrite. Named every class under src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/: RedundancyCoordinator, RedundancyTopology, ApplyLeaseRegistry (publish fencing), PeerReachabilityTracker, RecoveryStateManager, ServiceLevelCalculator (pure function), RedundancyStatePublisher. Documented the full 11-band ServiceLevel matrix (Maintenance=0 through AuthoritativePrimary=255) from ServiceLevelCalculator.cs and the per-ClusterNode fields (RedundancyRole, ServiceLevelBase, ApplicationUri). Covered metrics (otopcua.redundancy.role_transition counter + primary/secondary/stale_count gauges on meter ZB.MOM.WW.OtOpcUa.Redundancy) and SignalR RoleChanged push from FleetStatusPoller to RedundancyTab.razor. - docs/security.md — preserved the transport-security section (still accurate) and added Phase 6.2 authorization. 
Four concerns now documented in one place: (1) transport security profiles, (2) OPC UA auth via LdapUserAuthenticator (note: task spec called this LdapAuthenticationProvider — actual class name is LdapUserAuthenticator in Server/Security/), (3) data-plane authorization via NodeAcl + PermissionTrie + AuthorizationGate — additive-only model per decision #129, ClusterId → Namespace → UnsArea → UnsLine → Equipment → Tag hierarchy, NodePermissions bundle, PermissionProbeService in Admin for "probe this permission", (4) control-plane authorization via LdapGroupRoleMapping + AdminRole (ConfigViewer / ConfigEditor / FleetAdmin, CanEdit / CanPublish policies) — deliberately independent of data-plane ACLs per decision #150. Documented the OTOPCUA0001 Roslyn analyzer (UnwrappedCapabilityCallAnalyzer) as the compile-time guard ensuring every driver-capability async call is wrapped by CapabilityInvoker. - docs/ServiceHosting.md — three-process rewrite: OtOpcUa Server (net10 x64, BackgroundService + AddWindowsService, hosts OPC UA endpoint + all non-Galaxy drivers), OtOpcUa Admin (net10 x64, Blazor Server + SignalR + /metrics via OpenTelemetry Prometheus exporter), OtOpcUa Galaxy.Host (.NET Framework 4.8 x86, NSSM-wrapped, env-variable driven, STA thread + MXAccess COM). Pipe ACL denies-Admins detail + non-elevated shell requirement captured from feedback memory. Divergence from CLAUDE.md: task spec said "TopShelf is still the service-installer wrapper per CLAUDE.md note" but no csproj in the repo references TopShelf — decision #30 replaced it with the generic host's AddWindowsService wrapper (per the doc comment on OpcUaServerService). Reflected the actual state + flagged this divergence here so someone can update CLAUDE.md separately. - docs/StatusDashboard.md — replaced the full v1 reference (dashboard endpoints, health check rules, StatusData DTO, etc.) 
with a short "superseded by Admin UI" pointer that preserves git-blame continuity + avoids broken links from other docs that reference it. Class references verified by reading: src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/{RedundancyCoordinator, ServiceLevelCalculator, ApplyLeaseRegistry, RedundancyStatePublisher}.cs src/ZB.MOM.WW.OtOpcUa.Core/Authorization/{PermissionTrie, PermissionTrieBuilder, PermissionTrieCache, TriePermissionEvaluator, AuthorizationGate}.cs src/ZB.MOM.WW.OtOpcUa.Server/Security/{AuthorizationGate, LdapUserAuthenticator}.cs src/ZB.MOM.WW.OtOpcUa.Admin/{Program.cs, Services/AdminRoles.cs, Services/RedundancyMetrics.cs, Hubs/FleetStatusPoller.cs} src/ZB.MOM.WW.OtOpcUa.Server/Program.cs + appsettings.json src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/{Program.cs, Ipc/PipeServer.cs} src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/{ClusterNode, NodeAcl, LdapGroupRoleMapping}.cs src/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/Configuration.md | 473 ++++++++++------------------------- docs/Redundancy.md | 223 ++++++----------- docs/ServiceHosting.md | 241 +++++++----------- docs/StatusDashboard.md | 280 +-------------------- docs/security.md | 528 ++++++++++++++++------------------------ 5 files changed, 509 insertions(+), 1236 deletions(-) diff --git a/docs/Configuration.md b/docs/Configuration.md index 213c978..b55ef1a 100644 --- a/docs/Configuration.md +++ b/docs/Configuration.md @@ -1,370 +1,141 @@ # Configuration -## Overview +## Two-layer model -The service loads configuration from `appsettings.json` at startup using the Microsoft.Extensions.Configuration stack. `AppConfiguration` is the root holder class that aggregates typed sections: `OpcUa`, `MxAccess`, `GalaxyRepository`, `Dashboard`, `Historian`, `Authentication`, and `Security`. Each section binds to a dedicated POCO class with sensible defaults, so the service runs with zero configuration on a standard deployment. 
+OtOpcUa configuration is split into two layers: -## Config Binding Pattern +| Layer | Where | Scope | Edited by | +|---|---|---|---| +| **Bootstrap** | `appsettings.json` per process | Enough to start the process and reach the Config DB | Local file edit + process restart | +| **Authoritative config** | Config DB (SQL Server) via `OtOpcUaConfigDbContext` | Clusters, namespaces, UNS hierarchy, equipment, tags, driver instances, ACLs, role grants, poll groups | Admin UI draft/publish workflow | -The production constructor in `OpcUaService` builds the configuration pipeline and binds each JSON section to its typed class: +The rule: if the setting describes *how the process connects to the rest of the world* (Config DB connection string, LDAP bind, transport security profile, node identity, logging), it lives in `appsettings.json`. If it describes *what the fleet does* (clusters, drivers, tags, UNS, ACLs), it lives in the Config DB and is edited through the Admin UI. -```csharp -var configuration = new ConfigurationBuilder() - .AddJsonFile("appsettings.json", optional: false) - .AddJsonFile($"appsettings.{Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT") ?? "Production"}.json", optional: true) - .AddEnvironmentVariables() - .Build(); +--- -_config = new AppConfiguration(); -configuration.GetSection("OpcUa").Bind(_config.OpcUa); -configuration.GetSection("MxAccess").Bind(_config.MxAccess); -configuration.GetSection("GalaxyRepository").Bind(_config.GalaxyRepository); -configuration.GetSection("Dashboard").Bind(_config.Dashboard); -configuration.GetSection("Historian").Bind(_config.Historian); -configuration.GetSection("Authentication").Bind(_config.Authentication); -configuration.GetSection("Security").Bind(_config.Security); -``` +## Bootstrap configuration (`appsettings.json`) -This pattern uses `IConfiguration.GetSection().Bind()` rather than `IOptions` because the service targets .NET Framework 4.8, where the full dependency injection container is not used. 
+Each of the three processes (Server, Admin, Galaxy.Host) reads its own `appsettings.json` plus environment overrides. -## Environment-Specific Overrides +### OtOpcUa Server — `src/ZB.MOM.WW.OtOpcUa.Server/appsettings.json` -The configuration pipeline supports three layers of override, applied in order: +Bootstrap-only. `Program.cs` reads the following sections: -1. `appsettings.json` -- base configuration (required) -2. `appsettings.{DOTNET_ENVIRONMENT}.json` -- environment-specific overlay (optional) -3. Environment variables -- highest priority, useful for deployment automation +| Section | Keys | Purpose | +|---|---|---| +| `Node` | `NodeId`, `ClusterId`, `ConfigDbConnectionString`, `LocalCachePath` | Identity + path to the Config DB + LiteDB offline cache path. | +| `OpcUaServer` | `EndpointUrl`, `ApplicationName`, `ApplicationUri`, `PkiStoreRoot`, `AutoAcceptUntrustedClientCertificates`, `SecurityProfile` | OPC UA endpoint + transport security. See [`security.md`](security.md). | +| `OpcUaServer:Ldap` | `Enabled`, `Server`, `Port`, `UseTls`, `AllowInsecureLdap`, `SearchBase`, `ServiceAccountDn`, `ServiceAccountPassword`, `GroupToRole`, `UserNameAttribute`, `GroupAttribute` | LDAP auth for OPC UA UserName tokens. See [`security.md`](security.md). | +| `Serilog` | Standard Serilog keys + `WriteJson` bool | Logging verbosity + optional JSON file sink for SIEM ingest. | +| `Authorization` | `StrictMode` (bool) | Flip `true` to fail-closed on sessions lacking LDAP group metadata. Default false during ACL rollouts. | +| `Metrics:Prometheus:Enabled` | bool | Toggles the `/metrics` endpoint. | -Set the `DOTNET_ENVIRONMENT` variable to load a named overlay file. For example, setting `DOTNET_ENVIRONMENT=Staging` loads `appsettings.Staging.json` if it exists. - -Environment variables follow the standard `Section__Property` naming convention. For example, `OpcUa__Port=5840` overrides the OPC UA port.
- -## Configuration Sections - -### OpcUa - -Controls the OPC UA server endpoint and session limits. Defined in `OpcUaConfiguration`. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `BindAddress` | `string` | `"0.0.0.0"` | IP address or hostname the server binds to. Use `0.0.0.0` for all interfaces, `localhost` for local-only, or a specific IP | -| `Port` | `int` | `4840` | TCP port the OPC UA server listens on | -| `EndpointPath` | `string` | `"/LmxOpcUa"` | Path appended to the host URI | -| `ServerName` | `string` | `"LmxOpcUa"` | Server name presented to OPC UA clients | -| `GalaxyName` | `string` | `"ZB"` | Galaxy name used as the OPC UA namespace | -| `MaxSessions` | `int` | `100` | Maximum simultaneous OPC UA sessions | -| `SessionTimeoutMinutes` | `int` | `30` | Idle session timeout in minutes | -| `AlarmTrackingEnabled` | `bool` | `false` | Enables `AlarmConditionState` nodes for alarm attributes | -| `AlarmFilter.ObjectFilters` | `List` | `[]` | Wildcard template-name patterns (with `*`) that scope alarm tracking to matching objects and their descendants. Empty list disables filtering. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) | -| `ApplicationUri` | `string?` | `null` | Explicit application URI for this server instance. Required when redundancy is enabled. Defaults to `urn:{GalaxyName}:LmxOpcUa` when null | - -### MxAccess - -Controls the MXAccess runtime connection used for live tag reads and writes. Defined in `MxAccessConfiguration`. 
- -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `ClientName` | `string` | `"LmxOpcUa"` | Client name registered with MXAccess | -| `NodeName` | `string?` | `null` | Optional Galaxy node name to target | -| `GalaxyName` | `string?` | `null` | Optional Galaxy name for MXAccess reference resolution | -| `ReadTimeoutSeconds` | `int` | `5` | Maximum wait for a live tag read | -| `WriteTimeoutSeconds` | `int` | `5` | Maximum wait for a write acknowledgment | -| `MaxConcurrentOperations` | `int` | `10` | Cap on concurrent MXAccess operations | -| `MonitorIntervalSeconds` | `int` | `5` | Connectivity monitor probe interval | -| `AutoReconnect` | `bool` | `true` | Automatically re-establish dropped MXAccess sessions | -| `ProbeTag` | `string?` | `null` | Optional tag used to verify the runtime returns fresh data | -| `ProbeStaleThresholdSeconds` | `int` | `60` | Seconds a probe value may remain unchanged before the connection is considered stale | -| `RuntimeStatusProbesEnabled` | `bool` | `true` | Advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine` to track per-host runtime state. Drives the Galaxy Runtime dashboard panel, HealthCheck Rule 2e, and the Read-path short-circuit that invalidates OPC UA variable quality when a host is Stopped. Set `false` to return to legacy behavior where host state is invisible and the bridge serves whatever quality MxAccess reports for individual tags. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) | -| `RuntimeStatusUnknownTimeoutSeconds` | `int` | `15` | Maximum seconds to wait for the initial probe callback before marking a host as Stopped. Only applies to the Unknown → Stopped transition; Running hosts never time out because `ScanState` is delivered on-change only. 
A value below 5s triggers a validator warning | -| `RequestTimeoutSeconds` | `int` | `30` | Outer safety timeout applied to sync-over-async MxAccess operations invoked from the OPC UA stack thread (Read, Write, address-space rebuild probe sync). Backstop for the inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds`. A timed-out operation returns `BadTimeout`. Validator rejects values < 1 and warns if set below the inner Read/Write timeouts. See [MXAccess Bridge](MxAccessBridge.md#request-timeout-safety-backstop). Stability review 2026-04-13 Finding 3 | - -### GalaxyRepository - -Controls the Galaxy repository database connection used to build the OPC UA address space. Defined in `GalaxyRepositoryConfiguration`. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `ConnectionString` | `string` | `"Server=localhost;Database=ZB;Integrated Security=true;"` | SQL Server connection string for the Galaxy database | -| `ChangeDetectionIntervalSeconds` | `int` | `30` | How often the service polls for Galaxy deploy changes | -| `CommandTimeoutSeconds` | `int` | `30` | SQL command timeout for repository queries | -| `ExtendedAttributes` | `bool` | `false` | Load extended Galaxy attribute metadata into the OPC UA model | -| `Scope` | `GalaxyScope` | `"Galaxy"` | Controls how much of the Galaxy hierarchy is loaded. `Galaxy` loads all deployed objects (default). `LocalPlatform` loads only objects hosted by the platform deployed on this machine. See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter) | -| `PlatformName` | `string?` | `null` | Explicit platform hostname for `LocalPlatform` filtering. When null, uses `Environment.MachineName`. Only used when `Scope` is `LocalPlatform` | - -### Dashboard - -Controls the embedded HTTP status dashboard. Defined in `DashboardConfiguration`. 
- -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `Enabled` | `bool` | `true` | Whether the status dashboard is hosted | -| `Port` | `int` | `8081` | HTTP port for the dashboard endpoint | -| `RefreshIntervalSeconds` | `int` | `10` | HTML auto-refresh interval in seconds | - -### Historian - -Controls the Wonderware Historian SDK connection for OPC UA historical data access. Defined in `HistorianConfiguration`. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `Enabled` | `bool` | `false` | Enables OPC UA historical data access | -| `ServerName` | `string` | `"localhost"` | Single Historian server hostname used when `ServerNames` is empty. Preserved for backward compatibility with pre-cluster deployments | -| `ServerNames` | `List` | `[]` | Ordered list of Historian cluster nodes. When non-empty, supersedes `ServerName` and enables read-only cluster failover. See [Historical Data Access](HistoricalDataAccess.md#read-only-cluster-failover) | -| `FailureCooldownSeconds` | `int` | `60` | How long a failed cluster node is skipped before being re-tried. Zero disables the cooldown | -| `IntegratedSecurity` | `bool` | `true` | Use Windows authentication | -| `UserName` | `string?` | `null` | Username when `IntegratedSecurity` is false | -| `Password` | `string?` | `null` | Password when `IntegratedSecurity` is false | -| `Port` | `int` | `32568` | Historian TCP port | -| `CommandTimeoutSeconds` | `int` | `30` | SDK packet timeout in seconds (inner async bound) | -| `RequestTimeoutSeconds` | `int` | `60` | Outer safety timeout applied to sync-over-async Historian operations invoked from the OPC UA stack thread (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`). Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Validator rejects values < 1 and warns if set below `CommandTimeoutSeconds`. 
Stability review 2026-04-13 Finding 3 | -| `MaxValuesPerRead` | `int` | `10000` | Maximum values returned per `HistoryRead` request | - -### Authentication - -Controls user authentication and write authorization for the OPC UA server. Defined in `AuthenticationConfiguration`. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `AllowAnonymous` | `bool` | `true` | Accepts anonymous client connections when `true` | -| `AnonymousCanWrite` | `bool` | `true` | Permits anonymous users to write when `true` | - -#### LDAP Authentication - -When `Ldap.Enabled` is `true`, credentials are validated against the configured LDAP server and group membership determines OPC UA permissions. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `Ldap.Enabled` | `bool` | `false` | Enables LDAP authentication | -| `Ldap.Host` | `string` | `localhost` | LDAP server hostname | -| `Ldap.Port` | `int` | `3893` | LDAP server port | -| `Ldap.BaseDN` | `string` | `dc=lmxopcua,dc=local` | Base DN for LDAP operations | -| `Ldap.BindDnTemplate` | `string` | `cn={username},dc=lmxopcua,dc=local` | Bind DN template (`{username}` is replaced) | -| `Ldap.ServiceAccountDn` | `string` | `""` | Service account DN for group lookups | -| `Ldap.ServiceAccountPassword` | `string` | `""` | Service account password | -| `Ldap.TimeoutSeconds` | `int` | `5` | Connection timeout | -| `Ldap.ReadOnlyGroup` | `string` | `ReadOnly` | LDAP group granting read-only access | -| `Ldap.WriteOperateGroup` | `string` | `WriteOperate` | LDAP group granting write access for FreeAccess/Operate attributes | -| `Ldap.WriteTuneGroup` | `string` | `WriteTune` | LDAP group granting write access for Tune attributes | -| `Ldap.WriteConfigureGroup` | `string` | `WriteConfigure` | LDAP group granting write access for Configure attributes | -| `Ldap.AlarmAckGroup` | `string` | `AlarmAck` | LDAP group granting alarm acknowledgment | - -#### Permission 
Model - -When LDAP is enabled, LDAP group membership is mapped to OPC UA session role NodeIds during authentication. All authenticated LDAP users can browse and read nodes regardless of group membership. Groups grant additional permissions: - -| LDAP Group | Permission | -|---|---| -| ReadOnly | No additional permissions (read-only access) | -| WriteOperate | Write FreeAccess and Operate attributes | -| WriteTune | Write Tune attributes | -| WriteConfigure | Write Configure attributes | -| AlarmAck | Acknowledge alarms | - -Users can belong to multiple groups. The `admin` user in the default GLAuth configuration belongs to all three groups. - -Write access depends on both the user's role and the Galaxy attribute's security classification. See the [Effective Permission Matrix](Security.md#effective-permission-matrix) in the Security Guide for the full breakdown. - -Example configuration: - -```json -"Authentication": { - "AllowAnonymous": true, - "AnonymousCanWrite": false, - "Ldap": { - "Enabled": true, - "Host": "localhost", - "Port": 3893, - "BaseDN": "dc=lmxopcua,dc=local", - "BindDnTemplate": "cn={username},dc=lmxopcua,dc=local", - "ServiceAccountDn": "cn=serviceaccount,dc=lmxopcua,dc=local", - "ServiceAccountPassword": "serviceaccount123", - "TimeoutSeconds": 5, - "ReadOnlyGroup": "ReadOnly", - "WriteOperateGroup": "WriteOperate", - "WriteTuneGroup": "WriteTune", - "WriteConfigureGroup": "WriteConfigure", - "AlarmAckGroup": "AlarmAck" - } -} -``` - -### Security - -Controls OPC UA transport security profiles and certificate handling. Defined in `SecurityProfileConfiguration`. See [Security Guide](security.md) for detailed usage. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `Profiles` | `List` | `["None"]` | Security profiles to expose. 
Valid: `None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`, `Aes128_Sha256_RsaOaep-Sign`, `Aes128_Sha256_RsaOaep-SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign`, `Aes256_Sha256_RsaPss-SignAndEncrypt` | -| `AutoAcceptClientCertificates` | `bool` | `true` | Auto-accept untrusted client certificates. Set to `false` in production | -| `RejectSHA1Certificates` | `bool` | `true` | Reject client certificates signed with SHA-1 | -| `MinimumCertificateKeySize` | `int` | `2048` | Minimum RSA key size for client certificates | -| `PkiRootPath` | `string?` | `null` | Override for PKI root directory. Defaults to `%LOCALAPPDATA%\OPC Foundation\pki` | -| `CertificateSubject` | `string?` | `null` | Override for server certificate subject. Defaults to `CN={ServerName}, O=ZB MOM, DC=localhost` | - -Example — production deployment with encrypted transport: - -```json -"Security": { - "Profiles": ["Basic256Sha256-SignAndEncrypt"], - "AutoAcceptClientCertificates": false, - "RejectSHA1Certificates": true, - "MinimumCertificateKeySize": 2048 -} -``` - -### Redundancy - -Controls non-transparent OPC UA redundancy. Defined in `RedundancyConfiguration`. See [Redundancy Guide](Redundancy.md) for detailed usage. - -| Property | Type | Default | Description | -|----------|------|---------|-------------| -| `Enabled` | `bool` | `false` | Enables redundancy mode and ServiceLevel computation | -| `Mode` | `string` | `"Warm"` | Redundancy mode: `Warm` or `Hot` | -| `Role` | `string` | `"Primary"` | Instance role: `Primary` (higher ServiceLevel) or `Secondary` | -| `ServerUris` | `List` | `[]` | ApplicationUri values for all servers in the redundant set | -| `ServiceLevelBase` | `int` | `200` | Base ServiceLevel when healthy (1-255). 
Secondary receives base - 50 | - -Example — two-instance redundant pair (Primary): - -```json -"Redundancy": { - "Enabled": true, - "Mode": "Warm", - "Role": "Primary", - "ServerUris": ["urn:localhost:LmxOpcUa:instance1", "urn:localhost:LmxOpcUa:instance2"], - "ServiceLevelBase": 200 -} -``` - -## Feature Flags - -Three boolean properties act as feature flags that control optional subsystems: - -- **`OpcUa.AlarmTrackingEnabled`** -- When `true`, the node manager creates `AlarmConditionState` nodes for alarm attributes and monitors `InAlarm` transitions. Disabled by default because alarm tracking adds per-attribute overhead. -- **`OpcUa.AlarmFilter.ObjectFilters`** -- List of wildcard template-name patterns that scope alarm tracking to matching objects and their descendants. An empty list preserves the current unfiltered behavior; a non-empty list includes an object only when any name in its template derivation chain matches any pattern, then propagates the inclusion to every descendant in the containment hierarchy. `*` is the only wildcard, matching is case-insensitive, and the Galaxy `$` prefix on template names is normalized so operators can write `TestMachine*` instead of `$TestMachine*`. Each list entry may itself contain comma-separated patterns (`"TestMachine*, Pump_*"`) for convenience. When the list is non-empty but `AlarmTrackingEnabled` is `false`, the validator emits a warning because the filter has no effect. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) for the full matching algorithm and telemetry. -- **`Historian.Enabled`** -- When `true`, the service calls `HistorianPluginLoader.TryLoad(config)` to load the `ZB.MOM.WW.OtOpcUa.Historian.Aveva` plugin from the `Historian/` subfolder next to the host exe and registers the resulting `IHistorianDataSource` with the OPC UA server host. 
Disabled by default because not all deployments have a Historian instance -- when disabled the plugin is not probed and the Wonderware SDK DLLs are not required on the host. If the flag is `true` but the plugin or its SDK dependencies cannot be loaded, the server still starts and every history read returns `BadHistoryOperationUnsupported` with a warning in the log. -- **`GalaxyRepository.ExtendedAttributes`** -- When `true`, the repository loads additional Galaxy attribute metadata beyond the core set needed for the address space. Disabled by default to minimize startup query time. -- **`GalaxyRepository.Scope`** -- When set to `LocalPlatform`, the repository filters the hierarchy and attributes to only include objects hosted by the platform whose `node_name` matches this machine (or the explicit `PlatformName` override). Ancestor areas are retained to keep the browse tree connected. Default is `Galaxy` (load everything). See [Galaxy Repository — Platform Scope Filter](GalaxyRepository.md#platform-scope-filter). - -## Configuration Validation - -`ConfigurationValidator.ValidateAndLog()` runs at the start of `OpcUaService.Start()`. 
It logs every resolved configuration value at `Information` level and validates required constraints: - -- `OpcUa.Port` must be between 1 and 65535 -- `OpcUa.GalaxyName` must not be empty -- `MxAccess.ClientName` must not be empty -- `GalaxyRepository.ConnectionString` must not be empty -- `Security.MinimumCertificateKeySize` must be at least 2048 -- Unknown security profile names are logged as warnings -- `AutoAcceptClientCertificates = true` emits a warning -- Only-`None` profile configuration emits a warning -- `OpcUa.AlarmFilter.ObjectFilters` is non-empty while `OpcUa.AlarmTrackingEnabled = false` emits a warning (filter has no effect) -- `Historian.ServerName` (or `Historian.ServerNames`) must not be empty when `Historian.Enabled = true` -- `Historian.FailureCooldownSeconds` must be zero or positive -- `Historian.ServerName` is set alongside a non-empty `Historian.ServerNames` emits a warning (single ServerName is ignored) -- `MxAccess.RuntimeStatusUnknownTimeoutSeconds` below 5s emits a warning (below the reasonable floor for MxAccess initial-resolution latency) -- `OpcUa.ApplicationUri` must be set when `Redundancy.Enabled = true` -- `Redundancy.ServiceLevelBase` must be between 1 and 255 -- `Redundancy.ServerUris` should contain at least 2 entries when enabled -- Local `ApplicationUri` should appear in `Redundancy.ServerUris` - -If validation fails, the service throws `InvalidOperationException` and does not start. - -## Test Constructor Pattern - -`OpcUaService` provides an `internal` constructor that accepts pre-built dependencies instead of loading `appsettings.json`: - -```csharp -internal OpcUaService( - AppConfiguration config, - IMxProxy? mxProxy, - IGalaxyRepository? galaxyRepository, - IMxAccessClient? 
mxAccessClientOverride = null, - bool hasMxAccessClientOverride = false) -``` - -Integration tests use this constructor to inject substitute implementations of `IMxProxy`, `IGalaxyRepository`, and `IMxAccessClient`, bypassing the STA thread, COM interop, and SQL Server dependencies. The `hasMxAccessClientOverride` flag tells the service to use the injected `IMxAccessClient` directly instead of creating one from the `IMxProxy` on the STA thread. - -## Example appsettings.json +Minimal example: ```json { - "OpcUa": { - "BindAddress": "0.0.0.0", - "Port": 4840, - "EndpointPath": "/LmxOpcUa", - "ServerName": "LmxOpcUa", - "GalaxyName": "ZB", - "MaxSessions": 100, - "SessionTimeoutMinutes": 30, - "AlarmTrackingEnabled": false, - "AlarmFilter": { - "ObjectFilters": [] - }, - "ApplicationUri": null + "Serilog": { "MinimumLevel": "Information" }, + "Node": { + "NodeId": "node-dev-a", + "ClusterId": "cluster-dev", + "ConfigDbConnectionString": "Server=localhost,14330;Database=OtOpcUaConfig;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;", + "LocalCachePath": "config_cache.db" }, - "MxAccess": { - "ClientName": "LmxOpcUa", - "NodeName": null, - "GalaxyName": null, - "ReadTimeoutSeconds": 5, - "WriteTimeoutSeconds": 5, - "MaxConcurrentOperations": 10, - "MonitorIntervalSeconds": 5, - "AutoReconnect": true, - "ProbeTag": null, - "ProbeStaleThresholdSeconds": 60, - "RuntimeStatusProbesEnabled": true, - "RuntimeStatusUnknownTimeoutSeconds": 15, - "RequestTimeoutSeconds": 30 - }, - "GalaxyRepository": { - "ConnectionString": "Server=localhost;Database=ZB;Integrated Security=true;", - "ChangeDetectionIntervalSeconds": 30, - "CommandTimeoutSeconds": 30, - "ExtendedAttributes": false, - "Scope": "Galaxy", - "PlatformName": null - }, - "Dashboard": { - "Enabled": true, - "Port": 8081, - "RefreshIntervalSeconds": 10 - }, - "Historian": { - "Enabled": false, - "ServerName": "localhost", - "ServerNames": [], - "FailureCooldownSeconds": 60, - "IntegratedSecurity": 
true, - "UserName": null, - "Password": null, - "Port": 32568, - "CommandTimeoutSeconds": 30, - "RequestTimeoutSeconds": 60, - "MaxValuesPerRead": 10000 - }, - "Authentication": { - "AllowAnonymous": true, - "AnonymousCanWrite": true, - "Ldap": { - "Enabled": false - } - }, - "Security": { - "Profiles": ["None"], - "AutoAcceptClientCertificates": true, - "RejectSHA1Certificates": true, - "MinimumCertificateKeySize": 2048, - "PkiRootPath": null, - "CertificateSubject": null - }, - "Redundancy": { - "Enabled": false, - "Mode": "Warm", - "Role": "Primary", - "ServerUris": [], - "ServiceLevelBase": 200 + "OpcUaServer": { + "EndpointUrl": "opc.tcp://0.0.0.0:4840/OtOpcUa", + "ApplicationUri": "urn:node-dev-a:OtOpcUa", + "SecurityProfile": "None", + "AutoAcceptUntrustedClientCertificates": true, + "Ldap": { "Enabled": false } } } ``` + +### OtOpcUa Admin — `src/ZB.MOM.WW.OtOpcUa.Admin/appsettings.json` + +| Section | Purpose | +|---|---| +| `ConnectionStrings:ConfigDb` | SQL connection string — must point at the same Config DB every Server reaches. | +| `Authentication:Ldap` | LDAP bind for the Admin login form (same options shape as the Server's `OpcUaServer:Ldap`). | +| `CertTrust` | `CertTrustOptions` — file-system path under the Server's `PkiStoreRoot` so the Admin Certificates page can promote rejected client certs. | +| `Metrics:Prometheus:Enabled` | Toggles the `/metrics` scrape endpoint (default true). | +| `Serilog` | Logging. | + +### Galaxy.Host + +Environment-variable driven (`OTOPCUA_GALAXY_PIPE`, `OTOPCUA_ALLOWED_SID`, `OTOPCUA_GALAXY_SECRET`, `OTOPCUA_GALAXY_BACKEND`, `OTOPCUA_GALAXY_ZB_CONN`, `OTOPCUA_HISTORIAN_*`). No `appsettings.json` — the supervisor owns the launch environment. See [`ServiceHosting.md`](ServiceHosting.md#galaxyhost-process). + +### Environment overrides + +Standard .NET config layering applies: `appsettings.{Environment}.json`, then environment variables with `Section__Property` naming. 
`DOTNET_ENVIRONMENT` (or `ASPNETCORE_ENVIRONMENT` for Admin) selects the overlay. + +--- + +## Authoritative configuration (Config DB) + +The Config DB is the single source of truth for every setting that a v1 deployment used to carry in `appsettings.json` as driver-specific state. `OtOpcUaConfigDbContext` (`src/ZB.MOM.WW.OtOpcUa.Configuration/OtOpcUaConfigDbContext.cs`) is the EF Core context used by both the Admin writer and every Server reader. + +### Top-level sections operators touch + +| Concept | Entity | Admin UI surface | Purpose | +|---|---|---|---| +| Cluster | `ServerCluster` | Clusters pages | Fleet unit; owns nodes, generations, UNS, ACLs. | +| Cluster node | `ClusterNode` + `ClusterNodeCredential` | RedundancyTab, Hosts page | Per-node identity, `RedundancyRole`, `ServiceLevelBase`, ApplicationUri, service-account credentials. | +| Generation | `ConfigGeneration` + `ClusterNodeGenerationState` | Generations / DiffViewer | Append-only; draft → publish workflow (`sp_PublishGeneration`). | +| Namespace | `Namespace` | Namespaces tab | Per-cluster OPC UA namespace; `Kind` = Equipment / SystemPlatform / Simulated. | +| Driver instance | `DriverInstance` | Drivers tab | Configured driver (Modbus, S7, OpcUaClient, Galaxy, …) + `DriverConfig` JSON + resilience profile. | +| Device | `Device` | Under each driver instance | Per-host settings inside a driver instance (IP, port, unit-id…). | +| UNS hierarchy | `UnsArea` + `UnsLine` | UnsTab (drag/drop) | L3 / L4 of the unified namespace. | +| Equipment | `Equipment` | Equipment pages, CSV import | L5; carries `MachineCode`, `ZTag`, `SAPID`, `EquipmentUuid`, reservation-backed external ids. | +| Tag | `Tag` | Under each equipment | Driver-specific tag address + `SecurityClassification` + poll-group assignment. | +| Poll group | `PollGroup` | Driver-scoped | Poll cadence buckets; `PollGroupEngine` in Core.Abstractions uses this at runtime. 
| +| ACL | `NodeAcl` | AclsTab + Probe dialog | Per-level permission grants, additive only. See [`security.md`](security.md#data-plane-authorization). | +| Role grant | `LdapGroupRoleMapping` | RoleGrants page | Maps LDAP groups → Admin roles (`ConfigViewer` / `ConfigEditor` / `FleetAdmin`). | +| External id reservation | `ExternalIdReservation` | Reservations page | Reservation-backed `ZTag` and `SAPID` uniqueness. | +| Equipment import batch | `EquipmentImportBatch` | CSV import flow | Staged bulk-add with validation preview. | +| Audit log | `ConfigAuditLog` | Audit page | Append-only record of every publish, rollback, credential rotation, role-grant change. | + +### Draft → publish generation model + +All edits go into a **draft** generation scoped to one cluster. `DraftValidationService` checks invariants (same-cluster FKs, reservation collisions, UNS path consistency, ACL scope validity). When the operator clicks Publish, `sp_PublishGeneration` atomically promotes the draft, records the audit event, and causes every `RedundancyCoordinator.RefreshAsync` in the affected cluster to pick up the new topology + ACL set. The Admin UI `DiffViewer` shows exactly what's changing before publish. + +Old generations are retained; rollback is "publish older generation as new". `ConfigAuditLog` makes every change auditable by principal + timestamp. + +### Offline cache + +Each Server process caches the last-seen published generation in `Node:LocalCachePath` via LiteDB (`LiteDbConfigCache` in `src/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/`). The cache lets a node start without the central DB reachable; once the DB comes back, `NodeBootstrap` syncs to the current generation. + +### Full schema reference + +For table columns, indexes, stored procedures, the publish-transaction semantics, and the SQL authorization model (per-node SQL principals + `SESSION_CONTEXT` cluster binding), see [`docs/v2/config-db-schema.md`](v2/config-db-schema.md). 
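+
+A Server-side reader resolves its effective configuration along these lines (a hedged sketch; the `DbSet` and property names `ConfigGenerations`, `IsPublished`, and `PublishedAtUtc` are illustrative assumptions, not the actual `OtOpcUaConfigDbContext` surface):
+
+```csharp
+// Hypothetical sketch: load the latest published generation for this node's cluster.
+// DbSet/property names here are assumptions; only OtOpcUaConfigDbContext is real.
+await using var db = new OtOpcUaConfigDbContext(options);
+var generation = await db.ConfigGenerations
+    .Where(g => g.ClusterId == clusterId && g.IsPublished)
+    .OrderByDescending(g => g.PublishedAtUtc)
+    .FirstAsync(cancellationToken);
+// Driver instances, tags, ACLs, and UNS rows are then read within that generation's scope.
+```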
+ +### Admin UI flow + +For the draft editor, DiffViewer, CSV import, IdentificationFields, RedundancyTab, AclsTab + Probe-this-permission, RoleGrants, and the SignalR real-time surface, see [`docs/v2/admin-ui.md`](v2/admin-ui.md). + +--- + +## Where did v1 appsettings sections go? + +Quick index for operators coming from v1 LmxOpcUa: + +| v1 appsettings section | v2 home | +|---|---| +| `OpcUa.Port` / `BindAddress` / `EndpointPath` / `ServerName` | Bootstrap `OpcUaServer:EndpointUrl` + `ApplicationName`. | +| `OpcUa.ApplicationUri` | Config DB `ClusterNode.ApplicationUri`. | +| `OpcUa.MaxSessions` / `SessionTimeoutMinutes` | Bootstrap `OpcUaServer:*` (if exposed) or stack defaults. | +| `OpcUa.AlarmTrackingEnabled` / `AlarmFilter` | Per driver instance in Config DB (alarm surface is capability-driven per `IAlarmSource`). | +| `MxAccess.*` | Galaxy driver instance `DriverConfig` JSON + Galaxy.Host env vars (see [`ServiceHosting.md`](ServiceHosting.md#galaxyhost-process)). | +| `GalaxyRepository.*` | Galaxy driver instance `DriverConfig` JSON + `OTOPCUA_GALAXY_ZB_CONN` env var. | +| `Dashboard.*` | Retired — Admin UI replaces the dashboard. See [`StatusDashboard.md`](StatusDashboard.md). | +| `Historian.*` | Galaxy driver instance `DriverConfig` JSON + `OTOPCUA_HISTORIAN_*` env vars. | +| `Authentication.Ldap.*` | Bootstrap `OpcUaServer:Ldap` (same shape) + Admin `Authentication:Ldap` for the UI login. | +| `Security.*` | Bootstrap `OpcUaServer:SecurityProfile` + `PkiStoreRoot` + `AutoAcceptUntrustedClientCertificates`. | +| `Redundancy.*` | Config DB `ClusterNode.RedundancyRole` + `ServiceLevelBase`. | + +--- + +## Validation + +- **Bootstrap**: the process fails fast on missing required keys in `Program.cs` (e.g. `Node:NodeId`, `Node:ClusterId`, `Node:ConfigDbConnectionString` all throw `InvalidOperationException` if unset). 
+- **Authoritative**: `DraftValidationService` runs on every save; `sp_ValidateDraft` runs as part of `sp_PublishGeneration` so an invalid draft cannot reach any node. diff --git a/docs/Redundancy.md b/docs/Redundancy.md index f78a971..91ea62c 100644 --- a/docs/Redundancy.md +++ b/docs/Redundancy.md @@ -2,189 +2,102 @@ ## Overview -LmxOpcUa supports OPC UA **non-transparent redundancy** in Warm or Hot mode. In a non-transparent redundancy deployment, two independent server instances run side by side. Both connect to the same Galaxy repository database and the same MXAccess runtime, but each maintains its own OPC UA sessions and subscriptions. Clients discover the redundant set through the `ServerUriArray` exposed in each server's address space and are responsible for managing failover between the two endpoints. +OtOpcUa supports OPC UA **non-transparent** warm/hot redundancy. Two (or more) OtOpcUa Server processes run side-by-side, share the same Config DB, the same driver backends (Galaxy ZB, MXAccess runtime, remote PLCs), and advertise the same OPC UA node tree. Each process owns a distinct `ApplicationUri`; OPC UA clients see both endpoints via the standard `ServerUriArray` and pick one based on the `ServiceLevel` that each server publishes. -When redundancy is disabled (the default), the server reports `RedundancySupport.None` and a fixed `ServiceLevel` of 255. +The redundancy surface lives in `src/ZB.MOM.WW.OtOpcUa.Server/Redundancy/`: -## Namespace vs Application Identity - -Both servers in the redundant set share the same **namespace URI** so that clients see identical node IDs regardless of which instance they are connected to. The namespace URI follows the pattern `urn:{GalaxyName}:LmxOpcUa` (e.g., `urn:ZB:LmxOpcUa`). - -The **ApplicationUri**, on the other hand, must be unique per instance. This is how the OPC UA stack and clients distinguish one server from the other within the redundant set. 
Each instance sets its own ApplicationUri via the `OpcUa.ApplicationUri` configuration property (e.g., `urn:localhost:LmxOpcUa:instance1` and `urn:localhost:LmxOpcUa:instance2`). - -When redundancy is disabled, `ApplicationUri` defaults to `urn:{GalaxyName}:LmxOpcUa` if left null. - -## Configuration - -### Redundancy Section - -| Property | Type | Default | Description | -|---|---|---|---| -| `Enabled` | bool | `false` | Enables non-transparent redundancy. When false, the server reports `RedundancySupport.None` and `ServiceLevel = 255`. | -| `Mode` | string | `"Warm"` | The redundancy mode advertised to clients. Valid values: `Warm`, `Hot`. | -| `Role` | string | `"Primary"` | This instance's role in the redundant pair. Valid values: `Primary`, `Secondary`. The Primary advertises a higher ServiceLevel than the Secondary when both are healthy. | -| `ServerUris` | string[] | `[]` | The ApplicationUri values of all servers in the redundant set. Must include this instance's own `OpcUa.ApplicationUri`. Should contain at least 2 entries. | -| `ServiceLevelBase` | int | `200` | The base ServiceLevel when the server is fully healthy. Valid range: 1-255. The Secondary automatically receives `ServiceLevelBase - 50`. | - -### OpcUa.ApplicationUri - -| Property | Type | Default | Description | -|---|---|---|---| -| `ApplicationUri` | string | `null` | Explicit application URI for this server instance. When null, defaults to `urn:{GalaxyName}:LmxOpcUa`. **Required when redundancy is enabled** -- each instance needs a unique identity. | - -## ServiceLevel Computation - -ServiceLevel is a standard OPC UA diagnostic value (0-255) that indicates server health. Clients in a redundant deployment should prefer the server advertising the highest ServiceLevel. 
- -**Baseline values:** - -| Role | Baseline | +| Class | Role | |---|---| -| Primary | `ServiceLevelBase` (default 200) | -| Secondary | `ServiceLevelBase - 50` (default 150) | +| `RedundancyCoordinator` | Process-singleton; owns the current `RedundancyTopology` loaded from the `ClusterNode` table. `RefreshAsync` re-reads after `sp_PublishGeneration` so operator role swaps take effect without a process restart. CAS-style swap (`Interlocked.Exchange`) means readers always see a coherent snapshot. | +| `RedundancyTopology` | Immutable `(ClusterId, Self, Peers, ServerUriArray, ValidityFlags)` snapshot. | +| `ApplyLeaseRegistry` | Tracks in-progress `sp_PublishGeneration` apply leases keyed on `(ConfigGenerationId, PublishRequestId)`. `await using` the disposable scope guarantees every exit path (success / exception / cancellation) decrements the lease; a stale-lease watchdog force-closes any lease older than `ApplyMaxDuration` (default 10 minutes) so a crashed publisher can't pin the node at `PrimaryMidApply`. | +| `PeerReachabilityTracker` | Maintains last-known reachability for each peer node over two independent probes — OPC UA ping and HTTP `/healthz`. Both must succeed for `peerReachable = true`. | +| `RecoveryStateManager` | Gates transitions out of the `Recovering*` bands; requires dwell + publish-witness satisfaction before allowing a return to nominal. | +| `ServiceLevelCalculator` | Pure function `(role, selfHealthy, peerUa, peerHttp, applyInProgress, recoveryDwellMet, topologyValid, operatorMaintenance) → byte`. | +| `RedundancyStatePublisher` | Orchestrates inputs into the calculator, pushes the resulting byte to the OPC UA `ServiceLevel` variable via an edge-triggered `OnStateChanged` event, and fires `OnServerUriArrayChanged` when the topology's `ServerUriArray` shifts. 
| -**Penalties applied to the baseline:** +## Data model -| Condition | Penalty | +Per-node redundancy state lives in the Config DB `ClusterNode` table (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/ClusterNode.cs`): + +| Column | Role | |---|---| -| MXAccess disconnected | -100 | -| Galaxy DB unreachable | -50 | -| Both MXAccess and DB down | ServiceLevel forced to 0 | +| `NodeId` | Unique node identity; matches `Node:NodeId` in the server's bootstrap `appsettings.json`. | +| `ClusterId` | Foreign key into `ServerCluster`. | +| `RedundancyRole` | `Primary`, `Secondary`, or `Standalone` (`RedundancyRole` enum in `Configuration/Enums`). | +| `ServiceLevelBase` | Per-node base value used to bias nominal ServiceLevel output. | +| `ApplicationUri` | Unique-per-node OPC UA ApplicationUri advertised in endpoint descriptions. | -The final value is clamped to the range 0-255. +`ServerUriArray` is derived from the set of peer `ApplicationUri` values at topology-load time and republished when the topology changes. -**Examples (with default ServiceLevelBase = 200):** +## ServiceLevel matrix -| Scenario | Primary | Secondary | +`ServiceLevelCalculator` produces one of the following bands (see `ServiceLevelBand` enum in the same file): + +| Band | Byte | Meaning | |---|---|---| -| Both healthy | 200 | 150 | -| MXAccess down | 100 | 50 | -| DB down | 150 | 100 | -| Both down | 0 | 0 | +| `Maintenance` | 0 | Operator-declared maintenance. | +| `NoData` | 1 | Self-reported unhealthy (`/healthz` fails). | +| `InvalidTopology` | 2 | More than one Primary detected; both nodes self-demote. | +| `RecoveringBackup` | 30 | Backup post-fault, dwell not met. | +| `BackupMidApply` | 50 | Backup inside a publish-apply window. | +| `IsolatedBackup` | 80 | Primary unreachable; Backup says "take over if asked" — does **not** auto-promote (non-transparent model). | +| `AuthoritativeBackup` | 100 | Backup nominal. | +| `RecoveringPrimary` | 180 | Primary post-fault, dwell not met. 
| +| `PrimaryMidApply` | 200 | Primary inside a publish-apply window. | +| `IsolatedPrimary` | 230 | Primary with unreachable peer, retains authority. | +| `AuthoritativePrimary` | 255 | Primary nominal. | -## Two-Instance Deployment +The reserved bands (0 Maintenance, 1 NoData, 2 InvalidTopology) take precedence over operational states per OPC UA Part 5 §6.3.34. Operational values occupy 3..255 so spec-compliant clients that treat "<3 = unhealthy" keep working. -When deploying a redundant pair, the following configuration properties must differ between the two instances. All other settings (GalaxyName, ConnectionString, etc.) are shared. +Standalone nodes (single-instance deployments) report `AuthoritativePrimary` when healthy and `PrimaryMidApply` during publish. -| Property | Instance 1 (Primary) | Instance 2 (Secondary) | -|---|---|---| -| `OpcUa.Port` | 4840 | 4841 | -| `OpcUa.ServerName` | `LmxOpcUa-1` | `LmxOpcUa-2` | -| `OpcUa.ApplicationUri` | `urn:localhost:LmxOpcUa:instance1` | `urn:localhost:LmxOpcUa:instance2` | -| `Dashboard.Port` | 8081 | 8082 | -| `MxAccess.ClientName` | `LmxOpcUa-1` | `LmxOpcUa-2` | -| `Redundancy.Role` | `Primary` | `Secondary` | +## Publish fencing and split-brain prevention -### Instance 1 -- Primary (appsettings.json) +Any Admin-triggered `sp_PublishGeneration` acquires an apply lease through `ApplyLeaseRegistry.BeginApplyLease`.
While the lease is held: -```json -{ - "OpcUa": { - "Port": 4840, - "ServerName": "LmxOpcUa-1", - "GalaxyName": "ZB", - "ApplicationUri": "urn:localhost:LmxOpcUa:instance1" - }, - "MxAccess": { - "ClientName": "LmxOpcUa-1" - }, - "Dashboard": { - "Port": 8081 - }, - "Redundancy": { - "Enabled": true, - "Mode": "Warm", - "Role": "Primary", - "ServerUris": [ - "urn:localhost:LmxOpcUa:instance1", - "urn:localhost:LmxOpcUa:instance2" - ], - "ServiceLevelBase": 200 - } -} -``` +- The calculator reports `PrimaryMidApply` / `BackupMidApply` — clients see the band shift and cut over to the unaffected peer rather than racing against a half-applied generation. +- `RedundancyCoordinator.RefreshAsync` is called at the end of the apply window so the post-publish topology becomes visible exactly once, atomically. +- The watchdog force-closes any lease older than `ApplyMaxDuration`; a stuck publisher therefore cannot strand a node at `PrimaryMidApply`. -### Instance 2 -- Secondary (appsettings.json) +Because role transitions are **operator-driven** (write `RedundancyRole` in the Config DB + publish), the Backup never auto-promotes. An `IsolatedBackup` at 80 is the signal that the operator should intervene; auto-failover is intentionally out of scope for the non-transparent model (decision #154). -```json -{ - "OpcUa": { - "Port": 4841, - "ServerName": "LmxOpcUa-2", - "GalaxyName": "ZB", - "ApplicationUri": "urn:localhost:LmxOpcUa:instance2" - }, - "MxAccess": { - "ClientName": "LmxOpcUa-2" - }, - "Dashboard": { - "Port": 8082 - }, - "Redundancy": { - "Enabled": true, - "Mode": "Warm", - "Role": "Secondary", - "ServerUris": [ - "urn:localhost:LmxOpcUa:instance1", - "urn:localhost:LmxOpcUa:instance2" - ], - "ServiceLevelBase": 200 - } -} -``` +## Metrics -## CLI `redundancy` Command +`RedundancyMetrics` in `src/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs` registers the `ZB.MOM.WW.OtOpcUa.Redundancy` meter on the Admin process. 
Instruments: -The Client CLI includes a `redundancy` command that reads the redundancy state from a running server. +| Name | Kind | Tags | Description | +|---|---|---|---| +| `otopcua.redundancy.role_transition` | Counter | `cluster.id`, `node.id`, `from_role`, `to_role` | Incremented every time `FleetStatusPoller` observes a `RedundancyRole` change on a `ClusterNode` row. | +| `otopcua.redundancy.primary_count` | ObservableGauge | `cluster.id` | Primary-role nodes per cluster — should be exactly 1 in nominal state. | +| `otopcua.redundancy.secondary_count` | ObservableGauge | `cluster.id` | Secondary-role nodes per cluster. | +| `otopcua.redundancy.stale_count` | ObservableGauge | `cluster.id` | Nodes whose `LastSeenAt` exceeded the stale threshold. | -```bash -dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u opc.tcp://localhost:4840/LmxOpcUa -dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- redundancy -u opc.tcp://localhost:4841/LmxOpcUa -``` +Admin `Program.cs` wires OpenTelemetry to the Prometheus exporter when `Metrics:Prometheus:Enabled=true` (default), exposing the meter under `/metrics`. The endpoint is intentionally unauthenticated — fleet conventions put it behind a reverse-proxy basic-auth gate if needed. 
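+
+As a monitoring example, the gauges above support a missing-primary / split-brain alert. The rule below is a hedged sketch; it assumes the exporter's usual dot-to-underscore flattening of metric and tag names:
+
+```yaml
+# Hypothetical Prometheus alert rule; metric/label spellings are assumptions.
+groups:
+  - name: otopcua-redundancy
+    rules:
+      - alert: OtOpcUaPrimaryCountNotOne
+        expr: otopcua_redundancy_primary_count != 1
+        for: 5m
+        annotations:
+          summary: "Cluster {{ $labels.cluster_id }} has {{ $value }} Primary nodes (expected exactly 1)."
+```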
-The command reads the following standard OPC UA nodes and displays their values: +## Real-time notifications (Admin UI) -- **Redundancy Mode** -- from `Server_ServerRedundancy_RedundancySupport` (None, Warm, or Hot) -- **Service Level** -- from `Server_ServiceLevel` (0-255) -- **Server URIs** -- from `Server_ServerRedundancy_ServerUriArray` (list of ApplicationUri values in the redundant set) -- **Application URI** -- from `Server_ServerArray` (this instance's ApplicationUri) +`FleetStatusPoller` in `src/ZB.MOM.WW.OtOpcUa.Admin/Hubs/` polls the `ClusterNode` table, records role transitions, updates `RedundancyMetrics.SetClusterCounts`, and pushes a `RoleChanged` SignalR event onto `FleetStatusHub` when a transition is observed. `RedundancyTab.razor` subscribes with `_hub.On("RoleChanged", …)` so connected Admin sessions see role swaps the moment they happen. -Example output for a healthy Primary: +## Configuring a redundant pair -``` -Redundancy Mode: Warm -Service Level: 200 -Server URIs: - - urn:localhost:LmxOpcUa:instance1 - - urn:localhost:LmxOpcUa:instance2 -Application URI: urn:localhost:LmxOpcUa:instance1 -``` +Redundancy is configured **in the Config DB, not appsettings.json**. The fields that must differ between the two instances: -The command also supports `--username`/`--password` and `--security` options for authenticated or encrypted connections. +| Field | Location | Instance 1 | Instance 2 | +|---|---|---|---| +| `NodeId` | `appsettings.json` `Node:NodeId` (bootstrap) | `node-a` | `node-b` | +| `ClusterNode.ApplicationUri` | Config DB | `urn:node-a:OtOpcUa` | `urn:node-b:OtOpcUa` | +| `ClusterNode.RedundancyRole` | Config DB | `Primary` | `Secondary` | +| `ClusterNode.ServiceLevelBase` | Config DB | typically 255 | typically 100 | -### Client Failover with `-F` +Shared between instances: `ClusterId`, Config DB connection string, published generation, cluster-level ACLs, UNS hierarchy, driver instances. 
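+
+How these fields feed the ServiceLevel matrix can be sketched as a decision ladder. This is illustrative only: the byte constants mirror the band table earlier in this document, but the real branch order, and any biasing by `ClusterNode.ServiceLevelBase`, live in `ServiceLevelCalculator.cs`:
+
+```csharp
+// Illustrative sketch of the band selection; not the real implementation.
+static byte ComputeServiceLevel(
+    RedundancyRole role, bool selfHealthy, bool peerUaReachable, bool peerHttpReachable,
+    bool applyInProgress, bool recoveryDwellMet, bool topologyValid, bool operatorMaintenance)
+{
+    if (operatorMaintenance) return 0;                                // Maintenance
+    if (!selfHealthy) return 1;                                       // NoData
+    if (!topologyValid) return 2;                                     // InvalidTopology (e.g. dual Primary)
+
+    bool primary = role != RedundancyRole.Secondary;                  // Primary or Standalone
+    bool peerOk = peerUaReachable && peerHttpReachable;               // both probes must pass
+
+    if (applyInProgress) return primary ? (byte)200 : (byte)50;       // PrimaryMidApply / BackupMidApply
+    if (!recoveryDwellMet) return primary ? (byte)180 : (byte)30;     // RecoveringPrimary / RecoveringBackup
+    if (!peerOk && role != RedundancyRole.Standalone)
+        return primary ? (byte)230 : (byte)80;                        // IsolatedPrimary / IsolatedBackup
+    return primary ? (byte)255 : (byte)100;                           // AuthoritativePrimary / AuthoritativeBackup
+}
+```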
-All CLI commands support the `-F` / `--failover-urls` flag for automatic client-side failover. When provided, the CLI tries the primary endpoint first and falls back to the listed URLs if the primary is unreachable. +Role swaps, stand-alone promotions, and base-level adjustments all happen through the Admin UI `RedundancyTab` — the operator edits the `ClusterNode` row in a draft generation and publishes. `RedundancyCoordinator.RefreshAsync` picks up the new topology without a process restart. -```bash -# Connect with failover — uses secondary if primary is down -dotnet run -- connect -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa +## Client-side failover -# Subscribe with live failover — reconnects to secondary if primary drops mid-stream -dotnet run -- subscribe -u opc.tcp://localhost:4840/LmxOpcUa -F opc.tcp://localhost:4841/LmxOpcUa \ - -n "ns=1;s=TestMachine_001.MachineID" -``` +The OtOpcUa Client CLI at `src/ZB.MOM.WW.OtOpcUa.Client.CLI` supports `-F` / `--failover-urls` for automatic client-side failover; for long-running subscriptions the CLI monitors session KeepAlive and reconnects to the next available server, recreating the subscription on the new endpoint. See [`Client.CLI.md`](Client.CLI.md) for the command reference. -For long-running commands (`subscribe`), the CLI monitors the session KeepAlive and automatically reconnects to the next available server when the current session drops. The subscription is re-created on the new server. +## Depth reference -## Troubleshooting - -**Mismatched ServerUris between instances** -- Both instances must list the exact same set of ApplicationUri values in `Redundancy.ServerUris`. If they differ, clients may not discover the full redundant set. Check the startup log for the `Redundancy.ServerUris` line on each instance. - -**ServiceLevel stuck at 255** -- This indicates redundancy is not enabled. 
When `Redundancy.Enabled` is false (the default), the server always reports `ServiceLevel = 255` and `RedundancySupport.None`. Verify that `Redundancy.Enabled` is set to `true` in the configuration and that the configuration section is correctly bound. - -**ApplicationUri not set** -- The configuration validator rejects startup when redundancy is enabled but `OpcUa.ApplicationUri` is null or empty. Each instance must have a unique ApplicationUri. Check the error log for: `OpcUa.ApplicationUri must be set when redundancy is enabled`. - -**Both servers report the same ServiceLevel** -- Verify that one instance has `Redundancy.Role` set to `Primary` and the other to `Secondary`. Both set to `Primary` (or both to `Secondary`) will produce identical baseline values, preventing clients from distinguishing the preferred server. - -**ServerUriArray not readable** -- When `RedundancySupport` is `None` (redundancy disabled), the OPC UA SDK may not expose the `ServerUriArray` node or it may return an empty value. The CLI `redundancy` command handles this gracefully by catching the read error. Enable redundancy to populate this array. +For the full decision trail and implementation plan — topology invariants, peer-probe cadence, recovery-dwell policy, compliance-script guard against enum-value drift — see `docs/v2/plan.md` §Phase 6.3. diff --git a/docs/ServiceHosting.md b/docs/ServiceHosting.md index f9b7735..833ca27 100644 --- a/docs/ServiceHosting.md +++ b/docs/ServiceHosting.md @@ -2,189 +2,132 @@ ## Overview -The service runs as a Windows service or console application using TopShelf for lifecycle management. It targets .NET Framework 4.8 with an x86 (32-bit) platform target, which is required for MXAccess COM interop with the ArchestrA runtime DLLs. 
+A production OtOpcUa deployment runs **three processes**, each with a distinct runtime, platform target, and install surface: -## TopShelf Configuration +| Process | Project | Runtime | Platform | Responsibility | +|---|---|---|---|---| +| **OtOpcUa Server** | `src/ZB.MOM.WW.OtOpcUa.Server` | .NET 10 | x64 | Hosts the OPC UA endpoint; loads every non-Galaxy driver in-process; exposes `/healthz`. | +| **OtOpcUa Admin** | `src/ZB.MOM.WW.OtOpcUa.Admin` | .NET 10 (ASP.NET Core / Blazor Server) | x64 | Operator UI for Config DB editing + fleet status, SignalR hubs (`FleetStatusHub`, `AlertHub`), Prometheus `/metrics`. | +| **OtOpcUa Galaxy.Host** | `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host` | .NET Framework 4.8 | x86 (32-bit) | Hosts MXAccess COM on a dedicated STA thread with a Win32 message pump; exposes a named-pipe IPC surface consumed by `Driver.Galaxy.Proxy` inside the Server process. | -`Program.Main()` configures TopShelf to manage the `OpcUaService` lifecycle: +The x86 / .NET Framework 4.8 constraint applies **only** to Galaxy.Host because the MXAccess toolkit DLLs (`Program Files (x86)\ArchestrA\Framework\bin`) are 32-bit-only COM. Every other driver (Modbus, S7, OpcUaClient, AbCip, AbLegacy, TwinCAT, FOCAS) runs in-process in the 64-bit Server. 
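+
+A quick node-level sanity check that all three processes are installed (hedged: only `OtOpcUa` is a service name confirmed by this doc; the Admin service name below is an assumption, and `OtOpcUaGalaxyHost` is the dev host name noted later in this page):
+
+```bash
+sc query OtOpcUa                  # Server: generic-host Windows service
+sc query OtOpcUaAdmin             # Admin: service name is an assumption; may run behind IIS instead
+nssm status OtOpcUaGalaxyHost     # Galaxy.Host under NSSM
+```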
+ +## Server process + +`src/ZB.MOM.WW.OtOpcUa.Server/Program.cs` uses the generic host: ```csharp -var exitCode = HostFactory.Run(host => -{ - host.UseSerilog(); - - host.Service<OpcUaService>(svc => - { - svc.ConstructUsing(() => new OpcUaService()); - svc.WhenStarted(s => s.Start()); - svc.WhenStopped(s => s.Stop()); - }); - - host.SetServiceName("LmxOpcUa"); - host.SetDisplayName("LMX OPC UA Server"); - host.SetDescription("OPC UA server exposing System Platform Galaxy tags via MXAccess."); - host.RunAsLocalSystem(); - host.StartAutomatically(); -}); +var builder = Host.CreateApplicationBuilder(args); +builder.Services.AddSerilog(); +builder.Services.AddWindowsService(o => o.ServiceName = "OtOpcUa"); +… +builder.Services.AddHostedService<OpcUaServerService>(); +builder.Services.AddHostedService<HostStatusPublisher>(); ``` -TopShelf provides these deployment modes from the same executable: `OpcUaServerService` is a `BackgroundService` (decision #30 — TopShelf from v1 was replaced by the generic-host `AddWindowsService` wrapper; no TopShelf dependency remains in any csproj). It owns: -| Command | Description | -|---------|-------------| -| `OtOpcUa.Host.exe` | Run as a console application (foreground) | -| `OtOpcUa.Host.exe install` | Install as a Windows service | -| `OtOpcUa.Host.exe uninstall` | Remove the Windows service | -| `OtOpcUa.Host.exe start` | Start the installed service | -| `OtOpcUa.Host.exe stop` | Stop the installed service | +1. Config bootstrap — reads `Node:NodeId`, `Node:ClusterId`, `Node:ConfigDbConnectionString`, `Node:LocalCachePath` from `appsettings.json`. +2. `NodeBootstrap` — pulls the latest published generation from the Config DB into the LiteDB local cache (`LiteDbConfigCache`) so the node starts even if the central DB is briefly unreachable. +3. `DriverHost` — instantiates configured driver instances from the generation, wires each through `CapabilityInvoker` resilience pipelines. +4.
`OpcUaApplicationHost` — builds the OPC UA endpoint, applies `OpcUaServerOptions` + `LdapOptions`, registers `AuthorizationGate` at dispatch. +5. `HostStatusPublisher` — a second hosted service that heartbeats `DriverHostStatus` rows so the Admin UI Fleet view sees the node. -The service is configured to run as `LocalSystem` and start automatically on boot. +### Installation -## Working Directory +Same executable, different modes driven by the .NET generic-host `AddWindowsService` wrapper: -Before configuring Serilog, `Program.Main()` sets the working directory to the executable's location: +| Mode | Invocation | +|---|---| +| Console | `ZB.MOM.WW.OtOpcUa.Server.exe` | +| Install as Windows service | `sc create OtOpcUa binPath="C:\Program Files\OtOpcUa\Server\ZB.MOM.WW.OtOpcUa.Server.exe" start=auto` | +| Start | `sc start OtOpcUa` | +| Stop | `sc stop OtOpcUa` | +| Uninstall | `sc delete OtOpcUa` | -```csharp -Environment.CurrentDirectory = AppDomain.CurrentDomain.BaseDirectory; -``` +### Health endpoints -This is necessary because Windows services default their working directory to `System32`, which would cause relative log paths and `appsettings.json` to resolve incorrectly. +The Server exposes `/healthz` + `/readyz` used by (a) the Admin `FleetStatusPoller` as input to Fleet status and (b) `PeerReachabilityTracker` in a peer Server process as the HTTP side of the peer-reachability probe. -## Startup Sequence +## Admin process -`OpcUaService.Start()` executes the following steps in order. If any required step fails, the service logs the error and throws, preventing a partially initialized state. +`src/ZB.MOM.WW.OtOpcUa.Admin/Program.cs` is a stock `WebApplication`. Highlights: -1. **Load configuration** -- The production constructor reads `appsettings.json`, optional environment overlay, and environment variables, then binds each section to its typed configuration class. -2. 
**Validate configuration** -- `ConfigurationValidator.ValidateAndLog()` logs all resolved values and checks required constraints (port range, non-empty names and connection strings). If validation fails, the service throws `InvalidOperationException`. -3. **Register exception handler** -- Registers `AppDomain.CurrentDomain.UnhandledException` to log fatal unhandled exceptions with `IsTerminating` context. -4. **Create performance metrics** -- Creates the `PerformanceMetrics` instance and a `CancellationTokenSource` for coordinating shutdown. -5. **Create and connect MXAccess client** -- Starts the STA COM thread, creates the `MxAccessClient`, and attempts an initial connection. If the connection fails, the service logs a warning and continues -- the monitor loop will retry in the background. -6. **Start MXAccess monitor** -- Starts the connectivity monitor loop that probes the runtime connection at the configured interval and handles auto-reconnect. -7. **Test Galaxy repository connection** -- Calls `TestConnectionAsync()` on the Galaxy repository to verify the SQL Server database is reachable. If it fails, the service continues without initial address-space data. -8. **Create OPC UA server host** -- Creates `OpcUaServerHost` with the effective MXAccess client (real, override, or null fallback), performance metrics, and an optional `IHistorianDataSource` obtained from `HistorianPluginLoader.TryLoad` when `Historian.Enabled=true` (returns `null` if the plugin is absent or fails to load). -9. **Query Galaxy hierarchy** -- Fetches the object hierarchy and attribute definitions from the Galaxy repository database, recording object and attribute counts. -10. **Start server and build address space** -- Starts the OPC UA server, retrieves the `LmxNodeManager`, and calls `BuildAddressSpace()` with the queried hierarchy and attributes. If the query or build fails, the server still starts with an empty address space. -11. 
**Start change detection** -- Creates and starts `ChangeDetectionService`, which polls `galaxy.time_of_last_deploy` at the configured interval. When a change is detected, it triggers an address-space rebuild via the `OnGalaxyChanged` event. -12. **Start status dashboard** -- Creates the `HealthCheckService` and `StatusReportService`, wires in all live components, and starts the `StatusWebServer` HTTP listener if the dashboard is enabled. If `StatusWebServer.Start()` returns `false` (port already bound, insufficient permissions, etc.), the service logs a warning, disposes the unstarted instance, sets `OpcUaService.DashboardStartFailed = true`, and continues in degraded mode. Matches the warning-continue policy applied to MxAccess connect, Galaxy DB connect, and initial address space build. Stability review 2026-04-13 Finding 2. -13. **Log startup complete** -- Logs "LmxOpcUa service started successfully" at `Information` level. +- Cookie auth (`CookieAuthenticationDefaults`, scheme name `OtOpcUa.Admin`) + Blazor Server (`AddInteractiveServerComponents`) + SignalR. +- Authorization policies gated by `AdminRoles`: `ConfigViewer`, `ConfigEditor`, `FleetAdmin` (see `Services/AdminRoles.cs`). `CanEdit` policy requires `ConfigEditor` or `FleetAdmin`; `CanPublish` requires `FleetAdmin`. +- `OtOpcUaConfigDbContext` registered against `ConnectionStrings:ConfigDb`. +- Scoped services: `ClusterService`, `GenerationService`, `EquipmentService`, `UnsService`, `NamespaceService`, `DriverInstanceService`, `NodeAclService`, `PermissionProbeService`, `AclChangeNotifier`, `ReservationService`, `DraftValidationService`, `AuditLogService`, `HostStatusService`, `ClusterNodeService`, `EquipmentImportBatchService`, `ILdapGroupRoleMappingService`. +- Singleton `RedundancyMetrics` (meter name `ZB.MOM.WW.OtOpcUa.Redundancy`) + `CertTrustService` (promotes rejected client certs in the Server's PKI store to trusted via the Admin Certificates page). 
+- `LdapAuthService` bound to `Authentication:Ldap` — same LDAP flow as ScadaLink CentralUI for visual parity. +- SignalR hubs mapped at `/hubs/fleet` and `/hubs/alerts`; `FleetStatusPoller` runs as a hosted service and pushes `RoleChanged`, host status, and alert events. +- OpenTelemetry → Prometheus exporter at `/metrics` when `Metrics:Prometheus:Enabled=true` (default). Pull-based means no Collector required in the common K8s deploy. -## Shutdown Sequence +### Installation -`OpcUaService.Stop()` tears down components in reverse dependency order: +Deployed as an ASP.NET Core service; the generic-host `AddWindowsService` wrapper (or IIS reverse-proxy for multi-node fleets) provides install/uninstall. Listens on whatever `ASPNETCORE_URLS` specifies. -1. **Cancel operations** -- Signals the `CancellationTokenSource` to stop all background loops. -2. **Stop change detection** -- Stops the Galaxy deploy polling loop. -3. **Stop OPC UA server** -- Shuts down the OPC UA server host, disconnecting all client sessions. -4. **Stop MXAccess monitor** -- Stops the connectivity monitor loop. -5. **Disconnect MXAccess** -- Disconnects the MXAccess client and releases COM resources. -6. **Dispose STA thread** -- Shuts down the dedicated STA COM thread and its message pump. -7. **Stop dashboard** -- Disposes the `StatusWebServer` HTTP listener. -8. **Dispose metrics** -- Releases the performance metrics collector. -9. **Dispose change detection** -- Releases the change detection service. -10. **Unregister exception handler** -- Removes the `AppDomain.UnhandledException` handler. +## Galaxy.Host process -The entire shutdown is wrapped in a `try/catch` that logs warnings for errors during cleanup, ensuring the service exits even if a component fails to dispose cleanly. +`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Program.cs` is a .NET Framework 4.8 x86 console executable. 
Configuration comes from environment variables supplied by the supervisor (`Driver.Galaxy.Proxy.Supervisor`): -## Error Handling +| Env var | Purpose | +|---|---| +| `OTOPCUA_GALAXY_PIPE` | Pipe name the host listens on (default `OtOpcUaGalaxy`). | +| `OTOPCUA_ALLOWED_SID` | SID of the Server process's principal; anyone else is refused during the handshake. | +| `OTOPCUA_GALAXY_SECRET` | Per-spawn shared secret the client must present in the Hello frame. | +| `OTOPCUA_GALAXY_BACKEND` | `mxaccess` (default), `db` (ZB-only, no COM), `stub` (in-memory; for tests). | +| `OTOPCUA_GALAXY_ZB_CONN` | SQL connection string to the ZB Galaxy repository. | +| `OTOPCUA_HISTORIAN_*` | Optional Wonderware Historian SDK config if Historian is enabled for this node. | -### Unhandled exceptions +The host spins up `StaPump` (the STA thread with message pump), creates the MXAccess `LMXProxyServer` COM object on that thread, and handles all COM calls there; the IPC layer marshals work items via `PostThreadMessage`. -`AppDomain.CurrentDomain.UnhandledException` is registered at startup and removed at shutdown. The handler logs the exception at `Fatal` level with the `IsTerminating` flag: +### Pipe security -```csharp -Log.Fatal(e.ExceptionObject as Exception, - "Unhandled exception (IsTerminating={IsTerminating})", e.IsTerminating); -``` +`PipeServer` builds a `PipeAcl` from the provided `SecurityIdentifier` + uses `NamedPipeServerStream` with `maxNumberOfServerInstances: 1`. The handshake requires a matching shared secret in the first Hello frame; callers whose SID doesn't match `OTOPCUA_ALLOWED_SID` are rejected before any frame is processed. **By design the pipe ACL denies BUILTIN\Administrators** — live smoke tests must therefore run from a non-elevated shell that matches the allowed principal. The installed dev host (`OtOpcUaGalaxyHost`) runs as `dohertj2` with the secret at `.local/galaxy-host-secret.txt`. 
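The SID-plus-secret gate described above reduces to a small acceptance policy. The sketch below is an illustrative model only, not the actual `PipeServer` implementation; the function name and signature are invented:

```python
import hmac

def accept_handshake(caller_sid: str, hello_secret: str,
                     allowed_sid: str, expected_secret: str) -> bool:
    """Illustrative model of the Galaxy.Host pipe handshake policy."""
    # Identity gate runs first: a caller whose SID does not match
    # OTOPCUA_ALLOWED_SID is refused before any frame is processed.
    if caller_sid != allowed_sid:
        return False
    # The first Hello frame must then carry the per-spawn shared secret
    # (OTOPCUA_GALAXY_SECRET); constant-time comparison avoids timing leaks.
    return hmac.compare_digest(hello_secret.encode(), expected_secret.encode())
```

Note the ordering: the principal is checked before the secret, mirroring the rule that mismatched SIDs are rejected before any frame is processed.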
-### Startup resilience +### Installation -The startup sequence is designed to degrade gracefully rather than fail entirely: - -- If MXAccess connection fails, the service continues with a `NullMxAccessClient` that returns bad-quality values for all reads. -- If the Galaxy repository database is unreachable, the OPC UA server starts with an empty address space. -- If the status dashboard port is in use, the dashboard logs a warning and does not start, but the OPC UA server continues. - -### Fatal startup failure - -If a critical step (configuration validation, OPC UA server start) throws, `Start()` catches the exception, logs it at `Fatal`, and re-throws to let TopShelf report the failure. - -## Logging - -The service uses Serilog with two sinks configured in `Program.Main()`: - -```csharp -Log.Logger = new LoggerConfiguration() - .MinimumLevel.Information() - .WriteTo.Console() - .WriteTo.File( - path: "logs/lmxopcua-.log", - rollingInterval: RollingInterval.Day, - retainedFileCountLimit: 31) - .CreateLogger(); -``` - -| Sink | Details | -|------|---------| -| Console | Writes to stdout, useful when running as a console application | -| Rolling file | Writes to `logs/lmxopcua-{date}.log`, rolls daily, retains 31 days of history | - -Log files are written relative to the executable directory (see Working Directory above). Each component creates its own contextual logger using `Log.ForContext()` or `Log.ForContext(typeof(T))`. - -`Log.CloseAndFlush()` is called in the `finally` block of `Program.Main()` to ensure all buffered log entries are written before process exit. - -## Multi-Instance Deployment - -The service supports running multiple instances for redundancy. 
Each instance requires: - -- A unique Windows service name (e.g., `LmxOpcUa`, `LmxOpcUa2`) -- A unique OPC UA port and dashboard port -- A unique `OpcUa.ApplicationUri` and `OpcUa.ServerName` -- A unique `MxAccess.ClientName` -- Matching `Redundancy.ServerUris` arrays on all instances - -Install additional instances using TopShelf's `-servicename` flag: +Wrapped with NSSM (the Non-Sucking Service Manager) because the executable itself is a plain console app, not a `ServiceBase` Windows service. The supervisor then adopts the child process over the pipe after install. Install/uninstall commands follow the NSSM pattern: ```bash -cd C:\publish\lmxopcua\instance2 -ZB.MOM.WW.OtOpcUa.Host.exe install -servicename "LmxOpcUa2" -displayname "LMX OPC UA Server (Instance 2)" +nssm install OtOpcUaGalaxyHost "C:\Program Files (x86)\OtOpcUa\Galaxy.Host\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.exe" +nssm set OtOpcUaGalaxyHost ObjectName .\dohertj2 +nssm set OtOpcUaGalaxyHost AppEnvironmentExtra OTOPCUA_GALAXY_BACKEND=mxaccess OTOPCUA_GALAXY_SECRET=… OTOPCUA_ALLOWED_SID=… +nssm start OtOpcUaGalaxyHost ``` -See [Redundancy Guide](Redundancy.md) for full deployment details. +(Exact values for the environment block are generated by the Admin UI + committed alongside `.local/galaxy-host-secret.txt` on the dev box.) -## Required Runtime Assemblies - -The build uses Costura.Fody to embed all NuGet dependencies into the single `ZB.MOM.WW.OtOpcUa.Host.exe`. The only native dependency that must sit alongside the executable in every deployment is the MXAccess COM toolkit: - -| Assembly | Purpose | -|----------|---------| -| `ArchestrA.MxAccess.dll` | MXAccess COM interop — runtime data access to Galaxy tags | - -The Wonderware Historian SDK is packaged as a **runtime-loaded plugin** so hosts that will not use historical data access do not need the SDK installed. 

The plugin lives in a `Historian/` subfolder next to `ZB.MOM.WW.OtOpcUa.Host.exe`: +## Inter-process communication ``` -ZB.MOM.WW.OtOpcUa.Host.exe -ArchestrA.MxAccess.dll -Historian/ - ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll - aahClientManaged.dll - aahClientCommon.dll - aahClient.dll - Historian.CBE.dll - Historian.DPAPI.dll - ArchestrA.CloudHistorian.Contract.dll +┌──────────────────────────┐ LDAP bind (Authentication:Ldap) ┌──────────────────────────┐ +│ OtOpcUa Admin (x64) │ ─────────────────────────────────────────────▶│ LDAP / AD │ +│ Blazor Server + SignalR │ └──────────────────────────┘ +│ /metrics (Prometheus) │ FleetStatusPoller → ClusterNode poll +│ │ ─────────────────────────────────────────────▶┌──────────────────────────┐ +│ │ Cluster/Generation/ACL writes │ Config DB (SQL Server) │ +└──────────────────────────┘ ─────────────────────────────────────────────▶│ OtOpcUaConfigDbContext │ + ▲ └──────────────────────────┘ + │ SignalR ▲ + │ (role change, │ sp_GetCurrentGenerationForCluster + │ host status, │ sp_PublishGeneration + │ alerts) │ +┌──────────────────────────┐ │ +│ OtOpcUa Server (x64) │ ──────────────────────────────────────────────────────────┘ +│ OPC UA endpoint │ +│ Non-Galaxy drivers │ Named pipe (OtOpcUaGalaxy) ┌──────────────────────────┐ +│ Driver.Galaxy.Proxy │ ─────────────────────────────────────────────▶│ Galaxy.Host (x86 .NFx) │ +│ │ SID + shared-secret handshake │ STA + message pump │ +│ /healthz /readyz │ │ MXAccess COM │ +└──────────────────────────┘ │ Historian SDK (opt) │ + └──────────────────────────┘ ``` -At startup, if `Historian.Enabled=true` in `appsettings.json`, `HistorianPluginLoader` probes `Historian/ZB.MOM.WW.OtOpcUa.Historian.Aveva.dll` via `Assembly.LoadFrom` and instantiates the plugin's entry point. An `AppDomain.AssemblyResolve` handler redirects the SDK assembly lookups (`aahClientManaged`, `aahClientCommon`, …) to the same subfolder so the CLR can resolve them when the plugin first JITs. 
If the plugin directory is absent or any SDK dependency fails to load, the loader logs a warning and the server continues to run with history support disabled — `LmxNodeManager` returns `BadHistoryOperationUnsupported` for every history call. +## appsettings.json boundary -Deployment matrix: +Each process reads its own `appsettings.json` for **bootstrap only** — connection strings, LDAP bind config, transport security profile, redundancy node id, logging. The authoritative configuration tree (drivers, UNS, tags, ACLs) lives in the Config DB and is edited through the Admin UI. See [`Configuration.md`](Configuration.md) for the split. -| Scenario | Host exe | `ArchestrA.MxAccess.dll` | `Historian/` subfolder | -|----------|----------|--------------------------|------------------------| -| `Historian.Enabled=false` | required | required | **omit** | -| `Historian.Enabled=true` | required | required | required | +## Development bootstrap -`ArchestrA.MxAccess.dll` and the historian SDK DLLs are not redistributable — they are provided by the AVEVA System Platform and Historian installations on the target machine. The copies in `lib/` are taken from `Program Files (x86)\ArchestrA\Framework\bin` on a machine with the platform installed. - -## Platform Target - -The service must be compiled and run as x86 (32-bit). The MXAccess COM toolkit DLLs in `Program Files (x86)\ArchestrA\Framework\bin` are 32-bit only. Running the service as x64 or AnyCPU (64-bit preferred) causes COM interop failures when creating the `LMXProxyServer` object on the STA thread. +For the Windows install steps (SQL Server in Docker, .NET 10 SDK, .NET Framework 4.8 SDK, Docker Desktop WSL 2 backend, EF Core CLI, first-run migration), see [`docs/v2/dev-environment.md`](v2/dev-environment.md). 
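To make the bootstrap/authoritative boundary concrete, a Server-process `appsettings.json` under this split stays deliberately small. The sketch below is illustrative: `ConnectionStrings:ConfigDb` and the `OpcUaServer` keys follow names used elsewhere in these docs, while the `Redundancy` and `Serilog` shapes are assumptions, not confirmed key names.

```json
{
  "ConnectionStrings": {
    "ConfigDb": "Server=localhost;Database=OtOpcUaConfig;Integrated Security=true;TrustServerCertificate=true"
  },
  "OpcUaServer": {
    "EndpointUrl": "opc.tcp://0.0.0.0:4840/OtOpcUa",
    "ApplicationUri": "urn:node-a:OtOpcUa",
    "PkiStoreRoot": "C:/ProgramData/OtOpcUa/pki",
    "SecurityProfile": "Basic256Sha256-SignAndEncrypt"
  },
  "Redundancy": { "NodeId": "node-a" },
  "Serilog": { "MinimumLevel": "Information" }
}
```

What is absent is the point: no clusters, namespaces, tags, driver instances, or ACLs appear here; those rows live in the Config DB and change only through the Admin UI draft/publish flow.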
diff --git a/docs/StatusDashboard.md b/docs/StatusDashboard.md index 75896e0..5050c59 100644 --- a/docs/StatusDashboard.md +++ b/docs/StatusDashboard.md @@ -1,274 +1,16 @@ -# Status Dashboard +# Status Dashboard — Superseded -## Overview +This document has been superseded. -The service hosts an embedded HTTP status dashboard that surfaces real-time health, connection state, subscription counts, data change throughput, and Galaxy metadata. Operators access it through a browser to verify the bridge is functioning without needing an OPC UA client. The dashboard is enabled by default on port 8081 and can be disabled via configuration. +The single-process, HTTP-listener "Status Dashboard" (`StatusWebServer` bound to port 8081) belonged to v1 LmxOpcUa, where one process owned the OPC UA endpoint, the MXAccess bridge, and the operator surface. In the multi-process OtOpcUa platform the operator surface has moved into the **OtOpcUa Admin** app — a Blazor Server UI that talks to the shared Config DB and to every deployed node over SignalR (`FleetStatusHub`, `AlertHub`). Prometheus scraping lives on the Admin app's `/metrics` endpoint via OpenTelemetry (`Metrics:Prometheus:Enabled`). -## HTTP Server +Operator surfaces now covered by the Admin UI: -`StatusWebServer` wraps a `System.Net.HttpListener` bound to `http://+:{port}/`. It starts a background task that accepts requests in a loop and dispatches them by path. Only `GET` requests are accepted; all other methods return `405 Method Not Allowed`. Responses include `Cache-Control: no-cache` headers to prevent stale data in the browser. 
+- Fleet health, per-node role/ServiceLevel, crash-loop detection (`Fleet.razor`, `Hosts.razor`, `FleetStatusPoller`) +- Redundancy state + role transitions (`RedundancyMetrics`, `otopcua.redundancy.*`) +- Cluster + node + credential management (`ClusterService`, `ClusterNodeService`) +- Draft/publish generation editor, diff viewer, CSV import, UnsTab, IdentificationFields, RedundancyTab, AclsTab with Probe-this-permission +- Certificate trust management (`CertTrustService` promotes rejected client certs to trusted) +- Audit log viewer (`AuditLogService`) -### Endpoints - -| Path | Content-Type | Description | -|------|-------------|-------------| -| `/` | `text/html` | Operator dashboard with auto-refresh | -| `/health` | `text/html` | Focused health page with service-level badge and component cards | -| `/api/status` | `application/json` | Full status snapshot as JSON (`StatusData`) | -| `/api/health` | `application/json` | Health endpoint (`HealthEndpointData`) -- returns `503` when status is `Unhealthy`, `200` otherwise | - -Any other path returns `404 Not Found`. - -## Health Check Logic - -`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, 2d, and 2e only fire when the corresponding integration is enabled and a non-null snapshot is passed: - -1. **Rule 1 -- Unhealthy**: MXAccess connection state is not `Connected`. Returns a red banner with the current state. -2. **Rule 2b -- Degraded**: `Historian.Enabled=true` but the plugin load outcome is not `Loaded`. Returns a yellow banner citing the plugin status (`NotFound`, `LoadFailed`) and the error message if one is available. -3. **Rule 2 / 2c -- Degraded**: Any recorded operation has a low success rate. The sample threshold depends on the operation category: - - Regular operations (`Read`, `Write`, `Subscribe`, `AlarmAcknowledge`): >100 invocations and <50% success rate. 
- - Historian operations (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads. -4. **Rule 2d -- Degraded (latched)**: `AlarmTrackingEnabled=true` and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts. -5. **Rule 2e -- Degraded**: `RuntimeStatus.StoppedCount > 0` -- at least one Galaxy runtime host (`$WinPlatform` / `$AppEngine`) is currently reported Stopped by the runtime probe manager. The rule names the stopped hosts in the message. Ordered after Rule 1 so an MxAccess transport outage stays `Unhealthy` via Rule 1 and this rule never double-messages; the probe manager also forces every entry to `Unknown` when the transport is disconnected, so the `StoppedCount` is always 0 in that case. -6. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational." - -The `/api/health` endpoint returns `200` for both Healthy and Degraded states, and `503` only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection. - -## Status Data Model - -`StatusReportService` aggregates data from all bridge components into a `StatusData` DTO, which is then rendered as HTML or serialized to JSON. 
The DTO contains the following sections: - -### Connection - -| Field | Type | Description | -|-------|------|-------------| -| `State` | `string` | Current MXAccess connection state (Connected, Disconnected, Connecting) | -| `ReconnectCount` | `int` | Number of reconnect attempts since startup | -| `ActiveSessions` | `int` | Number of active OPC UA client sessions | - -### Health - -| Field | Type | Description | -|-------|------|-------------| -| `Status` | `string` | Healthy, Degraded, or Unhealthy | -| `Message` | `string` | Operator-facing explanation | -| `Color` | `string` | CSS color token (green, yellow, red, gray) | - -### Subscriptions - -| Field | Type | Description | -|-------|------|-------------| -| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions (includes bridge-owned runtime status probes — see `ProbeCount`) | -| `ProbeCount` | `int` | Subset of `ActiveCount` attributable to bridge-owned runtime status probes (`.ScanState` per deployed `$WinPlatform` / `$AppEngine`). 
Rendered as a separate `Probes: N (bridge-owned runtime status)` line on the dashboard so operators can distinguish probe overhead from client-driven subscription load | - -### Galaxy - -| Field | Type | Description | -|-------|------|-------------| -| `GalaxyName` | `string` | Name of the Galaxy being bridged | -| `DbConnected` | `bool` | Whether the Galaxy repository database is reachable | -| `LastDeployTime` | `DateTime?` | Most recent deploy timestamp from the Galaxy | -| `ObjectCount` | `int` | Number of Galaxy objects in the address space | -| `AttributeCount` | `int` | Number of Galaxy attributes as OPC UA variables | -| `LastRebuildTime` | `DateTime?` | UTC timestamp of the last completed address-space rebuild | - -### Data change - -| Field | Type | Description | -|-------|------|-------------| -| `EventsPerSecond` | `double` | Rate of MXAccess data change events per second | -| `AvgBatchSize` | `double` | Average items processed per dispatch cycle | -| `PendingItems` | `int` | Items waiting in the dispatch queue | -| `TotalEvents` | `long` | Total MXAccess data change events since startup | - -### Galaxy Runtime - -Populated from the `GalaxyRuntimeProbeManager` that advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine`. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) for the probe machinery, state machine, and the subtree quality invalidation that fires on transitions. Disabled when `MxAccess.RuntimeStatusProbesEnabled = false`; the panel is suppressed entirely from the HTML when `Total == 0`. 
- -| Field | Type | Description | -|-------|------|-------------| -| `Total` | `int` | Number of runtime hosts tracked (Platforms + AppEngines) | -| `RunningCount` | `int` | Hosts whose last probe callback reported `ScanState = true` with Good quality | -| `StoppedCount` | `int` | Hosts whose last probe callback reported `ScanState != true` or a failed item status, or whose initial probe timed out in Unknown state | -| `UnknownCount` | `int` | Hosts still awaiting initial probe resolution, or rewritten to Unknown when the MxAccess transport is Disconnected | -| `Hosts` | `List` | Per-host detail rows, sorted alphabetically by `ObjectName` | - -Each `GalaxyRuntimeStatus` entry: - -| Field | Type | Description | -|-------|------|-------------| -| `ObjectName` | `string` | Galaxy `tag_name` of the host (e.g., `DevPlatform`, `DevAppEngine`) | -| `GobjectId` | `int` | Galaxy `gobject_id` of the host | -| `Kind` | `string` | `$WinPlatform` or `$AppEngine` | -| `State` | `enum` | `Unknown`, `Running`, or `Stopped` | -| `LastStateCallbackTime` | `DateTime?` | UTC time of the most recent probe callback, whether good or bad | -| `LastStateChangeTime` | `DateTime?` | UTC time of the most recent Running↔Stopped transition; backs the dashboard "Since" column | -| `LastScanState` | `bool?` | Last `ScanState` value received; `null` before the first callback | -| `LastError` | `string?` | Detail message from the most recent failure callback (e.g., `"ScanState = false (OffScan)"`); cleared on successful recovery | -| `GoodUpdateCount` | `long` | Cumulative count of `ScanState = true` callbacks | -| `FailureCount` | `long` | Cumulative count of `ScanState != true` callbacks or failed item statuses | - -The HTML panel renders a per-host table with Name / Kind / State / Since / Last Error columns. 
Panel color reflects aggregate state: green when every host is `Running`, yellow when any host is `Unknown` with zero `Stopped`, red when any host is `Stopped`, gray when the MxAccess transport is disconnected (the Connection panel is the primary signal in that case and every row is force-rewritten to `Unknown`). - -### Operations - -A dictionary of `MetricsStatistics` keyed by operation name. Each entry contains: - -- `TotalCount` -- total invocations -- `SuccessRate` -- fraction of successful operations -- `AverageMilliseconds`, `MinMilliseconds`, `MaxMilliseconds`, `Percentile95Milliseconds` -- latency distribution - -The instrumented operation names are: - -| Name | Source | -|---|---| -| `Read` | MXAccess live tag reads (`MxAccessClient.ReadWrite.cs`) | -| `Write` | MXAccess live tag writes | -| `Subscribe` | MXAccess subscription attach | -| `HistoryReadRaw` | `LmxNodeManager.HistoryReadRawModified` -> historian plugin | -| `HistoryReadProcessed` | `LmxNodeManager.HistoryReadProcessed` -> historian plugin (aggregates) | -| `HistoryReadAtTime` | `LmxNodeManager.HistoryReadAtTime` -> historian plugin (interpolated) | -| `HistoryReadEvents` | `LmxNodeManager.HistoryReadEvents` -> historian plugin (alarm/event history) | -| `AlarmAcknowledge` | `LmxNodeManager.OnAlarmAcknowledge` -> MXAccess AckMsg write | - -New operation names are auto-registered on first use, so the `Operations` dictionary only contains entries for features that have actually been exercised since startup. - -### Historian - -`HistorianStatusInfo` -- reflects the outcome of the runtime-loaded historian plugin and the runtime query-health counters. See [Historical Data Access](HistoricalDataAccess.md) for the plugin architecture and the [Runtime Health Counters](HistoricalDataAccess.md#runtime-health-counters) section for the data source instrumentation. 
- -| Field | Type | Description | -|-------|------|-------------| -| `Enabled` | `bool` | Whether `Historian.Enabled` is set in configuration | -| `PluginStatus` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` — load-time outcome from `HistorianPluginLoader.LastOutcome` | -| `PluginError` | `string?` | Exception message from the last load attempt when `PluginStatus=LoadFailed`; otherwise `null` | -| `PluginPath` | `string` | Absolute path the loader probed for the plugin assembly | -| `ServerName` | `string` | Legacy single-node hostname from `Historian.ServerName`; ignored when `ServerNames` is non-empty | -| `Port` | `int` | Configured historian TCP port | -| `QueryTotal` | `long` | Total historian read queries attempted since startup (raw + aggregate + at-time + events) | -| `QuerySuccesses` | `long` | Queries that completed without an exception | -| `QueryFailures` | `long` | Queries that raised an exception — each failure also triggers the plugin's reconnect path | -| `ConsecutiveFailures` | `int` | Failures since the last success. Resets to zero on any successful query. Drives the `Degraded` health rule at threshold 3 | -| `LastSuccessTime` | `DateTime?` | UTC timestamp of the most recent successful query, or `null` when no query has succeeded since startup | -| `LastFailureTime` | `DateTime?` | UTC timestamp of the most recent failure | -| `LastQueryError` | `string?` | Exception message from the most recent failure. Prefixed with the read-path name (`raw:`, `aggregate:`, `at-time:`, `events:`) so operators can tell which SDK call failed | -| `ProcessConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **process** silo (historical value queries — `ReadRaw`, `ReadAggregate`, `ReadAtTime`). 
See [Two SDK connection silos](HistoricalDataAccess.md#two-sdk-connection-silos) | -| `EventConnectionOpen` | `bool` | Whether the plugin currently holds an open SDK connection for the **event** silo (alarm history queries — `ReadEvents`). Separate from the process connection because the SDK requires distinct query channels | -| `ActiveProcessNode` | `string?` | Cluster node currently serving the process silo, or `null` when no process connection is open | -| `ActiveEventNode` | `string?` | Cluster node currently serving the event silo, or `null` when no event connection is open | -| `NodeCount` | `int` | Total configured historian cluster nodes. 1 for a legacy single-node deployment | -| `HealthyNodeCount` | `int` | Nodes currently eligible for new connections (not in failure cooldown) | -| `Nodes` | `List` | Per-node cluster state in configuration order. Each entry carries `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime` | - -The operator dashboard renders a cluster table inside the Historian panel when `NodeCount > 1`. Legacy single-node deployments render a compact `Node: ` line and no table. Panel color reflects combined load-time + runtime health: green when everything is fine, yellow when any cluster node is in cooldown or 1-4 consecutive query failures are accumulated, red when the plugin is unloaded / all cluster nodes are failed / 5+ consecutive failures. - -### Alarms - -`AlarmStatusInfo` -- surfaces alarm-condition tracking health and dispatch counters. 
- -| Field | Type | Description | -|-------|------|-------------| -| `TrackingEnabled` | `bool` | Whether `OpcUa.AlarmTrackingEnabled` is set in configuration | -| `ConditionCount` | `int` | Number of distinct alarm conditions currently tracked | -| `ActiveAlarmCount` | `int` | Number of alarms currently in the `InAlarm=true` state | -| `TransitionCount` | `long` | Total `InAlarm` transitions observed in the dispatch loop since startup | -| `AckEventCount` | `long` | Total alarm acknowledgement transitions observed since startup | -| `AckWriteFailures` | `long` | Total MXAccess AckMsg writes that have failed while processing alarm acknowledges. Any non-zero value latches the service into Degraded (see Rule 2d). | -| `FilterEnabled` | `bool` | Whether `OpcUa.AlarmFilter.ObjectFilters` has any patterns configured | -| `FilterPatternCount` | `int` | Number of compiled filter patterns (after comma-splitting and trimming) | -| `FilterIncludedObjectCount` | `int` | Number of Galaxy objects included by the filter during the most recent address-space build. Zero when the filter is disabled. | - -When the filter is active, the operator dashboard's Alarms panel renders an extra line `Filter: N pattern(s), M object(s) included` so operators can verify scope at a glance. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter) for the matching rules and resolution algorithm. - -### Redundancy - -`RedundancyInfo` -- only populated when `Redundancy.Enabled=true` in configuration. Shows mode, role, computed service level, application URI, and the set of peer server URIs. See [Redundancy](Redundancy.md) for the full guide. - -### Footer - -| Field | Type | Description | -|-------|------|-------------| -| `Timestamp` | `DateTime` | UTC time when the snapshot was generated | -| `Version` | `string` | Service assembly version | - -## `/api/health` Payload - -The health endpoint returns a `HealthEndpointData` document distinct from the full dashboard snapshot. 
It is designed for load balancers and external monitoring probes that only need an up/down signal plus component-level detail: - -| Field | Type | Description | -|-------|------|-------------| -| `Status` | `string` | `Healthy`, `Degraded`, or `Unhealthy` (drives the HTTP status code) | -| `ServiceLevel` | `byte` | OPC UA-style 0-255 service level. 255 when healthy non-redundant; 0 when MXAccess is down; redundancy-adjusted otherwise | -| `RedundancyEnabled` | `bool` | Whether redundancy is configured | -| `RedundancyRole` | `string?` | `Primary` or `Secondary` when redundancy is enabled; `null` otherwise | -| `RedundancyMode` | `string?` | `Warm` or `Hot` when redundancy is enabled; `null` otherwise | -| `Components.MxAccess` | `string` | `Connected` or `Disconnected` | -| `Components.Database` | `string` | `Connected` or `Disconnected` | -| `Components.OpcUaServer` | `string` | `Running` or `Stopped` | -| `Components.Historian` | `string` | `Disabled`, `NotFound`, `LoadFailed`, or `Loaded` -- matches `HistorianStatusInfo.PluginStatus` | -| `Components.Alarms` | `string` | `Disabled` or `Enabled` -- mirrors `OpcUa.AlarmTrackingEnabled` | -| `Uptime` | `string` | Formatted service uptime (e.g., `3d 5h 20m`) | -| `Timestamp` | `DateTime` | UTC time the snapshot was generated | - -Monitoring tools should: - -- Alert on `Status=Unhealthy` (HTTP 503) for hard outages. -- Alert on `Status=Degraded` (HTTP 200) for latched or cumulative failures -- a degraded status means the server is still operating but a subsystem needs attention (historian plugin missing, alarm ack writes failing, history read error rate too high, etc.). - -## HTML Dashboards - -### `/` -- Operator dashboard - -Monospace, dark background, color-coded panels. Panels: Connection, Health, Redundancy (when enabled), Subscriptions, Data Change Dispatch, Galaxy Info, **Historian**, **Alarms**, Operations (table), Footer. Each panel border color reflects component state (green, yellow, red, or gray). 
- -The page includes a `` tag set to the configured `RefreshIntervalSeconds` (default 10 seconds), so the browser polls automatically without JavaScript. - -### `/health` -- Focused health view - -Large status badge, computed `ServiceLevel` value, redundancy summary (when enabled), and a row of component cards: MXAccess, Galaxy Database, OPC UA Server, **Historian**, **Alarm Tracking**. Each card turns red when its component is in a failure state and grey when disabled. Best for wallboards and quick at-a-glance monitoring. - -## Configuration - -The dashboard is configured through the `Dashboard` section in `appsettings.json`: - -```json -{ - "Dashboard": { - "Enabled": true, - "Port": 8081, - "RefreshIntervalSeconds": 10 - } -} -``` - -Setting `Enabled` to `false` prevents the `StatusWebServer` from starting. The `StatusReportService` is still created so that other components can query health programmatically, but no HTTP listener is opened. - -### Dashboard start failures are non-fatal - -If the dashboard is enabled but the configured port is already bound (e.g., a previous instance did not clean up, another service is squatting on the port, or the user lacks URL-reservation rights), `StatusWebServer.Start()` logs the listener exception at Error level and returns `false`. `OpcUaService` then logs a Warning, disposes the unstarted instance, sets `DashboardStartFailed = true`, and continues in degraded mode — the OPC UA endpoint still starts. Operators can detect the failure by searching the service log for: - -``` -[WRN] Status dashboard failed to bind on port {Port}; service continues without dashboard -``` - -Stability review 2026-04-13 Finding 2. - -## Component Wiring - -`StatusReportService` is initialized after all other service components are created. 
`OpcUaService.Start()` calls `SetComponents()` to supply the live references, including the historian configuration so the dashboard can label the plugin target and evaluate Rule 2b: - -```csharp -StatusReportInstance.SetComponents( - effectiveMxClient, - Metrics, - GalaxyStatsInstance, - ServerHost, - NodeManagerInstance, - _config.Redundancy, - _config.OpcUa.ApplicationUri, - _config.Historian); -``` - -This deferred wiring allows the report service to be constructed before the MXAccess client or node manager are fully initialized. If a component is `null`, the report service falls back to default values (e.g., `ConnectionState.Disconnected`, zero counts, `HistorianPluginStatus.Disabled`). - -The historian plugin status is sourced from `HistorianPluginLoader.LastOutcome`, which is updated on every load attempt. `OpcUaService` explicitly calls `HistorianPluginLoader.MarkDisabled()` when `Historian.Enabled=false` so the dashboard can distinguish "feature off" from "load failed" without ambiguity. +See [`docs/v2/admin-ui.md`](v2/admin-ui.md) for the current operator surface and [`docs/ServiceHosting.md`](ServiceHosting.md) for the three-process layout. diff --git a/docs/security.md b/docs/security.md index c185a56..7a726b4 100644 --- a/docs/security.md +++ b/docs/security.md @@ -1,15 +1,28 @@ -# Transport Security +# Security -## Overview +OtOpcUa has four independent security concerns. This document covers all four: -The LmxOpcUa server supports configurable transport security profiles that control how data is protected on the wire between OPC UA clients and the server. +1. **Transport security** — OPC UA secure channel (signing, encryption, X.509 trust). +2. **OPC UA authentication** — Anonymous / UserName / X.509 session identities; UserName tokens authenticated by LDAP bind. +3. **Data-plane authorization** — who can browse, read, subscribe, write, acknowledge alarms on which nodes. Evaluated by `PermissionTrie` against the Config DB `NodeAcl` tree. +4. 
**Control-plane authorization** — who can view or edit fleet configuration in the Admin UI. Gated by the `AdminRole` (`ConfigViewer` / `ConfigEditor` / `FleetAdmin`) claim from `LdapGroupRoleMapping`. + +Transport security and OPC UA authentication are per-node concerns configured in the Server's bootstrap `appsettings.json`. Data-plane ACLs and Admin role grants live in the Config DB. + +--- + +## Transport Security + +### Overview + +The OtOpcUa Server supports configurable OPC UA transport security profiles that control how data is protected on the wire between OPC UA clients and the server. There are two distinct layers of security in OPC UA: -- **Transport security** -- secures the communication channel itself using TLS-style certificate exchange, message signing, and encryption. This is what the `Security` configuration section controls. -- **UserName token encryption** -- protects user credentials (username/password) sent during session activation. The OPC UA stack encrypts UserName tokens using the server's application certificate regardless of the transport security mode. This means UserName authentication works on `None` endpoints too — the credentials themselves are always encrypted. However, a secure transport profile adds protection against message-level tampering and eavesdropping of data payloads. +- **Transport security** -- secures the communication channel itself using TLS-style certificate exchange, message signing, and encryption. This is what the `OpcUaServer:SecurityProfile` setting controls. +- **UserName token encryption** -- protects user credentials (username/password) sent during session activation. The OPC UA stack encrypts UserName tokens using the server's application certificate regardless of the transport security mode. UserName authentication therefore works on `None` endpoints too — the credentials themselves are always encrypted. 
A secure transport profile adds protection against message-level tampering and eavesdropping of data payloads. -## Supported Security Profiles +### Supported security profiles The server supports seven transport security profiles: @@ -23,334 +36,88 @@ The server supports seven transport security profiles: | `Aes256_Sha256_RsaPss-Sign` | Aes256_Sha256_RsaPss | Sign | Strongest profile with AES-256 and RSA-PSS signatures. | | `Aes256_Sha256_RsaPss-SignAndEncrypt` | Aes256_Sha256_RsaPss | SignAndEncrypt | Strongest profile. Recommended for high-security deployments. | -Multiple profiles can be enabled simultaneously. The server exposes a separate endpoint for each configured profile, and clients select the one they prefer during connection. +The server exposes a separate endpoint for each configured profile, and clients select the one they prefer during connection. -If no valid profiles are configured (or all names are unrecognized), the server falls back to `None` with a warning in the log. +### Configuration -## Configuration - -Transport security is configured in the `Security` section of `appsettings.json`: +Transport security is configured in the `OpcUaServer` section of the Server process's bootstrap `appsettings.json`: ```json { - "Security": { - "Profiles": ["None"], - "AutoAcceptClientCertificates": true, - "RejectSHA1Certificates": true, - "MinimumCertificateKeySize": 2048, - "PkiRootPath": null, - "CertificateSubject": null + "OpcUaServer": { + "EndpointUrl": "opc.tcp://0.0.0.0:4840/OtOpcUa", + "ApplicationName": "OtOpcUa Server", + "ApplicationUri": "urn:node-a:OtOpcUa", + "PkiStoreRoot": "C:/ProgramData/OtOpcUa/pki", + "AutoAcceptUntrustedClientCertificates": false, + "SecurityProfile": "Basic256Sha256-SignAndEncrypt" } } ``` -### Properties +The server certificate is auto-generated on first start if none exists in `PkiStoreRoot/own/`. It is generated even for `None`-only deployments because UserName token encryption depends on it.
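Profile names follow a `{SecurityPolicy}-{MessageSecurityMode}` shape (the policy and mode columns in the table above). A minimal sketch of splitting one apart; the helper is hypothetical, not the server's actual parser:

```csharp
using System;

// Illustrative only: split "Aes256_Sha256_RsaPss-SignAndEncrypt" into its
// policy and message-security-mode halves. Policy names use '_' internally,
// so the last '-' is the policy/mode boundary; bare "None" has no mode suffix.
static (string Policy, string Mode) SplitProfile(string name)
{
    int i = name.LastIndexOf('-');
    return i < 0 ? (name, "None") : (name[..i], name[(i + 1)..]);
}

var (policy, mode) = SplitProfile("Aes256_Sha256_RsaPss-SignAndEncrypt");
Console.WriteLine($"{policy} / {mode}");
// prints "Aes256_Sha256_RsaPss / SignAndEncrypt"
```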
-| Property | Type | Default | Description | -|--------------------------------|------------|--------------------------------------------------|-------------| -| `Profiles` | `string[]` | `["None"]` | List of security profile names to expose as server endpoints. Valid values: `None`, `Basic256Sha256-Sign`, `Basic256Sha256-SignAndEncrypt`, `Aes128_Sha256_RsaOaep-Sign`, `Aes128_Sha256_RsaOaep-SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign`, `Aes256_Sha256_RsaPss-SignAndEncrypt`. Profile names are case-insensitive. Duplicates are ignored. | -| `AutoAcceptClientCertificates` | `bool` | `true` | When `true`, the server automatically trusts client certificates that are not already in the trusted store. Set to `false` in production for explicit trust management. | -| `RejectSHA1Certificates` | `bool` | `true` | When `true`, client certificates signed with SHA-1 are rejected. SHA-1 is considered cryptographically weak. | -| `MinimumCertificateKeySize` | `int` | `2048` | Minimum RSA key size (in bits) required for client certificates. Certificates with shorter keys are rejected. | -| `PkiRootPath` | `string?` | `null` (defaults to `%LOCALAPPDATA%\OPC Foundation\pki`) | Override for the PKI root directory where certificates are stored. When `null`, uses the OPC Foundation default location. | -| `CertificateSubject` | `string?` | `null` (defaults to `CN={ServerName}, O=ZB MOM, DC=localhost`) | Override for the server certificate subject name. When `null`, the subject is derived from the configured `ServerName`. 
| - -### Example: Development (no security) - -```json -{ - "Security": { - "Profiles": ["None"], - "AutoAcceptClientCertificates": true - } -} -``` - -### Example: Production (encrypted only) - -```json -{ - "Security": { - "Profiles": ["Basic256Sha256-SignAndEncrypt"], - "AutoAcceptClientCertificates": false, - "RejectSHA1Certificates": true, - "MinimumCertificateKeySize": 2048 - } -} -``` - -### Example: Mixed (sign and encrypt endpoints, no plaintext) - -```json -{ - "Security": { - "Profiles": ["Basic256Sha256-Sign", "Basic256Sha256-SignAndEncrypt"], - "AutoAcceptClientCertificates": false - } -} -``` - -## PKI Directory Layout - -The server stores certificates in a directory-based PKI store. The default root is: +### PKI directory layout ``` -%LOCALAPPDATA%\OPC Foundation\pki\ -``` - -This can be overridden with the `PkiRootPath` setting. The directory structure is: - -``` -pki/ +{PkiStoreRoot}/ own/ Server's own application certificate and private key issuer/ CA certificates that issued trusted client certificates trusted/ Explicitly trusted client (peer) certificates rejected/ Certificates that were presented but not trusted ``` -### Certificate Trust Flow +### Certificate trust flow When a client connects using a secure profile (`Sign` or `SignAndEncrypt`), the following trust evaluation occurs: 1. The client presents its application certificate during the secure channel handshake. 2. The server checks whether the certificate exists in the `trusted/` store. -3. If found, the connection proceeds (subject to key size and SHA-1 checks). -4. If not found and `AutoAcceptClientCertificates` is `true`, the certificate is automatically copied to `trusted/` and the connection proceeds. -5. If not found and `AutoAcceptClientCertificates` is `false`, the certificate is copied to `rejected/` and the connection is refused. -6. 
Regardless of trust status, the certificate must meet the `MinimumCertificateKeySize` requirement and pass the SHA-1 check (if `RejectSHA1Certificates` is `true`). +3. If found, the connection proceeds. +4. If not found and `AutoAcceptUntrustedClientCertificates` is `true`, the certificate is automatically copied to `trusted/` and the connection proceeds. +5. If not found and `AutoAcceptUntrustedClientCertificates` is `false`, the certificate is copied to `rejected/` and the connection is refused. -On first startup with a secure profile, the server automatically generates a self-signed application certificate in the `own/` directory if one does not already exist. +The Admin UI `Certificates.razor` page uses `CertTrustService` (singleton reading `CertTrustOptions` for the Server's `PkiStoreRoot`) to promote rejected client certs to trusted without operators having to file-copy manually. -## Production Hardening +### Production hardening -The default settings prioritize ease of development. Before deploying to production, apply the following changes: - -### 1. Disable automatic certificate acceptance - -Set `AutoAcceptClientCertificates` to `false` so that only explicitly trusted client certificates are accepted: - -```json -{ - "Security": { - "AutoAcceptClientCertificates": false - } -} -``` - -After changing this setting, you must manually copy each client's application certificate (the `.der` file) into the `trusted/` directory. - -### 2. Remove the None profile - -Remove `None` from the `Profiles` list to prevent unencrypted connections: - -```json -{ - "Security": { - "Profiles": ["Aes256_Sha256_RsaPss-SignAndEncrypt"] - } -} -``` - -### 3. Configure LDAP authentication - -Enable LDAP authentication to validate credentials against the GLAuth server. LDAP group membership controls what each user can do (read, write, alarm acknowledgment). See [Configuration Guide](Configuration.md) for the full LDAP property reference. 
- -```json -{ - "Authentication": { - "AllowAnonymous": false, - "AnonymousCanWrite": false, - "Ldap": { - "Enabled": true, - "Host": "localhost", - "Port": 3893, - "BaseDN": "dc=lmxopcua,dc=local", - "ServiceAccountDn": "cn=serviceaccount,dc=lmxopcua,dc=local", - "ServiceAccountPassword": "serviceaccount123" - } - } -} -``` - -While UserName tokens are always encrypted by the OPC UA stack (using the server certificate), enabling a secure transport profile adds protection against message-level tampering and data eavesdropping. - -### 4. Review the rejected certificate store - -Periodically inspect the `rejected/` directory. Certificates that appear here were presented by clients but were not trusted. If you recognize a legitimate client certificate, move it to the `trusted/` directory to grant access. - -## X.509 Certificate Authentication - -The server supports X.509 certificate-based user authentication in addition to Anonymous and UserName tokens. When any non-None security profile is configured, the server advertises `UserTokenType.Certificate` in its endpoint descriptions. - -Clients can authenticate by presenting an X.509 certificate. The server extracts the Common Name (CN) from the certificate subject and assigns the `AuthenticatedUser` and `ReadOnly` roles. The authentication is logged with the certificate's CN, subject, and thumbprint. - -X.509 authentication is available automatically when transport security is enabled -- no additional configuration is required. - -## Audit Logging - -The server generates audit log entries for security-relevant operations. All audit entries use the `AUDIT:` prefix and are written to the Serilog rolling file sink for compliance review. 
- -Audited events: -- **Authentication success**: Logs username, assigned roles, and session ID -- **Authentication failure**: Logs username and session ID -- **X.509 authentication**: Logs certificate CN, subject, and thumbprint -- **Certificate validation**: Logs certificate subject, thumbprint, and expiry for all validation events (accepted or rejected) -- **Write access denial**: Logged by the role-based access control system when a user lacks the required role - -Example audit log entries: -``` -AUDIT: Authentication SUCCESS for user admin with roles [ReadOnly, WriteOperate, AlarmAck] session abc123 -AUDIT: Authentication FAILED for user baduser from session def456 -X509 certificate authenticated: CN=ClientApp, Subject=CN=ClientApp,O=Acme, Thumbprint=AB12CD34 -``` - -## CLI Examples - -The Client CLI supports the `-S` (or `--security`) flag to select the transport security mode when connecting. Valid values are `none`, `sign`, `encrypt`, and `signandencrypt`. - -### Connect with no security - -```bash -dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- connect -u opc.tcp://localhost:4840/LmxOpcUa -S none -``` - -### Connect with signing - -```bash -dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- connect -u opc.tcp://localhost:4840/LmxOpcUa -S sign -``` - -### Connect with signing and encryption - -```bash -dotnet run --project src/ZB.MOM.WW.OtOpcUa.Client.CLI -- connect -u opc.tcp://localhost:4840/LmxOpcUa -S encrypt -``` - -### Browse with encryption and authentication - -```bash -dotnet run -- browse -u opc.tcp://localhost:4840/LmxOpcUa -S encrypt -U operator -P secure-password -r -d 3 -``` - -### Read a node with signing - -```bash -dotnet run -- read -u opc.tcp://localhost:4840/LmxOpcUa -S sign -n "ns=2;s=TestMachine_001/Speed" -``` - -The CLI tool auto-generates its own client certificate on first use (stored under `%LOCALAPPDATA%\OpcUaCli\pki\own\`). 
When connecting to a server with `AutoAcceptClientCertificates` set to `false`, you must copy the CLI tool's certificate into the server's `trusted/` directory before the connection will succeed. - -## Troubleshooting - -### Certificate trust failure - -**Symptom:** The client receives a `BadSecurityChecksFailed` or `BadCertificateUntrusted` error when connecting. - -**Cause:** The server does not trust the client's certificate (or vice versa), and `AutoAcceptClientCertificates` is `false`. - -**Resolution:** -1. Check the server's `rejected/` directory for the client's certificate file. -2. Copy the `.der` file from `rejected/` to `trusted/`. -3. Retry the connection. -4. If the server's own certificate is not trusted by the client, copy the server's certificate from `pki/own/certs/` to the client's trusted store. - -### Endpoint mismatch - -**Symptom:** The client receives a `BadSecurityModeRejected` or `BadSecurityPolicyRejected` error, or reports "No endpoint found with security mode...". - -**Cause:** The client is requesting a security mode that the server does not expose. For example, the client requests `SignAndEncrypt` but the server only has `None` configured. - -**Resolution:** -1. Verify the server's configured `Profiles` in `appsettings.json`. -2. Ensure the profile matching the client's requested mode is listed (e.g., add `Basic256Sha256-SignAndEncrypt` for encrypted connections). -3. Restart the server after changing the configuration. -4. Use the CLI tool to verify available endpoints: - ```bash - dotnet run -- connect -u opc.tcp://localhost:4840/LmxOpcUa -S none - ``` - The output displays the security mode and policy of the connected endpoint. - -### Server certificate not generated - -**Symptom:** The server logs a warning about application certificate check failure on startup. - -**Cause:** The `pki/own/` directory may not be writable, or the certificate generation failed. - -**Resolution:** -1. 
Ensure the service account has write access to the PKI root directory. -2. Check that the `PkiRootPath` (if overridden) points to a valid, writable location. -3. Delete any corrupt certificate files in `pki/own/` and restart the server to trigger regeneration. - -### SHA-1 certificate rejection - -**Symptom:** A client with a valid certificate is rejected, and the server logs mention SHA-1. - -**Cause:** The client's certificate was signed with SHA-1, and `RejectSHA1Certificates` is `true` (the default). - -**Resolution:** -- Regenerate the client certificate using SHA-256 or stronger (recommended). -- Alternatively, set `RejectSHA1Certificates` to `false` in the server configuration (not recommended for production). +- Set `AutoAcceptUntrustedClientCertificates = false`. +- Drop `None` from the profile set. +- Use the Admin UI to promote trusted client certs rather than the auto-accept fallback. +- Periodically audit the `rejected/` directory; an unexpected entry is often a misconfigured client or a probe attempt. --- -## LDAP Authentication +## OPC UA Authentication -The server supports LDAP-based user authentication via GLAuth (or any standard LDAP server). When enabled, OPC UA `UserName` token credentials are validated by LDAP bind. LDAP group membership is resolved once during authentication and mapped to custom OPC UA role `NodeId`s in the `urn:zbmom:lmxopcua:roles` namespace. These role NodeIds are stored on the session's `RoleBasedIdentity.GrantedRoleIds` and checked directly during write and alarm-ack operations. +The Server accepts three OPC UA identity-token types: -### Architecture +| Token | Handler | Notes | +|---|---|---| +| Anonymous | `IUserAuthenticator.AuthenticateAsync(username: "", password: "")` | Refused in strict mode unless explicit anonymous grants exist; allowed in lax mode for backward compatibility. 
| +| UserName/Password | `LdapUserAuthenticator` (`src/ZB.MOM.WW.OtOpcUa.Server/Security/LdapUserAuthenticator.cs`) | LDAP bind + group lookup; resolved `LdapGroups` flow into the session's identity bearer (`ILdapGroupsBearer`). | +| X.509 Certificate | Stack-level acceptance + role mapping via CN | X.509 identity carries `AuthenticatedUser` + read roles; finer-grain authorization happens through the data-plane ACLs. | -``` -OPC UA Client → UserName Token → LmxOpcUa Server → LDAP Bind (validate credentials) - → LDAP Search (resolve group membership) - → Map groups to OPC UA role NodeIds - → Store on RoleBasedIdentity.GrantedRoleIds - → Permission checks via GrantedRoleIds.Contains() +### LDAP bind flow (`LdapUserAuthenticator`) + +`Program.cs` in the Server registers the authenticator based on `OpcUaServer:Ldap`: + +```csharp +builder.Services.AddSingleton(sp => ldapOptions.Enabled + ? new LdapUserAuthenticator(ldapOptions, sp.GetRequiredService>()) + : new DenyAllUserAuthenticator()); ``` -### LDAP Groups and OPC UA Permissions +`LdapUserAuthenticator`: -All authenticated LDAP users can browse and read nodes regardless of group membership. Groups grant additional permissions: +1. Refuses to bind over plain-LDAP unless `AllowInsecureLdap = true` (dev/test only). +2. Connects to `Server:Port`, optionally upgrades to TLS (`UseTls = true`, port 636 for AD). +3. Binds as the service account; searches `SearchBase` for `UserNameAttribute = username`. +4. Rebinds as the resolved user DN with the supplied password (the actual credential check). +5. Reads `GroupAttribute` (default `memberOf`) and strips the leading `CN=` so operators configure friendly group names in `GroupToRole`. +6. Returns a `UserAuthResult` carrying the validated username + the set of LDAP groups. The set flows through to the session identity via `ILdapGroupsBearer.LdapGroups`. 
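Step 5's `CN=` stripping can be sketched in isolation (the helper name is illustrative, not the actual `LdapUserAuthenticator` member):

```csharp
using System;

// Illustrative helper: reduce a memberOf DN to the friendly group name that
// operators use as GroupToRole keys, by taking the first RDN and stripping CN=.
static string FriendlyGroupName(string memberOfDn)
{
    string firstRdn = memberOfDn.Split(',')[0].Trim();
    return firstRdn.StartsWith("CN=", StringComparison.OrdinalIgnoreCase)
        ? firstRdn[3..]
        : firstRdn;
}

Console.WriteLine(
    FriendlyGroupName("CN=OPCUA-Operators,OU=Groups,DC=corp,DC=example,DC=com"));
// prints "OPCUA-Operators"
```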
-| LDAP Group | Permission | -|---|---| -| ReadOnly | No additional permissions (read-only access) | -| WriteOperate | Write FreeAccess and Operate attributes | -| WriteTune | Write Tune attributes | -| WriteConfigure | Write Configure attributes | -| AlarmAck | Acknowledge alarms | - -Users can belong to multiple groups. The `admin` user in the default GLAuth configuration belongs to all groups. - -### Effective Permission Matrix - -The effective permission for a write operation depends on two factors: the user's session role (from LDAP group membership or anonymous access) and the Galaxy attribute's security classification. The security classification controls the node's `AccessLevel` — attributes classified as `SecuredWrite`, `VerifiedWrite`, or `ViewOnly` are exposed as read-only nodes regardless of the user's role. For writable classifications, the required write role depends on the classification. - -| | FreeAccess | Operate | SecuredWrite | VerifiedWrite | Tune | Configure | ViewOnly | -|---|---|---|---|---|---|---|---| -| **Anonymous (`AnonymousCanWrite=true`)** | Write | Write | Read | Read | Write | Write | Read | -| **Anonymous (`AnonymousCanWrite=false`)** | Read | Read | Read | Read | Read | Read | Read | -| **ReadOnly** | Read | Read | Read | Read | Read | Read | Read | -| **WriteOperate** | Write | Write | Read | Read | Read | Read | Read | -| **WriteTune** | Read | Read | Read | Read | Write | Read | Read | -| **WriteConfigure** | Read | Read | Read | Read | Read | Write | Read | -| **AlarmAck** (only) | Read | Read | Read | Read | Read | Read | Read | -| **Admin** (all groups) | Write | Write | Read | Read | Write | Write | Read | - -All roles can browse and read all nodes. The "Read" entries above mean the node is either read-only by classification or the user lacks the required write role. "Write" means the write is permitted by both the node's classification and the user's role. 
- -Alarm acknowledgment is an independent permission controlled by the `AlarmAck` role and is not affected by security classification. - -### GLAuth Setup - -The project uses [GLAuth](https://github.com/glauth/glauth) v2.4.0 as the LDAP server, installed at `C:\publish\glauth\`. See `C:\publish\glauth\auth.md` for the complete user/group reference and service management commands. - -### Configuration - -Enable LDAP in `appsettings.json` under `Authentication.Ldap`. See [Configuration Guide](Configuration.md) for the full property reference. - -### Active Directory configuration - -Production deployments typically point at Active Directory instead of GLAuth. Only four properties differ from the dev defaults: `Server`, `Port`, `UserNameAttribute`, and `ServiceAccountDn`. The same `GroupToRole` mechanism works — map your AD security groups to OPC UA roles. +Configuration example (Active Directory production): ```json { @@ -362,32 +129,169 @@ Production deployments typically point at Active Directory instead of GLAuth. On "UseTls": true, "AllowInsecureLdap": false, "SearchBase": "DC=corp,DC=example,DC=com", - "ServiceAccountDn": "CN=OpcUaSvc,OU=Service Accounts,DC=corp,DC=example,DC=com", + "ServiceAccountDn": "CN=OtOpcUaSvc,OU=Service Accounts,DC=corp,DC=example,DC=com", "ServiceAccountPassword": "", - "DisplayNameAttribute": "displayName", "GroupAttribute": "memberOf", "UserNameAttribute": "sAMAccountName", "GroupToRole": { "OPCUA-Operators": "WriteOperate", "OPCUA-Engineers": "WriteConfigure", - "OPCUA-AlarmAck": "AlarmAck", - "OPCUA-Tuners": "WriteTune" + "OPCUA-Tuners": "WriteTune", + "OPCUA-AlarmAck": "AlarmAck" } } } } ``` -Notes: +`UserNameAttribute: "sAMAccountName"` is the critical AD override — the default `uid` is not populated on AD user entries. Use `userPrincipalName` instead if operators log in with `user@corp.example.com` form. Nested group membership is not expanded — assign users directly to the role-mapped groups, or pre-flatten in AD. 
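If transitive expansion is ever added to the authenticator, the standard Active Directory mechanism is the `LDAP_MATCHING_RULE_IN_CHAIN` matching rule (OID `1.2.840.113556.1.4.1941`). A hedged sketch of the search filter such an enhancement would issue; the helper name is illustrative and OtOpcUa does not run this query today:

```csharp
using System;

// Illustrative only: AD resolves direct AND nested membership in one
// server-side search when the memberOf comparison uses the
// LDAP_MATCHING_RULE_IN_CHAIN matching rule (OID 1.2.840.113556.1.4.1941).
static string TransitiveMembersFilter(string groupDn) =>
    $"(&(objectClass=user)(memberOf:1.2.840.113556.1.4.1941:={groupDn}))";

Console.WriteLine(TransitiveMembersFilter(
    "CN=OPCUA-Operators,OU=Groups,DC=corp,DC=example,DC=com"));
```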
-- `UserNameAttribute: "sAMAccountName"` is the critical AD override — the default `uid` is not populated on AD user entries, so the user-DN lookup returns no results without it. Use `userPrincipalName` instead if operators log in with `user@corp.example.com` form. -- `Port: 636` + `UseTls: true` is required under AD's LDAP-signing enforcement. AD increasingly rejects plain-LDAP bind; set `AllowInsecureLdap: false` to refuse fallback. -- `ServiceAccountDn` should name a dedicated read-only service principal — not a privileged admin. The account needs read access to user and group entries in the search base. -- `memberOf` values come back as full DNs like `CN=OPCUA-Operators,OU=OPC UA Security Groups,OU=Groups,DC=corp,DC=example,DC=com`. The authenticator strips the leading `CN=` RDN value so operators configure `GroupToRole` with readable group common-names. -- Nested group membership is **not** expanded — assign users directly to the role-mapped groups, or pre-flatten membership in AD. `LDAP_MATCHING_RULE_IN_CHAIN` / `tokenGroups` expansion is an authenticator enhancement, not a config change. +The same options bind the Admin's `LdapAuthService` (cookie auth / login form) so operators authenticate with a single credential across both processes. -### Security Considerations +--- -- LDAP credentials are transmitted in plaintext over the OPC UA channel unless transport security is enabled. Use `Basic256Sha256-SignAndEncrypt` for production deployments. -- The GLAuth LDAP server itself listens on plain LDAP (port 3893). Enable LDAPS in `glauth.cfg` for environments where LDAP traffic crosses network boundaries. -- The service account password is stored in `appsettings.json`. Protect this file with appropriate filesystem permissions. 
+## Data-Plane Authorization + +Data-plane authorization is the check run on every OPC UA operation against an OtOpcUa endpoint: *can this authenticated user Browse / Read / Subscribe / Write / HistoryRead / AckAlarm / Call on this specific node?* + +Per decision #129 the model is **additive-only — no explicit Deny**. Grants at each hierarchy level union; absence of a grant is the default-deny. + +### Hierarchy + +ACLs are evaluated against the UNS path: + +``` +ClusterId → Namespace → UnsArea → UnsLine → Equipment → Tag +``` + +Each level can carry `NodeAcl` rows (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/NodeAcl.cs`) that grant a permission bundle to a set of `LdapGroups`. + +### Permission flags + +```csharp +[Flags] +public enum NodePermissions : uint +{ + Browse = 1 << 0, + Read = 1 << 1, + Subscribe = 1 << 2, + HistoryRead = 1 << 3, + WriteOperate = 1 << 4, + WriteTune = 1 << 5, + WriteConfigure = 1 << 6, + AlarmRead = 1 << 7, + AlarmAcknowledge = 1 << 8, + AlarmConfirm = 1 << 9, + AlarmShelve = 1 << 10, + MethodCall = 1 << 11, + + ReadOnly = Browse | Read | Subscribe | HistoryRead | AlarmRead, + Operator = ReadOnly | WriteOperate | AlarmAcknowledge | AlarmConfirm, + Engineer = Operator | WriteTune | AlarmShelve, + Admin = Engineer | WriteConfigure | MethodCall, +} +``` + +The three Write tiers map to Galaxy's v1 `SecurityClassification` — `FreeAccess`/`Operate` → `WriteOperate`, `Tune` → `WriteTune`, `Configure` → `WriteConfigure`. `SecuredWrite` / `VerifiedWrite` / `ViewOnly` classifications remain read-only from OPC UA regardless of grant. + +### Evaluator — `PermissionTrie` + +`src/ZB.MOM.WW.OtOpcUa.Core/Authorization/`: + +| Class | Role | +|---|---| +| `PermissionTrie` | Cluster-scoped trie; each node carries `(GroupId → NodePermissions)` grants. | +| `PermissionTrieBuilder` | Builds a trie from the current `NodeAcl` rows in one pass. 
| +| `PermissionTrieCache` | Per-cluster memoised trie; invalidated via `AclChangeNotifier` when the Admin publishes a draft that touches ACLs. | +| `TriePermissionEvaluator` | Implements `IPermissionEvaluator.Authorize(session, operation, scope)` — walks from the root to the leaf for the supplied `NodeScope`, unions grants along the path, compares required permission to the union. | + +`NodeScope` carries `(ClusterId, NamespaceId, AreaId, LineId, EquipmentId, TagId)`; any suffix may be null — a tag-level ACL is more specific than an area-level ACL but both contribute via union. + +### Dispatch gate — `AuthorizationGate` + +`src/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs` bridges the OPC UA stack's `ISystemContext.UserIdentity` to the evaluator. `DriverNodeManager` holds exactly one reference to it and calls `IsAllowed(identity, OpcUaOperation.*, NodeScope)` on every Read, Write, HistoryRead, Browse, Subscribe, AckAlarm, Call path. A false return short-circuits the dispatch with `BadUserAccessDenied`. + +Key properties: + +- **Driver-agnostic.** No driver-level code participates in authorization decisions. Drivers report `SecurityClassification` as metadata on tag discovery; everything else flows through `AuthorizationGate`. +- **Fail-open-during-transition.** `StrictMode = false` (default during ACL rollouts) lets sessions without resolved LDAP groups proceed; flip `Authorization:StrictMode = true` in production once ACLs are populated. +- **Evaluator stays pure.** `TriePermissionEvaluator` has no OPC UA stack dependency — it's tested directly from xUnit. + +### Probe-this-permission (Admin UI) + +`PermissionProbeService` (`src/ZB.MOM.WW.OtOpcUa.Admin/Services/PermissionProbeService.cs`) lets an operator ask "if a user with groups X, Y, Z asked to do operation O on node N, would it succeed?" The answer is rendered in the AclsTab "Probe" dialog — same evaluator, same trie, so the Admin UI answer and the live Server answer cannot disagree. 
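The additive-only union walk described above can be sketched with simplified types: a small flags subset and plain dictionaries standing in for `PermissionTrie` nodes. All names here are illustrative, not the Core project's API:

```csharp
using System;
using System.Collections.Generic;

// Usage: an area-level ACL grants Read to OPCUA-Operators and a tag-level ACL
// adds WriteOperate; the union along the path carries both.
var path = new List<IReadOnlyDictionary<string, Perm>>
{
    new Dictionary<string, Perm> { ["OPCUA-Operators"] = Perm.Read },         // UnsArea ACL
    new Dictionary<string, Perm> { ["OPCUA-Operators"] = Perm.WriteOperate }, // Tag ACL
};
var groups = new HashSet<string> { "OPCUA-Operators" };
Console.WriteLine(Acl.IsAllowed(path, groups, Perm.WriteOperate)); // True
Console.WriteLine(Acl.IsAllowed(path, new HashSet<string>(), Perm.Read)); // False

[Flags]
enum Perm : uint { None = 0, Read = 1, Subscribe = 2, WriteOperate = 4 }

// Simplified sketch of the evaluator's additive-only semantics: union every
// grant for the user's groups from root to leaf, then require the needed flags
// in the union. There is no Deny to subtract (decision #129); absence of any
// grant is the default deny.
static class Acl
{
    public static bool IsAllowed(
        IReadOnlyList<IReadOnlyDictionary<string, Perm>> grantsAlongPath,
        IReadOnlySet<string> userGroups,
        Perm required)
    {
        var union = Perm.None;
        foreach (var level in grantsAlongPath)       // root -> leaf
            foreach (var (group, perm) in level)
                if (userGroups.Contains(group))
                    union |= perm;                   // grants only ever add
        return (union & required) == required;
    }
}
```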
+ +### Full model + +See [`docs/v2/acl-design.md`](v2/acl-design.md) for the complete design: trie invalidation, flag semantics, per-path override rules, and the reasoning behind additive-only (no Deny). + +--- + +## Control-Plane Authorization + +Control-plane authorization governs **the Admin UI** — who can view fleet config, edit drafts, publish generations, manage cluster nodes + credentials. + +Per decision #150 control-plane roles are **deliberately independent of data-plane ACLs**. An operator who can read every OPC UA tag in production may not be allowed to edit cluster config; conversely a ConfigEditor may not have any data-plane grants at all. + +### Roles + +`src/ZB.MOM.WW.OtOpcUa.Admin/Services/AdminRoles.cs`: + +| Role | Capabilities | +|---|---| +| `ConfigViewer` | Read-only access to drafts, generations, audit log, fleet status. | +| `ConfigEditor` | ConfigViewer plus draft editing (UNS, equipment, tags, ACLs, driver instances, reservations, CSV imports). Cannot publish. | +| `FleetAdmin` | ConfigEditor plus publish, cluster/node CRUD, credential management, role-grant management. | + +Policies registered in Admin `Program.cs`: + +```csharp +builder.Services.AddAuthorizationBuilder() + .AddPolicy("CanEdit", p => p.RequireRole(AdminRoles.ConfigEditor, AdminRoles.FleetAdmin)) + .AddPolicy("CanPublish", p => p.RequireRole(AdminRoles.FleetAdmin)); +``` + +Razor pages and API endpoints gate with `[Authorize(Policy = "CanEdit")]` / `"CanPublish"`; nav-menu sections hide via `<AuthorizeView Policy="...">`. + +### Role grant source + +Admin reads `LdapGroupRoleMapping` rows from the Config DB (`src/ZB.MOM.WW.OtOpcUa.Configuration/Entities/LdapGroupRoleMapping.cs`) — the same pattern as the data-plane `NodeAcl` but scoped to Admin roles + (optionally) cluster scope for multi-site fleets. The `RoleGrants.razor` page lets FleetAdmins edit these mappings without leaving the UI.
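A hedged sketch of resolving the effective role from `LdapGroupRoleMapping`-style rows (the group names and the resolver itself are illustrative; the actual Admin implementation may differ):

```csharp
using System;
using System.Collections.Generic;

// Usage: a user in both viewer and editor groups resolves to the stronger role.
var mappings = new (string LdapGroup, string Role)[]
{
    ("OPCUA-Admin-Viewers", "ConfigViewer"),
    ("OPCUA-Admin-Editors", "ConfigEditor"),
    ("OPCUA-Fleet-Admins",  "FleetAdmin"),
};
var groups = new HashSet<string> { "OPCUA-Admin-Viewers", "OPCUA-Admin-Editors" };
Console.WriteLine(Roles.Resolve(mappings, groups)); // prints "ConfigEditor"

static class Roles
{
    // FleetAdmin is a superset of ConfigEditor, which is a superset of
    // ConfigViewer, so the strongest matching grant wins.
    static readonly Dictionary<string, int> Rank = new()
    {
        ["ConfigViewer"] = 0, ["ConfigEditor"] = 1, ["FleetAdmin"] = 2,
    };

    public static string? Resolve(
        IEnumerable<(string LdapGroup, string Role)> mappings,
        IReadOnlySet<string> userGroups)
    {
        string? best = null;
        foreach (var (group, role) in mappings)
            if (userGroups.Contains(group) && (best is null || Rank[role] > Rank[best]))
                best = role;
        return best; // null means no role grant, hence no Admin access
    }
}
```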
+ +--- + +## OTOPCUA0001 Analyzer — Compile-Time Guard + +Per-capability resilience (retry, timeout, circuit-breaker, bulkhead) is applied by `CapabilityInvoker` in `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/`. A driver-capability call made **outside** the invoker bypasses resilience entirely — which in production looks like inconsistent timeouts, un-wrapped retries, and unbounded blocking. + +`OTOPCUA0001` (Roslyn analyzer at `src/ZB.MOM.WW.OtOpcUa.Analyzers/UnwrappedCapabilityCallAnalyzer.cs`) fires as a compile-time **warning** when an `async`/`Task`-returning method on one of the seven guarded capability interfaces (`IReadable`, `IWritable`, `ITagDiscovery`, `ISubscribable`, `IHostConnectivityProbe`, `IAlarmSource`, `IHistoryProvider`) is invoked **outside** a lambda passed to `CapabilityInvoker.ExecuteAsync` / `ExecuteWriteAsync` / `AlarmSurfaceInvoker.*`. The analyzer walks up the syntax tree from the call site, finds any enclosing invoker invocation, and verifies the call lives transitively inside that invocation's anonymous-function argument — a sibling pattern (do the call, then invoke `ExecuteAsync` on something unrelated nearby) does not satisfy the rule. + +Five xUnit-v3 + Shouldly tests at `tests/ZB.MOM.WW.OtOpcUa.Analyzers.Tests` cover the common fail/pass shapes + the sibling-pattern regression guard. + +The rule is intentionally scoped to async surfaces — pure in-memory accessors like `IHostConnectivityProbe.GetHostStatuses()` return synchronously and do not require the invoker wrap. + +--- + +## Audit Logging + +- **Server**: Serilog `AUDIT:` prefix on every authentication success/failure, certificate validation result, write access denial. Written alongside the regular rolling file sink. +- **Admin**: `AuditLogService` writes `ConfigAuditLog` rows to the Config DB for every publish, rollback, cluster-node CRUD, credential rotation. Visible in the Audit page for operators with `ConfigViewer` or above. 
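The wrapped-versus-sibling distinction that `OTOPCUA0001` draws can be illustrated with stub types; everything below is a stand-in (the real `IReadable` and `CapabilityInvoker` live in the Core project, and the real invoker applies the resilience policies), only the call shape matters:

```csharp
using System;
using System.Threading.Tasks;

var invoker = new CapabilityInvoker();
double v = await new TagReader().ReadWrapped(invoker, new FakeDriver(), "Line1/Speed");
Console.WriteLine(v); // prints "42"

// Stub capability interface and a fake driver so the sketch is runnable.
interface IReadable { Task<double> ReadAsync(string tag); }

sealed class FakeDriver : IReadable
{
    public Task<double> ReadAsync(string tag) => Task.FromResult(42.0);
}

// Stub: the real invoker wraps the delegate in retry/timeout/circuit-breaker.
sealed class CapabilityInvoker
{
    public Task<T> ExecuteAsync<T>(Func<Task<T>> capabilityCall) => capabilityCall();
}

sealed class TagReader
{
    // Compliant: the capability call sits inside the lambda handed to
    // ExecuteAsync, so the invoker's policies wrap it; OTOPCUA0001 stays silent.
    public Task<double> ReadWrapped(CapabilityInvoker invoker, IReadable driver, string tag)
        => invoker.ExecuteAsync(() => driver.ReadAsync(tag));

    // Would fire OTOPCUA0001 if uncommented: the capability call is a sibling
    // of any invoker use, not inside its lambda, so no resilience wraps it.
    // public Task<double> ReadUnwrapped(IReadable driver, string tag)
    //     => driver.ReadAsync(tag);
}
```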
+ +--- + +## Troubleshooting + +### Certificate trust failure + +Check `{PkiStoreRoot}/rejected/` for the client's cert. Promote via Admin UI Certificates page, or copy the `.der` file manually to `trusted/`. + +### LDAP users can connect but fail authorization + +Verify (a) `OpcUaServer:Ldap:GroupAttribute` returns groups in the form `CN=MyGroup,…` (OtOpcUa strips the `CN=` for matching), (b) a `NodeAcl` grant exists at any level of the node's UNS path that unions to the required permission, (c) `Authorization:StrictMode` is correctly set for the deployment stage. + +### LDAP bind rejected as "insecure" + +Set `UseTls = true` + `Port = 636`, or temporarily flip `AllowInsecureLdap = true` in dev. Production Active Directory increasingly refuses plain-LDAP bind under LDAP-signing enforcement. + +### `AuthorizationGate` denies every call after a publish + +`AclChangeNotifier` invalidates the `PermissionTrieCache` on publish; a stuck cache is usually a missed notification. Restart the Server as a quick mitigation and file a bug — the design is to stay fresh without restarts.
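For the `Authorization:StrictMode` check above, a plausible Server bootstrap fragment (the key path is taken from this document; the surrounding JSON structure is assumed):

```json
{
  "Authorization": {
    "StrictMode": true
  }
}
```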