lmxopcua/service_info.md

Service Update Summary

Updated service instance: C:\publish\lmxopcua\instance1

Update time: 2026-03-25 12:54-12:55 America/New_York

Backup created before deploy: C:\publish\lmxopcua\backups\20260325-125444

Configuration preserved:

  • C:\publish\lmxopcua\instance1\appsettings.json was not overwritten.

Deployed binary:

  • C:\publish\lmxopcua\instance1\ZB.MOM.WW.LmxOpcUa.Host.exe
  • Last write time: 2026-03-25 12:53:58
  • Size: 143360 bytes

Windows service:

  • Name: LmxOpcUa
  • Display name: LMX OPC UA Server
  • Account: LocalSystem
  • Status after update: Running
  • Process ID after restart: 29236

Restart evidence:

  • Service log file: C:\publish\lmxopcua\instance1\logs\lmxopcua-20260325_004.log
  • Last startup line: 2026-03-25 12:55:08.619 -04:00 [INF] The LmxOpcUa service was started.

CLI Verification

Endpoint from deployed config:

  • opc.tcp://localhost:4840/LmxOpcUa

CLI used:

  • C:\Users\dohertj2\Desktop\lmxopcua\tools\opcuacli-dotnet\bin\Debug\net10.0\opcuacli-dotnet.exe

Commands run:

opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe read -u opc.tcp://localhost:4840/LmxOpcUa -n 'ns=1;s=MESReceiver_001.MoveInPartNumbers'
opcuacli-dotnet.exe read -u opc.tcp://localhost:4840/LmxOpcUa -n 'ns=1;s=MESReceiver_001.MoveInPartNumbers[]'

Observed results:

  • connect: succeeded, server reported as LmxOpcUa.
  • read ns=1;s=MESReceiver_001.MoveInPartNumbers: succeeded with good status 0x00000000.
  • read ns=1;s=MESReceiver_001.MoveInPartNumbers[]: failed with BadNodeIdUnknown (0x80340000).

Instance 2 (Redundant Secondary)

Deployed: 2026-03-28

Deployment path: C:\publish\lmxopcua\instance2

Configuration:

  • OpcUa.Port: 4841
  • OpcUa.ServerName: LmxOpcUa2
  • OpcUa.ApplicationUri: urn:localhost:LmxOpcUa:instance2
  • Dashboard.Port: 8082
  • MxAccess.ClientName: LmxOpcUa2
  • Redundancy.Enabled: true
  • Redundancy.Mode: Warm
  • Redundancy.Role: Secondary
  • Redundancy.ServerUris: ["urn:localhost:LmxOpcUa:instance1", "urn:localhost:LmxOpcUa:instance2"]

Windows service:

  • Name: LmxOpcUa2
  • Display name: LMX OPC UA Server (Instance 2)
  • Account: LocalSystem
  • Endpoint: opc.tcp://localhost:4841/LmxOpcUa

Instance 1 redundancy update (same date):

  • OpcUa.ApplicationUri: urn:localhost:LmxOpcUa:instance1
  • Redundancy.Enabled: true
  • Redundancy.Mode: Warm
  • Redundancy.Role: Primary
  • Redundancy.ServerUris: ["urn:localhost:LmxOpcUa:instance1", "urn:localhost:LmxOpcUa:instance2"]

CLI verification:

opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
  → Redundancy Mode: Warm, Service Level: 200, Application URI: urn:localhost:LmxOpcUa:instance1

opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
  → Redundancy Mode: Warm, Service Level: 150, Application URI: urn:localhost:LmxOpcUa:instance2

Both instances report the same ServerUriArray and expose the same Galaxy namespace (urn:ZB:LmxOpcUa).

LDAP Authentication Update

Updated: 2026-03-28

Both instances updated to use LDAP authentication via GLAuth.

Configuration changes (both instances):

  • Authentication.AllowAnonymous: true (anonymous can browse/read)
  • Authentication.AnonymousCanWrite: false (anonymous writes blocked)
  • Authentication.Ldap.Enabled: true
  • Authentication.Ldap.Host: localhost
  • Authentication.Ldap.Port: 3893
  • Authentication.Ldap.BaseDN: dc=lmxopcua,dc=local

LDAP server: GLAuth v2.4.0 at C:\publish\glauth\ (Windows service: GLAuth)

Permission verification (instance1, port 4840):

anonymous read    → allowed
anonymous write   → denied  (BadUserAccessDenied)
readonly  read    → allowed
readonly  write   → denied  (BadUserAccessDenied)
readwrite write   → allowed
admin     write   → allowed
alarmack  write   → denied  (BadUserAccessDenied)
bad password      → denied  (connection rejected)

Alarm Notifier Chain Update

Updated: 2026-03-28

Both instances updated with alarm event propagation up the notifier chain.

Code changes:

  • Alarm events now walk up the parent chain (ReportEventUpNotifierChain), reporting to every ancestor node
  • EventNotifier = SubscribeToEvents is set on all ancestors of alarm-containing nodes (EnableEventNotifierUpChain)
  • Removed separate Server.ReportEvent call (no longer needed — the walk reaches the root)

No configuration changes required — alarm tracking was already enabled (AlarmTrackingEnabled: true).
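The up-chain walk can be sketched in a few lines of Python (a minimal illustration, not the actual C# implementation; the `Node`/`report_event` names are hypothetical stand-ins for the OPC UA SDK's node and notifier types):

```python
# Sketch of ReportEventUpNotifierChain: report an alarm event to the source
# node and every ancestor, so clients subscribed at any level receive it.
# Node/parent/report_event names are illustrative, not the SDK's actual API.

class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.received = []          # events delivered to this node's notifier

    def report_event(self, event):
        self.received.append(event)

def report_event_up_notifier_chain(source, event):
    """Walk the parent chain, reporting to every ancestor including the root."""
    node = source
    while node is not None:
        node.report_event(event)
        node = node.parent

root = Node("Server")
area = Node("DEV", parent=root)
machine = Node("TestMachine_001", parent=area)

report_event_up_notifier_chain(machine, "TestAlarm001")
# Every level receives the event, which is why the separate
# Server.ReportEvent call became redundant.
```

Because the walk terminates at the root, events naturally surface at the DEV (grandparent) level, matching the verification below.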

Verification (instance1, port 4840):

alarms --node TestArea --refresh:
  TestMachine_001.TestAlarm001 → visible (Severity=500, Retain=True)
  TestMachine_001.TestAlarm002 → visible (Severity=500, Retain=True)
  TestMachine_001.TestAlarm003 → visible (Severity=500, Retain=True)
  TestMachine_002.TestAlarm001 → visible (Severity=500, Retain=True)
  TestMachine_002.TestAlarm003 → visible (Severity=500, Retain=True)

alarms --node DEV --refresh:
  Same 5 alarms visible at DEV (grandparent) level

Auth Consolidation Update

Updated: 2026-03-28

Both instances updated to consolidate LDAP roles into OPC UA session roles (RoleBasedIdentity.GrantedRoleIds).

Code changes:

  • LDAP groups now map to custom OPC UA role NodeIds in urn:zbmom:lmxopcua:roles namespace
  • Roles stored on session identity via GrantedRoleIds — no username-to-role side cache
  • Permission checks use GrantedRoleIds.Contains() instead of username extraction
  • AnonymousCanWrite behavior is consistent regardless of LDAP state
  • Galaxy namespace moved from ns=2 to ns=3 (roles namespace is ns=2)
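The consolidation pattern can be sketched as follows (hypothetical group names and role-id strings; the real code stores OPC UA NodeIds on RoleBasedIdentity.GrantedRoleIds):

```python
# Sketch: map LDAP groups to session role ids once at activation, then
# check permissions via set membership instead of re-deriving roles from
# the username. Group names and role ids here are illustrative only.

ROLE_NAMESPACE = "urn:zbmom:lmxopcua:roles"

GROUP_TO_ROLE = {
    "ReadOnly":  f"{ROLE_NAMESPACE}#ReadOnly",
    "ReadWrite": f"{ROLE_NAMESPACE}#ReadWrite",
    "Admin":     f"{ROLE_NAMESPACE}#Admin",
}

def granted_role_ids(ldap_groups):
    """Resolve LDAP groups to role ids at session activation (no side cache)."""
    return {GROUP_TO_ROLE[g] for g in ldap_groups if g in GROUP_TO_ROLE}

def can_write(session_roles):
    # Permission check is pure set membership, mirroring GrantedRoleIds.Contains().
    return (f"{ROLE_NAMESPACE}#ReadWrite" in session_roles
            or f"{ROLE_NAMESPACE}#Admin" in session_roles)

print(can_write(granted_role_ids(["ReadOnly"])))  # False
print(can_write(granted_role_ids(["Admin"])))     # True
```

Keeping the roles on the session identity means anonymous sessions simply have an empty role set, which is why AnonymousCanWrite behaves consistently regardless of LDAP state.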

No configuration changes required.

Verification (instance1, port 4840):

anonymous read      → allowed
anonymous write     → denied  (BadUserAccessDenied, AnonymousCanWrite=false)
readonly  write     → denied  (BadUserAccessDenied)
readwrite write     → allowed
admin     write     → allowed
alarmack  write     → denied  (BadUserAccessDenied)
bad password        → rejected (connection failed)

Granular Write Roles Update

Updated: 2026-03-28

Both instances updated with granular write roles replacing the single ReadWrite role.

Code changes:

  • ReadWrite role replaced by WriteOperate, WriteTune, WriteConfigure
  • Write permission checks now consider the Galaxy security classification of the target attribute
  • SecurityClassification stored in TagMetadata for per-node lookup at write time

GLAuth changes:

  • New groups: WriteOperate (5502), WriteTune (5504), WriteConfigure (5505)
  • New users: writeop, writetune, writeconfig
  • admin user added to all groups (5502, 5503, 5504, 5505)

Config changes (both instances):

  • Authentication.Ldap.ReadWriteGroup replaced by WriteOperateGroup, WriteTuneGroup, WriteConfigureGroup

Verification (instance1, port 4840, Operate-classified attributes):

anonymous read        → allowed
anonymous write       → denied  (AnonymousCanWrite=false)
readonly  write       → denied  (no write role)
writeop   write       → allowed (WriteOperate matches Operate classification)
writetune write       → denied  (WriteTune doesn't match Operate)
writeconfig write     → denied  (WriteConfigure doesn't match Operate)
admin     write       → allowed (has all write roles)
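The classification-aware check behind that table can be sketched like this (a simplified illustration; the mapping values and function names are hypothetical, not the deployed TagMetadata lookup):

```python
# Sketch: a write succeeds only when the granted write role matches the
# target attribute's Galaxy security classification. Mapping values are
# illustrative stand-ins for the real classification metadata.

CLASSIFICATION_TO_ROLE = {
    "Operate":   "WriteOperate",
    "Tune":      "WriteTune",
    "Configure": "WriteConfigure",
}

def write_allowed(granted_roles, classification):
    """Check the per-node classification (stored in TagMetadata) at write time."""
    required = CLASSIFICATION_TO_ROLE.get(classification)
    return required is not None and required in granted_roles

admin_roles = set(CLASSIFICATION_TO_ROLE.values())   # admin holds all write roles

print(write_allowed({"WriteOperate"}, "Operate"))    # True
print(write_allowed({"WriteTune"}, "Operate"))       # False
print(write_allowed(admin_roles, "Configure"))       # True
```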

Historian SDK Migration

Updated: 2026-04-06

Both instances updated to use the Wonderware Historian SDK (aahClientManaged.dll) instead of direct SQL queries for historical data access.

Code changes:

  • HistorianDataSource rewritten from SqlConnection/SqlDataReader to ArchestrA.HistorianAccess SDK
  • Persistent connection with lazy connect and auto-reconnect on failure
  • HistorianConfiguration.ConnectionString replaced with ServerName, IntegratedSecurity, UserName, Password, Port
  • HistorianDataSource now implements IDisposable, disposed on service shutdown
  • ConfigurationValidator validates Historian SDK settings at startup

SDK DLLs deployed to both instances:

  • aahClientManaged.dll (primary SDK, v2.0.0.0)
  • aahClient.dll, aahClientCommon.dll (dependencies)
  • Historian.CBE.dll, Historian.DPAPI.dll, ArchestrA.CloudHistorian.Contract.dll

Configuration changes (both instances):

  • Historian.ConnectionString removed
  • Historian.ServerName: "localhost"
  • Historian.IntegratedSecurity: true
  • Historian.Port: 32568
  • Historian.Enabled: true (unchanged)

Verification (instance1 startup log):

Historian.Enabled=true, ServerName=localhost, IntegratedSecurity=true, Port=32568
Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000
=== Configuration Valid ===
LmxOpcUa service started successfully

HistoryServerCapabilities and Continuation Points

Updated: 2026-04-06

Both instances updated with OPC UA Part 11 spec compliance improvements.

Code changes:

  • HistoryServerCapabilities node populated under ServerCapabilities with all boolean capability properties
  • AggregateFunctions folder populated with references to 7 supported aggregate functions
  • HistoryContinuationPointManager added — stores remaining data when results exceed NumValuesPerNode
  • HistoryReadRawModified and HistoryReadProcessed now return ContinuationPoint in HistoryReadResult for partial reads
  • Follow-up requests with ContinuationPoint resume from stored state; invalid/expired points return BadContinuationPointInvalid

No configuration changes required.
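The continuation-point mechanism can be sketched as a small store keyed by an opaque token (a minimal illustration under assumed shapes; the real HistoryContinuationPointManager works on OPC UA HistoryReadResult structures):

```python
import uuid

# Sketch of a history continuation-point store: when a read produces more
# values than NumValuesPerNode, the remainder is parked under an opaque
# token the client presents on the follow-up call. Names are illustrative.

class ContinuationPointStore:
    def __init__(self):
        self._points = {}

    def paginate(self, values, max_per_read):
        """Return (page, continuation_point); point is None when complete."""
        page, rest = values[:max_per_read], values[max_per_read:]
        if not rest:
            return page, None
        token = uuid.uuid4().bytes
        self._points[token] = rest
        return page, token

    def resume(self, token, max_per_read):
        """Resume from stored state; unknown/expired tokens are rejected."""
        rest = self._points.pop(token, None)
        if rest is None:
            raise LookupError("BadContinuationPointInvalid")
        return self.paginate(rest, max_per_read)

store = ContinuationPointStore()
page, cp = store.paginate(list(range(7)), 3)    # page=[0, 1, 2], cp is a token
page2, cp2 = store.resume(cp, 3)                # page2=[3, 4, 5], one value left
```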

Verification (instance1 startup log):

HistoryServerCapabilities configured with 7 aggregate functions
LmxOpcUa service started successfully

Remaining Historian Gaps Fix

Updated: 2026-04-06

Both instances updated with remaining OPC UA Part 11 spec compliance fixes.

Code changes:

  • Gap 4: HistoryReadRawModified returns BadHistoryOperationUnsupported when IsReadModified=true
  • Gap 5: HistoryReadAtTime override added with ReadAtTimeAsync using SDK HistorianRetrievalMode.Interpolated
  • Gap 8: HistoricalDataConfigurationState child nodes added to historized variables (Stepped=false, Definition="Wonderware Historian")
  • Gap 10: ReturnBounds parameter handled — boundary DataValue entries with BadBoundNotFound inserted at StartTime/EndTime
  • Gap 11: StandardDeviation aggregate added to client enum, mapper, CLI (aliases: stddev/stdev), and UI dropdown

No configuration changes required.

Historical Event Access

Updated: 2026-04-06

Both instances updated with OPC UA historical event access (Gap 7).

Code changes:

  • HistorianDataSource.ReadEventsAsync queries Historian event store via separate HistorianConnectionType.Event connection
  • LmxNodeManager.HistoryReadEvents override maps HistorianEvent records to OPC UA HistoryEventFieldList entries
  • AccessHistoryEventsCapability set to true when AlarmTrackingEnabled is true
  • Event fields: EventId, EventType, SourceNode, SourceName, Time, ReceiveTime, Message, Severity

No configuration changes required. All historian gaps (1-11) are now resolved.

Data Access Gaps Fix

Updated: 2026-04-06

Both instances updated with OPC UA DA spec compliance fixes.

Code changes:

  • ConfigureServerCapabilities() populates ServerCapabilities node: ServerProfileArray, LocaleIdArray, MinSupportedSampleRate, continuation point limits, array/string limits, and 12 OperationLimits values
  • Server_ServerDiagnostics_EnabledFlag set to true — SDK auto-tracks session/subscription counts
  • OnModifyMonitoredItemsComplete override logs monitored item modifications

No configuration changes required. All DA gaps (1-8) resolved.

Alarms & Conditions Gaps Fix

Updated: 2026-04-06

Both instances updated with OPC UA Part 9 alarm spec compliance fixes.

Code changes:

  • Wired OnConfirm, OnAddComment, OnEnableDisable, OnShelve, OnTimedUnshelve handlers on each AlarmConditionState
  • Shelving: SetShelvingState() manages TimedShelve, OneShotShelve, Unshelve state machine
  • ReportAlarmEvent now populates LocalTime (timezone offset + DST) and Quality event fields
  • Flaky Monitor_ProbeDataChange_PreventsStaleReconnect test fixed (increased stale threshold from 2s to 5s)

No configuration changes required. All A&C gaps (1-10) resolved.

Security Gaps Fix

Updated: 2026-04-06

Both instances updated with OPC UA Part 2/4/7 security spec compliance fixes.

Code changes:

  • SecurityProfileResolver: Added 4 modern AES profiles (Aes128_Sha256_RsaOaep-Sign/SignAndEncrypt, Aes256_Sha256_RsaPss-Sign/SignAndEncrypt)
  • OnImpersonateUser: Added X509IdentityToken handling with CN extraction and role assignment
  • BuildUserTokenPolicies: Advertises UserTokenType.Certificate when non-None security profiles are configured
  • OnCertificateValidation: Enhanced logging with certificate thumbprint, subject, and expiry
  • Authentication audit logging: AUDIT: prefixed log entries for success/failure with session ID and roles

No configuration changes required. All security gaps (1-10) resolved.

Historian Plugin Runtime Load + Dashboard Health

Updated: 2026-04-12 18:47-18:49 America/New_York

Both instances updated to the latest build, bringing in the runtime-loaded Historian plugin (Historian/ subfolder next to the Host) and a status dashboard health surface covering the historian plugin and alarm-tracking misconfiguration.

Backups created before deploy:

  • C:\publish\lmxopcua\backups\20260412-184713-instance1
  • C:\publish\lmxopcua\backups\20260412-184713-instance2

Configuration preserved:

  • C:\publish\lmxopcua\instance1\appsettings.json was not overwritten.
  • C:\publish\lmxopcua\instance2\appsettings.json was not overwritten.

Layout change:

  • Flat historian interop DLLs removed from each instance root (aahClient*.dll, ArchestrA.CloudHistorian.Contract.dll, Historian.CBE.dll, Historian.DPAPI.dll).
  • Historian plugin + interop DLLs now live under <instance>\Historian\ (including ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll), loaded by HistorianPluginLoader.

Deployed binary (both instances):

  • ZB.MOM.WW.LmxOpcUa.Host.exe
  • Last write time: 2026-04-12 18:46:22 -04:00
  • Size: 7938048 bytes

Windows services:

  • LmxOpcUa — Running, PID 40176
  • LmxOpcUa2 — Running, PID 34400

Restart evidence (instance1 logs/lmxopcua-20260412.log):

2026-04-12 18:48:02.968 -04:00 [INF] Historian.Enabled=true, ServerName=localhost, IntegratedSecurity=true, Port=32568
2026-04-12 18:48:02.971 -04:00 [INF] === Configuration Valid ===
2026-04-12 18:48:09.658 -04:00 [INF] Historian plugin loaded from C:\publish\lmxopcua\instance1\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
2026-04-12 18:48:13.691 -04:00 [INF] LmxOpcUa service started successfully

Restart evidence (instance2 logs/lmxopcua-20260412.log):

2026-04-12 18:49:08.152 -04:00 [INF] Historian.Enabled=true, ServerName=localhost, IntegratedSecurity=true, Port=32568
2026-04-12 18:49:08.155 -04:00 [INF] === Configuration Valid ===
2026-04-12 18:49:14.744 -04:00 [INF] Historian plugin loaded from C:\publish\lmxopcua\instance2\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
2026-04-12 18:49:18.777 -04:00 [INF] LmxOpcUa service started successfully

CLI verification (via dotnet run --project src/ZB.MOM.WW.LmxOpcUa.Client.CLI):

connect    opc.tcp://localhost:4840/LmxOpcUa → Server: LmxOpcUa
connect    opc.tcp://localhost:4841/LmxOpcUa → Server: LmxOpcUa2
redundancy opc.tcp://localhost:4840/LmxOpcUa → Warm, ServiceLevel=200, urn:localhost:LmxOpcUa:instance1
redundancy opc.tcp://localhost:4841/LmxOpcUa → Warm, ServiceLevel=150, urn:localhost:LmxOpcUa:instance2

Both instances report the same ServerUriArray and the primary advertises the higher ServiceLevel, matching the prior redundancy baseline.

Endpoints Panel on Dashboard

Updated: 2026-04-13 08:46-08:50 America/New_York

Both instances updated with a new Endpoints panel on the status dashboard, surfacing the opc.tcp base addresses, the active OPC UA security profiles (mode, policy name, and full URI), and the user token policies.

Code changes:

  • StatusData.cs — added EndpointsInfo / SecurityProfileInfo DTOs on StatusData.
  • OpcUaServerHost.cs — added BaseAddresses, SecurityPolicies, UserTokenPolicies runtime accessors reading ApplicationConfiguration.ServerConfiguration live state.
  • StatusReportService.cs — builds EndpointsInfo from the host and renders a new panel with a graceful empty state when the server is not started.

No configuration changes required.

Verification (instance1 @ http://localhost:8085/):

Base Addresses: opc.tcp://localhost:4840/LmxOpcUa
Security Profiles: None / None / http://opcfoundation.org/UA/SecurityPolicy#None
User Token Policies: Anonymous, UserName

Verification (instance2 @ http://localhost:8086/):

Base Addresses: opc.tcp://localhost:4841/LmxOpcUa
Security Profiles: None / None / http://opcfoundation.org/UA/SecurityPolicy#None
User Token Policies: Anonymous, UserName

Template-Based Alarm Object Filter

Updated: 2026-04-13 09:39-09:43 America/New_York

Both instances updated with a new configurable alarm object filter. When OpcUa.AlarmFilter.ObjectFilters is non-empty, only Galaxy objects whose template derivation chain matches a pattern (and their containment-tree descendants) contribute AlarmConditionState nodes. When the list is empty, the current unfiltered behavior is preserved (backward-compatible default).

Backups created before deploy:

  • C:\publish\lmxopcua\backups\20260413-093900-instance1
  • C:\publish\lmxopcua\backups\20260413-093900-instance2

Deployed binary (both instances):

  • ZB.MOM.WW.LmxOpcUa.Host.exe
  • Last write time: 2026-04-13 09:38:46 -04:00
  • Size: 7951360 bytes

Windows services:

  • LmxOpcUa — Running, PID 40900
  • LmxOpcUa2 — Running, PID 29936

Code changes:

  • gr/queries/hierarchy.sql — added recursive CTE on gobject.derived_from_gobject_id and a new template_chain column (pipe-delimited, innermost template first).
  • Domain/GalaxyObjectInfo.cs — added TemplateChain: List<string> populated from the new SQL column.
  • GalaxyRepositoryService.cs — reads the new column and splits into TemplateChain.
  • Configuration/AlarmFilterConfiguration.cs (new) — List<string> ObjectFilters; entries may themselves be comma-separated. Attached to OpcUaConfiguration.AlarmFilter.
  • Configuration/ConfigurationValidator.cs — logs the effective filter and warns if patterns are configured while AlarmTrackingEnabled == false.
  • Domain/AlarmObjectFilter.cs (new) — compiles wildcard patterns (* only) to case-insensitive regexes with Galaxy $ prefix normalized on both sides; walks the hierarchy top-down with cycle defense; returns a HashSet<int> of included gobject IDs plus UnmatchedPatterns for startup warnings.
  • OpcUa/LmxNodeManager.cs — constructor accepts the filter; the two alarm-creation loops (BuildAddressSpace full build and the subtree rebuild path) both call ResolveAlarmFilterIncludedIds(sorted) and skip any object not in the resolved set. New public properties expose filter state to the dashboard: AlarmFilterEnabled, AlarmFilterPatternCount, AlarmFilterIncludedObjectCount.
  • OpcUa/OpcUaServerHost.cs, OpcUa/LmxOpcUaServer.cs, OpcUaService.cs, OpcUaServiceBuilder.cs — plumbing to construct and thread the filter from appsettings.json down to the node manager.
  • Status/StatusData.cs + Status/StatusReportService.cs — AlarmStatusInfo gains FilterEnabled, FilterPatternCount, FilterIncludedObjectCount; a filter summary line renders in the Alarms panel when the filter is active.

Tests:

  • 36 new unit tests in tests/.../Domain/AlarmObjectFilterTests.cs covering pattern parsing, wildcard semantics, regex escaping, Galaxy $ normalization, template-chain matching, subtree propagation, set semantics, orphan/cycle defense, and UnmatchedPatterns tracking.
  • 5 new integration tests in tests/.../Integration/AlarmObjectFilterIntegrationTests.cs spinning up a real LmxNodeManager via OpcUaServerFixture and asserting AlarmConditionCount/AlarmFilterIncludedObjectCount under various filters.
  • 1 new Status test verifying JSON exposes the filter counters.
  • Full suite: 446/446 tests passing (no regressions).

Configuration change: both instances have OpcUa.AlarmFilter.ObjectFilters: [] (filter disabled, unfiltered alarm tracking preserved).

Live verification against instance1 Galaxy (filter temporarily set to "TestMachine"):

2026-04-13 09:41:31 [INF] OpcUa.AlarmTrackingEnabled=true, AlarmFilter.ObjectFilters=[TestMachine]
2026-04-13 09:41:42 [INF] Alarm filter: 42 of 49 objects included (1 pattern(s))
Dashboard Alarms panel: Tracking: True | Conditions: 60 | Active: 4
                        Filter: 1 pattern(s), 42 object(s) included

Final configuration restored to empty filter. Dashboard confirms unfiltered behavior on both endpoints:

instance1 @ http://localhost:8085/ → Conditions: 60 | Active: 4 (no filter line)
instance2 @ http://localhost:8086/ → Conditions: 60 | Active: 4 (no filter line)

Filter syntax quick reference (documented in AlarmFilterConfiguration.cs XML-doc):

  • * is the only wildcard (glob-style; zero or more characters).
  • Matching is case-insensitive and ignores the Galaxy leading $ template prefix on both the pattern and the stored chain entry, so operators write TestMachine* not $TestMachine*.
  • Each entry may contain comma-separated patterns for convenience (e.g., "TestMachine*, Pump_*").
  • Empty list → filter disabled → current unfiltered behavior.
  • Match semantics: an object is included when any template in its derivation chain matches any pattern, and the inclusion propagates to all descendants in the containment hierarchy. Each object is evaluated once regardless of how many patterns or ancestors match.
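The match semantics above can be sketched in Python (a minimal illustration; the data shapes, function names, and the three-object hierarchy are hypothetical, not the deployed AlarmObjectFilter):

```python
import re

# Sketch of the template-based alarm filter: '*'-only wildcards compiled
# to case-insensitive regexes, the Galaxy '$' prefix normalized on both
# sides, and inclusion propagated to containment-tree descendants.

def compile_pattern(pattern):
    pattern = pattern.lstrip("$")
    body = ".*".join(map(re.escape, pattern.split("*")))
    return re.compile("^" + body + "$", re.IGNORECASE)

def resolve_included_ids(objects, patterns):
    """objects: id -> (template_chain, parent_id). Returns included id set."""
    regexes = [compile_pattern(p) for p in patterns]
    included = set()
    for oid, (chain, _parent) in objects.items():
        if any(rx.match(t.lstrip("$")) for t in chain for rx in regexes):
            included.add(oid)
    # Propagate inclusion down the containment hierarchy until stable.
    changed = True
    while changed:
        changed = False
        for oid, (_chain, parent) in objects.items():
            if parent in included and oid not in included:
                included.add(oid)
                changed = True
    return included

objects = {
    1: (["$TestMachine", "$UserDefined"], None),  # matched by template chain
    2: (["$Valve"], 1),                           # included as descendant of 1
    3: (["$Pump"], None),                         # excluded
}
print(sorted(resolve_included_ids(objects, ["TestMachine*"])))  # [1, 2]
```

An empty pattern list yields an empty regex list and therefore no matched objects, which in the real filter is treated as "filter disabled" rather than "include nothing".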

Historian Runtime Health Surface

Updated: 2026-04-13 10:44-10:52 America/New_York

Both instances updated with runtime historian query instrumentation so the status dashboard can detect silent query degradation that the load-time PluginStatus cannot catch.

Backups:

  • C:\publish\lmxopcua\backups\20260413-104406-instance1
  • C:\publish\lmxopcua\backups\20260413-104406-instance2

Code changes:

  • Host/Historian/HistorianHealthSnapshot.cs (new) — DTO with TotalQueries, TotalSuccesses, TotalFailures, ConsecutiveFailures, LastSuccessTime, LastFailureTime, LastError, ProcessConnectionOpen, EventConnectionOpen.
  • Host/Historian/IHistorianDataSource.cs — added GetHealthSnapshot() interface method.
  • Historian.Aveva/HistorianDataSource.cs — added _healthLock-guarded counters, RecordSuccess() / RecordFailure(path) helpers called at every terminal site in all four read methods (raw, aggregate, at-time, events). Error messages carry a raw: / aggregate: / at-time: / events: prefix so operators can tell which SDK call is broken.
  • Host/OpcUa/LmxNodeManager.cs — exposes HistorianHealth property that proxies to IHistorianDataSource.GetHealthSnapshot().
  • Host/Status/StatusData.cs — added 9 new fields on HistorianStatusInfo.
  • Host/Status/StatusReportService.cs — BuildHistorianStatusInfo() populates the new fields from the node manager; panel color gradient: green → yellow (1-4 consecutive failures) → red (≥5 consecutive or plugin unloaded). Renders Queries: N (Success: X, Failure: Y) | Consecutive Failures: Z, Process Conn: open/closed | Event Conn: open/closed, plus Last Success: / Last Failure: / Last Error: lines when applicable.
  • Host/Status/HealthCheckService.cs — new Rule 2b2: Degraded when ConsecutiveFailures >= 3. Threshold chosen to avoid flagging single transient blips.
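The gradient and degradation thresholds described above reduce to two small functions (a sketch using the thresholds from the text; the function names are illustrative):

```python
# Sketch of the dashboard gradient and Rule 2b2 threshold described above.
# Function names are hypothetical; thresholds come from the deployment notes.

def panel_color(plugin_loaded, consecutive_failures):
    if not plugin_loaded or consecutive_failures >= 5:
        return "red"
    if consecutive_failures >= 1:
        return "yellow"
    return "green"

def is_degraded(consecutive_failures):
    # Rule 2b2: degrade at 3+ consecutive failures so a single transient
    # blip does not flag the service.
    return consecutive_failures >= 3

print(panel_color(True, 0))             # green
print(panel_color(True, 4))             # yellow
print(panel_color(True, 5))             # red
print(is_degraded(2), is_degraded(3))   # False True
```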

Tests:

  • 5 new unit tests in HistorianDataSourceLifecycleTests covering fresh zero-state, single failure, multi-failure consecutive increment, cross-read-path counting, and error-message-carries-path.
  • Full suite: 16/16 plugin tests, 447/447 host tests passing.

Live verification on instance1:

Before any query:
  Queries: 0 (Success: 0, Failure: 0) | Process Conn: closed | Event Conn: closed
After TestMachine_001.TestHistoryValue raw read:
  Queries: 1 (Success: 1, Failure: 0) | Process Conn: open
  Last Success: 2026-04-13T14:45:18Z
After aggregate hourly-average over 24h:
  Queries: 2 (Success: 2, Failure: 0)
After historyread against an unknown node id (bad tag):
  Queries: 2 (counter unchanged — rejected at node-lookup before reaching the plugin; correct)

JSON endpoint /api/status carries all 9 new fields with correct types. Both instances deployed; instance1 LmxOpcUa PID 33824, instance2 LmxOpcUa2 PID 30200.

Historian Read-Only Cluster Support

Updated: 2026-04-13 11:25-12:00 America/New_York

Both instances updated with Wonderware Historian read-only cluster failover. Operators can supply an ordered list of historian cluster nodes; the plugin iterates them on each fresh connect and benches failed nodes for a configurable cooldown window. Single-node deployments are preserved via the existing ServerName field.

Backups:

  • C:\publish\lmxopcua\backups\20260413-112519-instance1
  • C:\publish\lmxopcua\backups\20260413-112519-instance2

Code changes:

  • Host/Configuration/HistorianConfiguration.cs — added ServerNames: List<string> (defaults to []) and FailureCooldownSeconds: int (defaults to 60). ServerName preserved as fallback when ServerNames is empty.
  • Host/Historian/HistorianClusterNodeState.cs (new) — per-node DTO: Name, IsHealthy, CooldownUntil, FailureCount, LastError, LastFailureTime.
  • Host/Historian/HistorianHealthSnapshot.cs — extended with ActiveProcessNode, ActiveEventNode, NodeCount, HealthyNodeCount, Nodes: List<HistorianClusterNodeState>.
  • Historian.Aveva/HistorianClusterEndpointPicker.cs (new, internal) — pure picker with injected clock, thread-safe via lock, BFS-style GetHealthyNodes() / MarkFailed() / MarkHealthy() / SnapshotNodeStates(). Nodes iterate in configuration order; failed nodes skip until cooldown elapses; the cumulative FailureCount and LastError are retained across recovery for operator diagnostics.
  • Historian.Aveva/HistorianDataSource.cs — new ConnectToAnyHealthyNode(type) method iterates picker candidates, clones HistorianConfiguration per attempt with the candidate as ServerName, and returns the first successful (Connection, Node) tuple. EnsureConnected and EnsureEventConnected both call it. HandleConnectionError and HandleEventConnectionError now mark the active node failed in the picker before nulling. _activeProcessNode / _activeEventNode track the live node for the dashboard. Both silos (process + event) share a single picker instance so a node failure on one immediately benches it for the other.
  • Host/Status/StatusData.cs — added NodeCount, HealthyNodeCount, ActiveProcessNode, ActiveEventNode, Nodes to HistorianStatusInfo.
  • Host/Status/StatusReportService.cs — Historian panel renders Process Conn: open (<node>) badges and a cluster table (when NodeCount > 1) showing each node's state, cooldown expiry, failure count, and last error. Single-node deployments render a compact Node: <hostname> line.
  • Host/Status/HealthCheckService.cs — new Rule 2b3: Degraded when NodeCount > 1 && HealthyNodeCount < NodeCount. Lets operators alert on a partially-failed cluster even while queries are still succeeding via the remaining nodes.
  • Host/Configuration/ConfigurationValidator.cs — logs the effective node list and FailureCooldownSeconds at startup, validates that FailureCooldownSeconds >= 0, warns when ServerName is set alongside a non-empty ServerNames.
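The picker's cooldown behavior can be sketched with an injected clock (a simplified stand-in for HistorianClusterEndpointPicker; class and method names are illustrative):

```python
import time

# Sketch of the cluster endpoint picker: nodes iterate in configuration
# order, failed nodes are benched until a cooldown elapses, and cumulative
# failure counts survive recovery for operator diagnostics. The clock is
# injected so tests can advance time deterministically.

class EndpointPicker:
    def __init__(self, nodes, cooldown_seconds, clock=time.monotonic):
        self._clock = clock
        self._cooldown = cooldown_seconds
        self._order = list(nodes)
        self._state = {n: {"cooldown_until": 0.0, "failures": 0} for n in nodes}

    def healthy_nodes(self):
        """Candidates in configuration order, skipping benched nodes."""
        now = self._clock()
        return [n for n in self._order
                if self._state[n]["cooldown_until"] <= now]

    def mark_failed(self, node):
        s = self._state[node]
        s["failures"] += 1
        s["cooldown_until"] = self._clock() + self._cooldown

    def mark_healthy(self, node):
        # Failure count is retained as history; only the bench is cleared.
        self._state[node]["cooldown_until"] = 0.0

t = [0.0]
picker = EndpointPicker(["historian-a", "historian-b"], 60, clock=lambda: t[0])
picker.mark_failed("historian-a")
print(picker.healthy_nodes())   # ['historian-b']
t[0] = 61.0                     # cooldown elapsed: node is re-admitted
print(picker.healthy_nodes())   # ['historian-a', 'historian-b']
```

Sharing one picker instance between the process and event silos, as the deployment does, means a failure observed by either connection type immediately benches the node for both.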

Tests:

  • HistorianClusterEndpointPickerTests.cs — 19 unit tests covering config parsing, ordered iteration, cooldown expiry, zero-cooldown mode, mark-healthy clears, cumulative failure counting, unknown-node safety, concurrent writers (thread-safety smoke test).
  • HistorianClusterFailoverTests.cs — 6 integration tests driving HistorianDataSource via a scripted FakeHistorianConnectionFactory: first-node-fails-picks-second, all-nodes-fail, second-call-skips-cooled-down-node, single-node-legacy-behavior, picker-order-respected, shared-picker-across-silos.
  • Full plugin suite: 41/41 tests passing. Host suite: 446/447 (1 pre-existing flaky MxAccess monitor test passes on retry).

Live verification on instance1 (cluster = ["does-not-exist-historian.invalid", "localhost"], FailureCooldownSeconds=30):

Failover cycle 1 (fresh picker state, both nodes healthy):

2026-04-13 11:27:25.381 [WRN] Historian node does-not-exist-historian.invalid failed during connect attempt; trying next candidate
2026-04-13 11:27:25.910 [INF] Historian SDK connection opened to localhost:32568
  • historyread returned 1 value successfully (Queries: 1 (Success: 1, Failure: 0)).
  • Dashboard: panel yellow, Cluster: 1 of 2 nodes healthy, bad node cooldown until 11:27:55Z, Process Conn: open (localhost).

Cooldown expiry:

  • At 11:29 UTC, the cooldown window had elapsed. Panel back to green, both nodes healthy, but does-not-exist-historian.invalid retains FailureCount=1 and LastError as history.

Failover cycle 2 (service restart to drop persistent connection):

2026-04-13 14:00:39.352 [WRN] Historian node does-not-exist-historian.invalid failed during connect attempt; trying next candidate
2026-04-13 14:00:39.885 [INF] Historian SDK connection opened to localhost:32568
  • historyread returned 1 value successfully on the second restart cycle — proves the picker re-admits a cooled-down node and the whole failover cycle repeats cleanly.

Single-node restoration:

  • Changed instance1 back to "ServerNames": [], restarted. Dashboard renders Node: localhost (no cluster table), panel green, backward compat verified.

Final configuration: both instances running with empty ServerNames (single-node mode). LmxOpcUa PID 31064, LmxOpcUa2 PID 15012.

Operator configuration shape:

"Historian": {
  "Enabled": true,
  "ServerName": "localhost",                // ignored when ServerNames is non-empty
  "ServerNames": ["historian-a", "historian-b"],
  "FailureCooldownSeconds": 60,
  ...
}

Galaxy Runtime Status Probes + Subtree Quality Invalidation

Updated: 2026-04-13 15:28-16:19 America/New_York

Both instances updated with per-host Galaxy runtime status tracking ($WinPlatform + $AppEngine), proactive subtree quality invalidation when a host transitions to Stopped, and an OPC UA Read short-circuit so operators can no longer read stale-Good cached values from a dead runtime host.

This ships the feature described in the runtimestatus.md plan file. It addresses the production issue reported earlier: "when an AppEngine is set to scan off, LMX updates are received for every tag, causing OPC UA client freeze and sometimes not all OPC UA tags are set to bad quality."

Backups:

  • C:\publish\lmxopcua\backups\20260413-152824-instance1
  • C:\publish\lmxopcua\backups\20260413-152824-instance2

Deployed binary (both instances):

  • ZB.MOM.WW.LmxOpcUa.Host.exe — commit 98ed6bd
  • Three deploys during verification: 15:28 (initial), 15:52 (Read-handler patch), 16:06 (dispatch-thread deadlock fix)

Windows services:

  • LmxOpcUa — Running, PID 29528
  • LmxOpcUa2 — Running, PID 30684

Code changes — what shipped

New config — MxAccessConfiguration:

  • RuntimeStatusProbesEnabled: bool (default true) — enables <Host>.ScanState probing for every deployed $WinPlatform and $AppEngine.
  • RuntimeStatusUnknownTimeoutSeconds: int (default 15) — only applies to the Unknown → Stopped transition; running hosts never time out because ScanState is delivered on-change only.

New hierarchy columns — hierarchy.sql and GalaxyObjectInfo:

  • CategoryId: int — populated from template_definition.category_id (1 = $WinPlatform, 3 = $AppEngine).
  • HostedByGobjectId: int — populated from gobject.hosted_by_gobject_id (the actual column name on this Galaxy schema; the plan document's guess of host_gobject_id was wrong). Walked up to find each variable's nearest Platform/Engine ancestor.

New domain types — Host/Domain/:

  • GalaxyRuntimeState enum (Unknown / Running / Stopped).
  • GalaxyRuntimeStatus DTO with callback/state-change timestamps, LastScanState, LastError, cumulative counters.

New probe manager — Host/MxAccess/GalaxyRuntimeProbeManager.cs:

  • Pure manager, no SDK leakage. Calls AdviseSupervisory on <Host>.ScanState for every runtime host during SyncAsync.
  • State predicate: isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b. Everything else is Stopped.
  • GetSnapshot() forces every entry to Unknown when the MxAccess transport is disconnected — prevents misleading "every host stopped" display when the actual problem is the transport.
  • Tick() only advances Unknown → Stopped on the configured timeout; Running hosts never time out (on-change delivery semantic).
  • IsHostStopped(gobjectId) — used by the Read-path short-circuit; uses underlying state directly (not the snapshot force-unknown rewrite) so a transport outage doesn't double-flag reads.
  • Dispose() unadvises every active probe before MxAccess teardown.
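The state predicate and the asymmetric timeout rule can be sketched as follows (Python, illustrative names only; the shipped C# operates on the LMX VTQ types):

```python
from enum import Enum

class State(Enum):
    UNKNOWN = "Unknown"
    RUNNING = "Running"
    STOPPED = "Stopped"

def classify(quality_good, value):
    """Predicate from the manager: good quality AND a boolean True -> Running.
    Everything else (bad quality, non-bool, False) -> Stopped."""
    return State.RUNNING if (quality_good and value is True) else State.STOPPED

class Probe:
    def __init__(self, now):
        self.state = State.UNKNOWN
        self.since = now

def tick(probes, now, unknown_timeout_s):
    """Only Unknown -> Stopped advances on the configured timeout; Running
    never times out because ScanState is delivered on-change only."""
    for p in probes:
        if p.state is State.UNKNOWN and now - p.since >= unknown_timeout_s:
            p.state = State.STOPPED
            p.since = now
```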

New hosted-variables map (LmxNodeManager):

  • _hostedVariables: Dictionary<int, List<BaseDataVariableState>> — host gobject_id → list of every descendant variable, populated during BuildAddressSpace by walking each variable's HostedByGobjectId chain up to the nearest Platform/Engine. A variable hosted by an Engine inside a Platform appears in BOTH lists.
  • _hostIdsByTagRef: Dictionary<string, List<int>> — reverse index used by the Read short-circuit, populated alongside _hostedVariables.
  • Public MarkHostVariablesBadQuality(int gobjectId) — walks _hostedVariables[gobjectId], sets StatusCode = BadOutOfService on each, calls ClearChangeMasks(ctx, false) to push through the OPC UA publisher.
  • Public ClearHostVariablesBadQuality(int gobjectId) — inverse, resets to Good on recovery.
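The mark/clear walk over the prebuilt map reduces to the following sketch (Python, illustrative; the shipped code sets StatusCode on BaseDataVariableState and pushes the change via ClearChangeMasks). The numeric codes are the ones observed in verification, 0x808D0000 BadOutOfService and 0x00000000 Good:

```python
GOOD, BAD_OUT_OF_SERVICE = 0x00000000, 0x808D0000

class Var:
    """Stand-in for a cached OPC UA variable node."""
    def __init__(self, name):
        self.name = name
        self.status = GOOD

def mark_host_bad(hosted_variables, gobject_id):
    # hosted_variables: host gobject_id -> list of every descendant variable.
    for var in hosted_variables.get(gobject_id, []):
        var.status = BAD_OUT_OF_SERVICE

def clear_host_bad(hosted_variables, gobject_id):
    # Inverse walk on recovery.
    for var in hosted_variables.get(gobject_id, []):
        var.status = GOOD
```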

OPC UA Read short-circuit (LmxNodeManager.Read):

  • Before the normal _mxAccessClient.ReadAsync(tagRef) round-trip, check IsTagUnderStoppedHost(tagRef). If true, return a DataValue { StatusCode = BadOutOfService, Value = cachedVar?.Value } directly. Covers both direct Read requests AND OPC UA monitored-item sampling, which both flow through this override.
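The short-circuit logic, sketched with illustrative names (the shipped override works on DataValue and the _hostIdsByTagRef reverse index):

```python
def read(tag_ref, host_ids_by_tag, is_host_stopped, cached_value, read_live):
    """If any host above the tag is Stopped, answer from the cached value with
    BadOutOfService instead of round-tripping to MxAccess."""
    if any(is_host_stopped(h) for h in host_ids_by_tag.get(tag_ref, [])):
        return (0x808D0000, cached_value)          # BadOutOfService
    return (0x00000000, read_live(tag_ref))        # Good, live read
```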

Deadlock fix — _pendingHostStateChanges queue:

  • First draft invoked MarkHostVariablesBadQuality synchronously from the probe callback. MxAccess delivers OnDataChange on the STA thread; the callback took the node manager Lock. Meanwhile any worker thread inside Read could hold Lock and wait on a pending ReadAsync that needed the STA thread — classic STA deadlock (first real deploy hung in ~30s).
  • Fix: probe transitions are enqueued on ConcurrentQueue<(int GobjectId, bool Stopped)> and the dispatch thread drains the queue inside its existing 100ms WaitOne loop. The dispatch thread takes Lock naturally without STA involvement, so no cycle. Live verified with the IDE OffScan/OnScan cycle after the fix.
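The enqueue-and-drain shape of the fix, sketched in Python (the real code uses ConcurrentQueue inside the dispatch thread's 100ms WaitOne loop; names here are illustrative):

```python
from collections import deque

pending = deque()  # stands in for ConcurrentQueue<(int GobjectId, bool Stopped)>

def on_probe_transition(gobject_id, stopped):
    # Runs on the MxAccess STA callback thread: never take the node-manager
    # lock here, only enqueue the transition.
    pending.append((gobject_id, stopped))

def drain(apply_transition):
    # Runs on the dispatch thread, where taking the node-manager lock cannot
    # form a cycle with an STA-routed ReadAsync.
    while pending:
        gobject_id, stopped = pending.popleft()
        apply_transition(gobject_id, stopped)
```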

Dashboard (Host/Status/):

  • New RuntimeStatusInfo DTO + "Galaxy Runtime" panel between Galaxy Info and Historian. Shows total/running/stopped/unknown counts plus a per-host table with Name / Kind / State / Since / Last Error columns. Panel color: green (all Running), yellow (some Unknown, none Stopped), red (any Stopped), gray (MxAccess disconnected forces every row to Unknown).
  • Subscriptions panel gets a new Probes: N (bridge-owned runtime status) line when non-zero.
  • HealthCheckService Rule 2e: Degraded when any host is Stopped, ordered after Rule 1 (MxAccess transport) to avoid double-messaging when the transport is the root cause.
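The panel-color rules above reduce to a small precedence function (a sketch with illustrative names):

```python
def panel_color(running, stopped, unknown, transport_connected):
    """Gray when the MxAccess transport is down (every row forced Unknown),
    red on any Stopped, yellow on any Unknown, green otherwise."""
    if not transport_connected:
        return "gray"
    if stopped > 0:
        return "red"
    if unknown > 0:
        return "yellow"
    return "green"
```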

Tests

  • 24 new tests in GalaxyRuntimeProbeManagerTests: state transitions (Unknown/Running/Stopped/recovery), unknown-resolution timeout, transport gating, sync diff, dispose, callback exception safety, and IsHostStopped for the Read-path short-circuit (Unknown/Running/Stopped/recovery/unknown-id/transport-disconnected-contract).
  • Full Host suite: 471/471 tests passing. No regressions.

Live end-to-end verification (today, against real IDE OffScan action)

Baseline (before OffScan, dashboard at 15:44:00):

Galaxy Runtime: green, 2 of 2 hosts running
DevAppEngine     $AppEngine     Running  2026-04-13T19:29:12.9475357Z
DevPlatform      $WinPlatform   Running  2026-04-13T19:29:12.9345208Z
TestMachine_001.MachineID → Status 0x00000000 (Good), value "admin_test"

After operator Set OffScan on DevAppEngine in IDE (log at 15:44:25):

15:44:25.554  Galaxy runtime DevAppEngine.ScanState transitioned Running → Stopped (ScanState = false (OffScan))
15:44:25.557  Marked 3971 variable(s) BadOutOfService for stopped host gobject_id=1043

Dashboard: red panel, 1 of 2 hosts running (1 stopped, 0 unknown). Health: Degraded — Galaxy runtime has 1 of 2 host(s) stopped: DevAppEngine. Latency: 3 ms from probe callback to subtree walk complete.

Read during stop — found bug #1 (Read handler bypassed cached state):

  • Initial deploy: TestMachine_001.MachineID still read 0x00000000 Good with a post-stop source time from MxAccess. Revealed that LmxNodeManager.Read calls _mxAccessClient.ReadAsync() directly and never consults the in-memory BaseDataVariableState.StatusCode we set during the walk.
  • Fix: IsTagUnderStoppedHost short-circuit in Read override. After patch: [808D0000] BadOutOfService on all three test tags.

Read during stop — found bug #2 (deadlock):

  • After shipping the Read patch, the service hung on the next OffScan. HTTP listener accepted connections but never responded, and service shutdown stuck at STOP_PENDING for 15+ seconds until manually killed.
  • Diagnosis: the probe callback fires HandleProbeUpdate → MarkHostVariablesBadQuality → acquires Lock on the STA thread. Meanwhile the dispatch thread can sit inside Read holding Lock and waiting for an STA-routed ReadAsync. Circular wait.
  • Fix: enqueue probe transitions onto ConcurrentQueue and drain on the dispatch thread where Lock acquisition is safe. Second deploy resolved the hang.

A/B verification (instance1 patched, instance2 not yet):

Instance             TestMachine_001.MachineID
LmxOpcUa (patched)   0x808D0000 BadOutOfService
LmxOpcUa2 (old)      0x00000000 Good, stale

Clean A/B confirmed the Read patch is required; instance2 subsequently updated to match.

Recovery (operator Set OnScan on DevAppEngine, log at 16:10:05):

16:10:05.129  Galaxy runtime DevAppEngine.ScanState transitioned → Running
16:10:05.130  Cleared bad-quality override on 3971 variable(s) for recovered host gobject_id=1043

Dashboard: back to green, DevAppEngine Running with new Since = 20:10:05.129Z. All three test tags back to 0x00000000 Good with fresh source timestamps. 1ms from probe callback to subtree clear.

Client freeze observation — phase 2 decision gate

The original production issue had two symptoms: (1) an incomplete quality flip and (2) an OPC UA client freeze. The subtree walk plus the Read short-circuit fixes (1) definitively. For (2), the dispatch queue can still flood with the per-tag callbacks MxAccess fans out when a host stops — the bridge doesn't currently drop them. We deliberately did not ship dispatch suppression in this pass, on the grounds that the subtree walk may coalesce notifications sufficiently at the SDK publisher level to resolve the freeze on its own. Verification against the live Galaxy ran with no OPC UA clients subscribed, so it doesn't tell us one way or the other; the next subscribed-client test against a real stop will be the deciding measurement. If the client still freezes after the walk, phase 2 adds pre-dispatch filtering for tags under Stopped hosts.

What's deferred

  • Synthetic OPC UA child nodes ($RuntimeState, $LastCallbackTime, etc.) under each host object. Dashboard + health surface give operators visibility today; the OPC UA synthetic nodes are a follow-up.
  • Dispatch suppression — gated on observing whether the subtree walk alone resolves the client freeze in production.
  • Documentation updates — the docs/ guides (MxAccessBridge.md, StatusDashboard.md, Configuration.md, HistoricalDataAccess.md) still describe the pre-runtime-status behavior. Need a consolidated doc pass covering this feature plus the historian cluster + health surface updates from earlier today.

Notes

The service deployment and restart succeeded. The live CLI checks confirm the endpoint is reachable and that the array node identifier has changed to the bracketless form. The array value on the live service still prints as blank even though the status is good, so if this environment should have populated MoveInPartNumbers, the runtime data path still needs follow-up investigation.