# Service Update Summary
Updated service instance: `C:\publish\lmxopcua\instance1`

Update time: `2026-03-25 12:54-12:55 America/New_York`

Backup created before deploy: `C:\publish\lmxopcua\backups\20260325-125444`

Configuration preserved:

- `C:\publish\lmxopcua\instance1\appsettings.json` was not overwritten.

Deployed binary:

- `C:\publish\lmxopcua\instance1\ZB.MOM.WW.LmxOpcUa.Host.exe`
- Last write time: `2026-03-25 12:53:58`
- Size: `143360` bytes

Windows service:

- Name: `LmxOpcUa`
- Display name: `LMX OPC UA Server`
- Account: `LocalSystem`
- Status after update: `Running`
- Process ID after restart: `29236`

Restart evidence:

- Service log file: `C:\publish\lmxopcua\instance1\logs\lmxopcua-20260325_004.log`
- Last startup line: `2026-03-25 12:55:08.619 -04:00 [INF] The LmxOpcUa service was started.`

## CLI Verification

Endpoint from deployed config:

- `opc.tcp://localhost:4840/LmxOpcUa`

CLI used:

- `C:\Users\dohertj2\Desktop\lmxopcua\tools\opcuacli-dotnet\bin\Debug\net10.0\opcuacli-dotnet.exe`

Commands run:
```powershell
opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa
opcuacli-dotnet.exe read -u opc.tcp://localhost:4840/LmxOpcUa -n 'ns=1;s=MESReceiver_001.MoveInPartNumbers'
opcuacli-dotnet.exe read -u opc.tcp://localhost:4840/LmxOpcUa -n 'ns=1;s=MESReceiver_001.MoveInPartNumbers[]'
```

Observed results:

- `connect`: succeeded; server reported as `LmxOpcUa`.
- `read ns=1;s=MESReceiver_001.MoveInPartNumbers`: succeeded with good status `0x00000000`.
- `read ns=1;s=MESReceiver_001.MoveInPartNumbers[]`: failed with `BadNodeIdUnknown` (`0x80340000`).

---

## Instance 2 (Redundant Secondary)

Deployed: `2026-03-28`

Deployment path: `C:\publish\lmxopcua\instance2`

Configuration:

- `OpcUa.Port`: `4841`
- `OpcUa.ServerName`: `LmxOpcUa2`
- `OpcUa.ApplicationUri`: `urn:localhost:LmxOpcUa:instance2`
- `Dashboard.Port`: `8082`
- `MxAccess.ClientName`: `LmxOpcUa2`
- `Redundancy.Enabled`: `true`
- `Redundancy.Mode`: `Warm`
- `Redundancy.Role`: `Secondary`
- `Redundancy.ServerUris`: `["urn:localhost:LmxOpcUa:instance1", "urn:localhost:LmxOpcUa:instance2"]`

Windows service:

- Name: `LmxOpcUa2`
- Display name: `LMX OPC UA Server (Instance 2)`
- Account: `LocalSystem`
- Endpoint: `opc.tcp://localhost:4841/LmxOpcUa`

Instance 1 redundancy update (same date):

- `OpcUa.ApplicationUri`: `urn:localhost:LmxOpcUa:instance1`
- `Redundancy.Enabled`: `true`
- `Redundancy.Mode`: `Warm`
- `Redundancy.Role`: `Primary`
- `Redundancy.ServerUris`: `["urn:localhost:LmxOpcUa:instance1", "urn:localhost:LmxOpcUa:instance2"]`

CLI verification:

```
opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa
→ Redundancy Mode: Warm, Service Level: 200, Application URI: urn:localhost:LmxOpcUa:instance1

opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa
→ Redundancy Mode: Warm, Service Level: 150, Application URI: urn:localhost:LmxOpcUa:instance2
```

Both instances report the same `ServerUriArray` and expose the same Galaxy namespace (`urn:ZB:LmxOpcUa`).

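In warm redundancy, clients pick the active server by comparing the `ServiceLevel` each member advertises. A minimal sketch of that selection rule (illustrative only; `pick_active` is not part of the shipped client):

```python
# Hypothetical sketch: given the ServiceLevel each redundant server reports,
# a client connects to the server advertising the highest value.
def pick_active(servers: dict) -> str:
    """Return the application URI with the highest advertised ServiceLevel."""
    return max(servers, key=servers.get)

endpoints = {
    "urn:localhost:LmxOpcUa:instance1": 200,  # primary (per the CLI output above)
    "urn:localhost:LmxOpcUa:instance2": 150,  # warm secondary
}
assert pick_active(endpoints) == "urn:localhost:LmxOpcUa:instance1"
```

If the primary drops its ServiceLevel (or disappears), the same comparison selects the secondary on the next evaluation.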
## LDAP Authentication Update

Updated: `2026-03-28`

Both instances updated to use LDAP authentication via GLAuth.

Configuration changes (both instances):

- `Authentication.AllowAnonymous`: `true` (anonymous can browse/read)
- `Authentication.AnonymousCanWrite`: `false` (anonymous writes blocked)
- `Authentication.Ldap.Enabled`: `true`
- `Authentication.Ldap.Host`: `localhost`
- `Authentication.Ldap.Port`: `3893`
- `Authentication.Ldap.BaseDN`: `dc=lmxopcua,dc=local`

LDAP server: GLAuth v2.4.0 at `C:\publish\glauth\` (Windows service: `GLAuth`)

Permission verification (instance1, port 4840):

```
anonymous read → allowed
anonymous write → denied (BadUserAccessDenied)
readonly read → allowed
readonly write → denied (BadUserAccessDenied)
readwrite write → allowed
admin write → allowed
alarmack write → denied (BadUserAccessDenied)
bad password → denied (connection rejected)
```

## Alarm Notifier Chain Update

Updated: `2026-03-28`

Both instances updated with alarm event propagation up the notifier chain.

Code changes:

- Alarm events now walk up the parent chain (`ReportEventUpNotifierChain`), reporting to every ancestor node
- `EventNotifier = SubscribeToEvents` is set on all ancestors of alarm-containing nodes (`EnableEventNotifierUpChain`)
- Removed the separate `Server.ReportEvent` call (no longer needed — the walk reaches the root)

No configuration changes required — alarm tracking was already enabled (`AlarmTrackingEnabled: true`).

Verification (instance1, port 4840):

```
alarms --node TestArea --refresh:
TestMachine_001.TestAlarm001 → visible (Severity=500, Retain=True)
TestMachine_001.TestAlarm002 → visible (Severity=500, Retain=True)
TestMachine_001.TestAlarm003 → visible (Severity=500, Retain=True)
TestMachine_002.TestAlarm001 → visible (Severity=500, Retain=True)
TestMachine_002.TestAlarm003 → visible (Severity=500, Retain=True)

alarms --node DEV --refresh:
Same 5 alarms visible at DEV (grandparent) level
```

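The "walk up the notifier chain" behavior can be sketched in a few lines. This is an illustrative model, not the actual C# implementation; `Node` and `report_event_up_chain` are placeholder names for the idea behind `ReportEventUpNotifierChain`:

```python
# Sketch: an alarm event is delivered to every ancestor of its source node,
# all the way to the root, so subscriptions at any level see it.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.received = []  # events reported to this node

def report_event_up_chain(source, event):
    node = source.parent
    while node is not None:       # visit every ancestor, ending at the root
        node.received.append(event)
        node = node.parent

root = Node("Root")
dev = Node("DEV", root)
area = Node("TestArea", dev)
machine = Node("TestMachine_001", area)

report_event_up_chain(machine, "TestAlarm001")
# The alarm is now visible at TestArea, at DEV (the grandparent), and at the root,
# matching the verification transcript above.
assert all("TestAlarm001" in n.received for n in (area, dev, root))
```

Because the walk reaches the root, a separate server-level report call becomes redundant, which is why it was removed.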
## Auth Consolidation Update

Updated: `2026-03-28`

Both instances updated to consolidate LDAP roles into OPC UA session roles (`RoleBasedIdentity.GrantedRoleIds`).

Code changes:

- LDAP groups now map to custom OPC UA role NodeIds in the `urn:zbmom:lmxopcua:roles` namespace
- Roles stored on the session identity via `GrantedRoleIds` — no username-to-role side cache
- Permission checks use `GrantedRoleIds.Contains()` instead of username extraction
- `AnonymousCanWrite` behavior is consistent regardless of LDAP state
- Galaxy namespace moved from `ns=2` to `ns=3` (the roles namespace is `ns=2`)

No configuration changes required.

Verification (instance1, port 4840):

```
anonymous read → allowed
anonymous write → denied (BadUserAccessDenied, AnonymousCanWrite=false)
readonly write → denied (BadUserAccessDenied)
readwrite write → allowed
admin write → allowed
alarmack write → denied (BadUserAccessDenied)
bad password → rejected (connection failed)
```

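The consolidation boils down to: permission checks test role membership on the session identity rather than re-deriving roles from the username. A hedged sketch (role ids and the `can_write` helper are illustrative placeholders, not the product API):

```python
# Assumed role NodeId string for illustration only.
WRITE_ROLE = "ns=2;s=ReadWrite"

class SessionIdentity:
    """Stand-in for RoleBasedIdentity: roles granted at session creation."""
    def __init__(self, granted_role_ids):
        self.granted_role_ids = set(granted_role_ids)

def can_write(identity, anonymous_can_write=False):
    if not identity.granted_role_ids:          # anonymous session, no roles
        return anonymous_can_write
    return WRITE_ROLE in identity.granted_role_ids  # GrantedRoleIds.Contains()

assert can_write(SessionIdentity([WRITE_ROLE]))            # readwrite user
assert not can_write(SessionIdentity(["ns=2;s=ReadOnly"])) # readonly user
assert not can_write(SessionIdentity([]))                  # AnonymousCanWrite=false
```

Keeping the roles on the identity itself is what makes `AnonymousCanWrite` behave the same whether or not LDAP is enabled: the check never consults the directory at write time.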
## Granular Write Roles Update

Updated: `2026-03-28`

Both instances updated with granular write roles replacing the single ReadWrite role.

Code changes:

- `ReadWrite` role replaced by `WriteOperate`, `WriteTune`, `WriteConfigure`
- Write permission checks now consider the Galaxy security classification of the target attribute
- `SecurityClassification` stored in `TagMetadata` for per-node lookup at write time

GLAuth changes:

- New groups: `WriteOperate` (5502), `WriteTune` (5504), `WriteConfigure` (5505)
- New users: `writeop`, `writetune`, `writeconfig`
- `admin` user added to all groups (5502, 5503, 5504, 5505)

Config changes (both instances):

- `Authentication.Ldap.ReadWriteGroup` replaced by `WriteOperateGroup`, `WriteTuneGroup`, `WriteConfigureGroup`

Verification (instance1, port 4840, Operate-classified attributes):

```
anonymous read → allowed
anonymous write → denied (AnonymousCanWrite=false)
readonly write → denied (no write role)
writeop write → allowed (WriteOperate matches Operate classification)
writetune write → denied (WriteTune doesn't match Operate)
writeconfig write → denied (WriteConfigure doesn't match Operate)
admin write → allowed (has all write roles)
```

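The verification table above follows from a simple rule: a write succeeds only when the session holds the write role matching the target attribute's security classification. A sketch of that rule (the role-to-classification mapping is an assumption about the implementation, mirroring the transcript):

```python
# Assumed mapping from Galaxy security classification to required write role.
ROLE_FOR_CLASSIFICATION = {
    "Operate": "WriteOperate",
    "Tune": "WriteTune",
    "Configure": "WriteConfigure",
}

def write_allowed(session_roles, classification):
    """True when the session holds the role matching the attribute's classification."""
    required = ROLE_FOR_CLASSIFICATION[classification]
    return required in session_roles

assert write_allowed({"WriteOperate"}, "Operate")       # writeop → allowed
assert not write_allowed({"WriteTune"}, "Operate")      # writetune → denied
assert not write_allowed({"WriteConfigure"}, "Operate") # writeconfig → denied
assert write_allowed({"WriteOperate", "WriteTune", "WriteConfigure"}, "Operate")  # admin
```

The per-node `SecurityClassification` stored in `TagMetadata` is what supplies the `classification` argument at write time.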
## Historian SDK Migration

Updated: `2026-04-06`

Both instances updated to use the Wonderware Historian SDK (`aahClientManaged.dll`) instead of direct SQL queries for historical data access.

Code changes:

- `HistorianDataSource` rewritten from `SqlConnection`/`SqlDataReader` to the `ArchestrA.HistorianAccess` SDK
- Persistent connection with lazy connect and auto-reconnect on failure
- `HistorianConfiguration.ConnectionString` replaced with `ServerName`, `IntegratedSecurity`, `UserName`, `Password`, `Port`
- `HistorianDataSource` now implements `IDisposable`, disposed on service shutdown
- `ConfigurationValidator` validates Historian SDK settings at startup

SDK DLLs deployed to both instances:

- `aahClientManaged.dll` (primary SDK, v2.0.0.0)
- `aahClient.dll`, `aahClientCommon.dll` (dependencies)
- `Historian.CBE.dll`, `Historian.DPAPI.dll`, `ArchestrA.CloudHistorian.Contract.dll`

Configuration changes (both instances):

- `Historian.ConnectionString` removed
- `Historian.ServerName`: `"localhost"`
- `Historian.IntegratedSecurity`: `true`
- `Historian.Port`: `32568`
- `Historian.Enabled`: `true` (unchanged)

Verification (instance1 startup log):

```
Historian.Enabled=true, ServerName=localhost, IntegratedSecurity=true, Port=32568
Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000
=== Configuration Valid ===
LmxOpcUa service started successfully
```

## HistoryServerCapabilities and Continuation Points

Updated: `2026-04-06`

Both instances updated with OPC UA Part 11 spec compliance improvements.

Code changes:

- `HistoryServerCapabilities` node populated under `ServerCapabilities` with all boolean capability properties
- `AggregateFunctions` folder populated with references to the 7 supported aggregate functions
- `HistoryContinuationPointManager` added — stores remaining data when results exceed `NumValuesPerNode`
- `HistoryReadRawModified` and `HistoryReadProcessed` now return a `ContinuationPoint` in `HistoryReadResult` for partial reads
- Follow-up requests with a `ContinuationPoint` resume from stored state; invalid/expired points return `BadContinuationPointInvalid`

No configuration changes required.

Verification (instance1 startup log):

```
HistoryServerCapabilities configured with 7 aggregate functions
LmxOpcUa service started successfully
```

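The continuation-point flow can be sketched compactly. This is an illustrative model of the mechanism, not the shipped `HistoryContinuationPointManager`: when a history read yields more values than `NumValuesPerNode`, the remainder is parked under an opaque token that the client sends back to resume, and an unknown token maps to `BadContinuationPointInvalid`:

```python
import uuid

class ContinuationPointManager:
    """Illustrative: park overflow values under an opaque continuation token."""
    def __init__(self):
        self._store = {}

    def read(self, values, max_per_node, continuation_point=None):
        if continuation_point is not None:
            values = self._store.pop(continuation_point, None)
            if values is None:  # expired or never issued
                raise KeyError("BadContinuationPointInvalid")
        page, rest = values[:max_per_node], values[max_per_node:]
        token = None
        if rest:
            token = uuid.uuid4().bytes  # opaque to the client
            self._store[token] = rest
        return page, token

mgr = ContinuationPointManager()
page1, cp = mgr.read(list(range(5)), max_per_node=3)
assert page1 == [0, 1, 2] and cp is not None     # partial read, token issued
page2, cp2 = mgr.read([], max_per_node=3, continuation_point=cp)
assert page2 == [3, 4] and cp2 is None           # resumed, exhausted
```

A real implementation also expires stored points after a timeout; that bookkeeping is omitted here.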
## Remaining Historian Gaps Fix

Updated: `2026-04-06`

Both instances updated with the remaining OPC UA Part 11 spec compliance fixes.

Code changes:

- **Gap 4**: `HistoryReadRawModified` returns `BadHistoryOperationUnsupported` when `IsReadModified=true`
- **Gap 5**: `HistoryReadAtTime` override added with `ReadAtTimeAsync` using SDK `HistorianRetrievalMode.Interpolated`
- **Gap 8**: `HistoricalDataConfigurationState` child nodes added to historized variables (`Stepped=false`, `Definition="Wonderware Historian"`)
- **Gap 10**: `ReturnBounds` parameter handled — boundary `DataValue` entries with `BadBoundNotFound` inserted at StartTime/EndTime
- **Gap 11**: `StandardDeviation` aggregate added to the client enum, mapper, CLI (aliases: `stddev`/`stdev`), and UI dropdown

No configuration changes required.

## Historical Event Access

Updated: `2026-04-06`

Both instances updated with OPC UA historical event access (Gap 7).

Code changes:

- `HistorianDataSource.ReadEventsAsync` queries the Historian event store via a separate `HistorianConnectionType.Event` connection
- `LmxNodeManager.HistoryReadEvents` override maps `HistorianEvent` records to OPC UA `HistoryEventFieldList` entries
- `AccessHistoryEventsCapability` set to `true` when `AlarmTrackingEnabled` is true
- Event fields: EventId, EventType, SourceNode, SourceName, Time, ReceiveTime, Message, Severity

No configuration changes required. All historian gaps (1-11) are now resolved.

## Data Access Gaps Fix

Updated: `2026-04-06`

Both instances updated with OPC UA DA spec compliance fixes.

Code changes:

- `ConfigureServerCapabilities()` populates the `ServerCapabilities` node: `ServerProfileArray`, `LocaleIdArray`, `MinSupportedSampleRate`, continuation point limits, array/string limits, and 12 `OperationLimits` values
- `Server_ServerDiagnostics_EnabledFlag` set to `true` — the SDK auto-tracks session/subscription counts
- `OnModifyMonitoredItemsComplete` override logs monitored item modifications

No configuration changes required. All DA gaps (1-8) resolved.

## Alarms & Conditions Gaps Fix

Updated: `2026-04-06`

Both instances updated with OPC UA Part 9 alarm spec compliance fixes.

Code changes:

- Wired `OnConfirm`, `OnAddComment`, `OnEnableDisable`, `OnShelve`, `OnTimedUnshelve` handlers on each `AlarmConditionState`
- Shelving: `SetShelvingState()` manages the `TimedShelve`, `OneShotShelve`, `Unshelve` state machine
- `ReportAlarmEvent` now populates the `LocalTime` (timezone offset + DST) and `Quality` event fields
- Flaky `Monitor_ProbeDataChange_PreventsStaleReconnect` test fixed (stale threshold increased from 2s to 5s)

No configuration changes required. All A&C gaps (1-10) resolved.

## Security Gaps Fix

Updated: `2026-04-06`

Both instances updated with OPC UA Part 2/4/7 security spec compliance fixes.

Code changes:

- `SecurityProfileResolver`: added 4 modern AES profiles (`Aes128_Sha256_RsaOaep-Sign/SignAndEncrypt`, `Aes256_Sha256_RsaPss-Sign/SignAndEncrypt`)
- `OnImpersonateUser`: added `X509IdentityToken` handling with CN extraction and role assignment
- `BuildUserTokenPolicies`: advertises `UserTokenType.Certificate` when non-None security profiles are configured
- `OnCertificateValidation`: enhanced logging with certificate thumbprint, subject, and expiry
- Authentication audit logging: `AUDIT:`-prefixed log entries for success/failure with session ID and roles

No configuration changes required. All security gaps (1-10) resolved.

## Historian Plugin Runtime Load + Dashboard Health

Updated: `2026-04-12 18:47-18:49 America/New_York`

Both instances updated to the latest build. This build brings in the runtime-loaded Historian plugin (`Historian/` subfolder next to the Host) and the status dashboard health surface for the historian plugin and alarm-tracking misconfiguration.

Backups created before deploy:

- `C:\publish\lmxopcua\backups\20260412-184713-instance1`
- `C:\publish\lmxopcua\backups\20260412-184713-instance2`

Configuration preserved:

- `C:\publish\lmxopcua\instance1\appsettings.json` was not overwritten.
- `C:\publish\lmxopcua\instance2\appsettings.json` was not overwritten.

Layout change:

- Flat historian interop DLLs removed from each instance root (`aahClient*.dll`, `ArchestrA.CloudHistorian.Contract.dll`, `Historian.CBE.dll`, `Historian.DPAPI.dll`).
- Historian plugin + interop DLLs now live under `<instance>\Historian\` (including `ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll`), loaded by `HistorianPluginLoader`.

Deployed binary (both instances):

- `ZB.MOM.WW.LmxOpcUa.Host.exe`
- Last write time: `2026-04-12 18:46:22 -04:00`
- Size: `7938048` bytes

Windows services:

- `LmxOpcUa` — Running, PID `40176`
- `LmxOpcUa2` — Running, PID `34400`

Restart evidence (instance1 `logs/lmxopcua-20260412.log`):

```
2026-04-12 18:48:02.968 -04:00 [INF] Historian.Enabled=true, ServerName=localhost, IntegratedSecurity=true, Port=32568
2026-04-12 18:48:02.971 -04:00 [INF] === Configuration Valid ===
2026-04-12 18:48:09.658 -04:00 [INF] Historian plugin loaded from C:\publish\lmxopcua\instance1\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
2026-04-12 18:48:13.691 -04:00 [INF] LmxOpcUa service started successfully
```

Restart evidence (instance2 `logs/lmxopcua-20260412.log`):

```
2026-04-12 18:49:08.152 -04:00 [INF] Historian.Enabled=true, ServerName=localhost, IntegratedSecurity=true, Port=32568
2026-04-12 18:49:08.155 -04:00 [INF] === Configuration Valid ===
2026-04-12 18:49:14.744 -04:00 [INF] Historian plugin loaded from C:\publish\lmxopcua\instance2\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
2026-04-12 18:49:18.777 -04:00 [INF] LmxOpcUa service started successfully
```

CLI verification (via `dotnet run --project src/ZB.MOM.WW.LmxOpcUa.Client.CLI`):

```
connect opc.tcp://localhost:4840/LmxOpcUa → Server: LmxOpcUa
connect opc.tcp://localhost:4841/LmxOpcUa → Server: LmxOpcUa2
redundancy opc.tcp://localhost:4840/LmxOpcUa → Warm, ServiceLevel=200, urn:localhost:LmxOpcUa:instance1
redundancy opc.tcp://localhost:4841/LmxOpcUa → Warm, ServiceLevel=150, urn:localhost:LmxOpcUa:instance2
```

Both instances report the same `ServerUriArray` and the primary advertises the higher ServiceLevel, matching the prior redundancy baseline.

## Endpoints Panel on Dashboard

Updated: `2026-04-13 08:46-08:50 America/New_York`

Both instances updated with a new `Endpoints` panel on the status dashboard surfacing the opc.tcp base addresses, active OPC UA security profiles (mode + policy name + full URI), and user token policies.

Code changes:

- `StatusData.cs` — added `EndpointsInfo` / `SecurityProfileInfo` DTOs on `StatusData`.
- `OpcUaServerHost.cs` — added `BaseAddresses`, `SecurityPolicies`, `UserTokenPolicies` runtime accessors reading `ApplicationConfiguration.ServerConfiguration` live state.
- `StatusReportService.cs` — builds `EndpointsInfo` from the host and renders a new panel with a graceful empty state when the server is not started.

No configuration changes required.

Verification (instance1 @ `http://localhost:8085/`):

```
Base Addresses: opc.tcp://localhost:4840/LmxOpcUa
Security Profiles: None / None / http://opcfoundation.org/UA/SecurityPolicy#None
User Token Policies: Anonymous, UserName
```

Verification (instance2 @ `http://localhost:8086/`):

```
Base Addresses: opc.tcp://localhost:4841/LmxOpcUa
Security Profiles: None / None / http://opcfoundation.org/UA/SecurityPolicy#None
User Token Policies: Anonymous, UserName
```

## Template-Based Alarm Object Filter

Updated: `2026-04-13 09:39-09:43 America/New_York`

Both instances updated with a new configurable alarm object filter. When `OpcUa.AlarmFilter.ObjectFilters` is non-empty, only Galaxy objects whose template derivation chain matches a pattern (and their containment-tree descendants) contribute `AlarmConditionState` nodes. When the list is empty, the current unfiltered behavior is preserved (backward-compatible default).

Backups created before deploy:

- `C:\publish\lmxopcua\backups\20260413-093900-instance1`
- `C:\publish\lmxopcua\backups\20260413-093900-instance2`

Deployed binary (both instances):

- `ZB.MOM.WW.LmxOpcUa.Host.exe`
- Last write time: `2026-04-13 09:38:46 -04:00`
- Size: `7951360` bytes

Windows services:

- `LmxOpcUa` — Running, PID `40900`
- `LmxOpcUa2` — Running, PID `29936`

Code changes:

- `gr/queries/hierarchy.sql` — added a recursive CTE on `gobject.derived_from_gobject_id` and a new `template_chain` column (pipe-delimited, innermost template first).
- `Domain/GalaxyObjectInfo.cs` — added `TemplateChain: List<string>` populated from the new SQL column.
- `GalaxyRepositoryService.cs` — reads the new column and splits it into `TemplateChain`.
- `Configuration/AlarmFilterConfiguration.cs` (new) — `List<string> ObjectFilters`; entries may themselves be comma-separated. Attached to `OpcUaConfiguration.AlarmFilter`.
- `Configuration/ConfigurationValidator.cs` — logs the effective filter and warns if patterns are configured while `AlarmTrackingEnabled == false`.
- `Domain/AlarmObjectFilter.cs` (new) — compiles wildcard patterns (`*` only) to case-insensitive regexes with the Galaxy `$` prefix normalized on both sides; walks the hierarchy top-down with cycle defense; returns a `HashSet<int>` of included gobject IDs plus `UnmatchedPatterns` for startup warnings.
- `OpcUa/LmxNodeManager.cs` — constructor accepts the filter; the two alarm-creation loops (the `BuildAddressSpace` full build and the subtree rebuild path) both call `ResolveAlarmFilterIncludedIds(sorted)` and skip any object not in the resolved set. New public properties expose filter state to the dashboard: `AlarmFilterEnabled`, `AlarmFilterPatternCount`, `AlarmFilterIncludedObjectCount`.
- `OpcUa/OpcUaServerHost.cs`, `OpcUa/LmxOpcUaServer.cs`, `OpcUaService.cs`, `OpcUaServiceBuilder.cs` — plumbing to construct and thread the filter from `appsettings.json` down to the node manager.
- `Status/StatusData.cs` + `Status/StatusReportService.cs` — `AlarmStatusInfo` gains `FilterEnabled`, `FilterPatternCount`, `FilterIncludedObjectCount`; a filter summary line renders in the Alarms panel when the filter is active.

Tests:

- 36 new unit tests in `tests/.../Domain/AlarmObjectFilterTests.cs` covering pattern parsing, wildcard semantics, regex escaping, Galaxy `$` normalization, template-chain matching, subtree propagation, set semantics, orphan/cycle defense, and `UnmatchedPatterns` tracking.
- 5 new integration tests in `tests/.../Integration/AlarmObjectFilterIntegrationTests.cs` spinning up a real `LmxNodeManager` via `OpcUaServerFixture` and asserting `AlarmConditionCount`/`AlarmFilterIncludedObjectCount` under various filters.
- 1 new Status test verifying the JSON exposes the filter counters.
- Full suite: **446/446 tests passing** (no regressions).

Configuration change: both instances have `OpcUa.AlarmFilter.ObjectFilters: []` (filter disabled, unfiltered alarm tracking preserved).

Live verification against the instance1 Galaxy (filter temporarily set to `"TestMachine"`):

```
2026-04-13 09:41:31 [INF] OpcUa.AlarmTrackingEnabled=true, AlarmFilter.ObjectFilters=[TestMachine]
2026-04-13 09:41:42 [INF] Alarm filter: 42 of 49 objects included (1 pattern(s))
Dashboard Alarms panel: Tracking: True | Conditions: 60 | Active: 4
Filter: 1 pattern(s), 42 object(s) included
```

Final configuration restored to the empty filter. Dashboard confirms unfiltered behavior on both endpoints:

```
instance1 @ http://localhost:8085/ → Conditions: 60 | Active: 4 (no filter line)
instance2 @ http://localhost:8086/ → Conditions: 60 | Active: 4 (no filter line)
```

Filter syntax quick reference (documented in `AlarmFilterConfiguration.cs` XML-doc):

- `*` is the only wildcard (glob-style; zero or more characters).
- Matching is case-insensitive and ignores the Galaxy leading `$` template prefix on both the pattern and the stored chain entry, so operators write `TestMachine*`, not `$TestMachine*`.
- Each entry may contain comma-separated patterns for convenience (e.g., `"TestMachine*, Pump_*"`).
- Empty list → filter disabled → current unfiltered behavior.
- Match semantics: an object is included when any template in its derivation chain matches any pattern, and the inclusion propagates to all descendants in the containment hierarchy. Each object is evaluated once regardless of how many patterns or ancestors match.

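The pattern-matching rules in the quick reference can be sketched directly. This is an illustrative model of the described semantics, not the shipped `AlarmObjectFilter` (which additionally resolves subtree propagation over the containment hierarchy):

```python
import re

def compile_pattern(pattern: str):
    """'*' is the only wildcard; case-insensitive; leading Galaxy '$' ignored."""
    body = re.escape(pattern.lstrip("$")).replace(r"\*", ".*")
    return re.compile(f"^{body}$", re.IGNORECASE)

def chain_matches(template_chain, filters):
    """True when any template in the derivation chain matches any pattern.

    Each filter entry may itself contain comma-separated patterns.
    """
    compiled = [compile_pattern(p.strip())
                for entry in filters for p in entry.split(",")]
    return any(rx.match(t.lstrip("$"))
               for t in template_chain for rx in compiled)

# An object derived from $TestMachine (innermost template first):
chain = ["$TestMachine", "$UserDefined"]
assert chain_matches(chain, ["TestMachine*, Pump_*"])  # comma-separated entry
assert chain_matches(chain, ["testmachine"])           # case-insensitive, '$' ignored
assert not chain_matches(chain, ["Pump_*"])
```

Escaping the pattern before substituting `*` ensures characters like `.` or `_` in object names are matched literally, which is the "regex escaping" behavior the unit tests above exercise.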
## Historian Runtime Health Surface

Updated: `2026-04-13 10:44-10:52 America/New_York`

Both instances updated with runtime historian query instrumentation so the status dashboard can detect silent query degradation that the load-time `PluginStatus` cannot catch.

Backups:

- `C:\publish\lmxopcua\backups\20260413-104406-instance1`
- `C:\publish\lmxopcua\backups\20260413-104406-instance2`

Code changes:

- `Host/Historian/HistorianHealthSnapshot.cs` (new) — DTO with `TotalQueries`, `TotalSuccesses`, `TotalFailures`, `ConsecutiveFailures`, `LastSuccessTime`, `LastFailureTime`, `LastError`, `ProcessConnectionOpen`, `EventConnectionOpen`.
- `Host/Historian/IHistorianDataSource.cs` — added a `GetHealthSnapshot()` interface method.
- `Historian.Aveva/HistorianDataSource.cs` — added `_healthLock`-guarded counters and `RecordSuccess()` / `RecordFailure(path)` helpers called at every terminal site in all four read methods (raw, aggregate, at-time, events). Error messages carry a `raw:` / `aggregate:` / `at-time:` / `events:` prefix so operators can tell which SDK call is broken.
- `Host/OpcUa/LmxNodeManager.cs` — exposes a `HistorianHealth` property that proxies to `IHistorianDataSource.GetHealthSnapshot()`.
- `Host/Status/StatusData.cs` — added 9 new fields on `HistorianStatusInfo`.
- `Host/Status/StatusReportService.cs` — `BuildHistorianStatusInfo()` populates the new fields from the node manager; the panel color gradient runs green → yellow (1-4 consecutive failures) → red (≥5 consecutive, or plugin unloaded). Renders `Queries: N (Success: X, Failure: Y) | Consecutive Failures: Z`, `Process Conn: open/closed | Event Conn: open/closed`, plus `Last Success:` / `Last Failure:` / `Last Error:` lines when applicable.
- `Host/Status/HealthCheckService.cs` — new Rule 2b2: `Degraded` when `ConsecutiveFailures >= 3`. The threshold was chosen to avoid flagging single transient blips.

Tests:

- 5 new unit tests in `HistorianDataSourceLifecycleTests` covering the fresh zero-state, a single failure, multi-failure consecutive increment, cross-read-path counting, and error-message-carries-path.
- Full suite: 16/16 plugin tests, 447/447 host tests passing.

Live verification on instance1:

```
Before any query:
Queries: 0 (Success: 0, Failure: 0) | Process Conn: closed | Event Conn: closed

After TestMachine_001.TestHistoryValue raw read:
Queries: 1 (Success: 1, Failure: 0) | Process Conn: open
Last Success: 2026-04-13T14:45:18Z

After aggregate hourly-average over 24h:
Queries: 2 (Success: 2, Failure: 0)

After historyread against an unknown node id (bad tag):
Queries: 2 (counter unchanged — rejected at node-lookup before reaching the plugin; correct)
```

The JSON endpoint `/api/status` carries all 9 new fields with correct types. Both instances deployed; instance1 `LmxOpcUa` PID 33824, instance2 `LmxOpcUa2` PID 30200.

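The counter discipline described above (lock-guarded, success resets the consecutive-failure streak, every failure tags which read path broke) can be sketched as follows. This is an assumed shape for illustration, not the C# source:

```python
import threading

class HistorianHealth:
    """Illustrative model of the _healthLock-guarded query counters."""
    def __init__(self):
        self._lock = threading.Lock()
        self.total = 0
        self.successes = 0
        self.failures = 0
        self.consecutive_failures = 0
        self.last_error = None

    def record_success(self):
        with self._lock:
            self.total += 1
            self.successes += 1
            self.consecutive_failures = 0   # a success ends the streak

    def record_failure(self, path, error):
        with self._lock:
            self.total += 1
            self.failures += 1
            self.consecutive_failures += 1
            self.last_error = f"{path}: {error}"  # e.g. "raw: timeout"

h = HistorianHealth()
h.record_success()
h.record_failure("raw", "timeout")
h.record_failure("events", "timeout")
assert (h.total, h.successes, h.failures, h.consecutive_failures) == (3, 1, 2, 2)
assert h.last_error == "events: timeout"
```

With this shape, a health rule like "Degraded when `ConsecutiveFailures >= 3`" is a single comparison on the snapshot, and one success after a blip clears the degraded state.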
## Historian Read-Only Cluster Support
|
|
|
|
Updated: `2026-04-13 11:25-12:00 America/New_York`
|
|
|
|
Both instances updated with Wonderware Historian read-only cluster failover. Operators can supply an ordered list of historian cluster nodes; the plugin iterates them on each fresh connect and benches failed nodes for a configurable cooldown window. Single-node deployments are preserved via the existing `ServerName` field.
|
|
|
|
Backups:
|
|
- `C:\publish\lmxopcua\backups\20260413-112519-instance1`
|
|
- `C:\publish\lmxopcua\backups\20260413-112519-instance2`
|
|
|
|
Code changes:
|
|
- `Host/Configuration/HistorianConfiguration.cs` — added `ServerNames: List<string>` (defaults to `[]`) and `FailureCooldownSeconds: int` (defaults to 60). `ServerName` preserved as fallback when `ServerNames` is empty.
|
|
- `Host/Historian/HistorianClusterNodeState.cs` (new) — per-node DTO: `Name`, `IsHealthy`, `CooldownUntil`, `FailureCount`, `LastError`, `LastFailureTime`.
|
|
- `Host/Historian/HistorianHealthSnapshot.cs` — extended with `ActiveProcessNode`, `ActiveEventNode`, `NodeCount`, `HealthyNodeCount`, `Nodes: List<HistorianClusterNodeState>`.
|
|
- `Historian.Aveva/HistorianClusterEndpointPicker.cs` (new, internal) — pure picker with injected clock, thread-safe via lock, BFS-style `GetHealthyNodes()` / `MarkFailed()` / `MarkHealthy()` / `SnapshotNodeStates()`. Nodes iterate in configuration order; failed nodes skip until cooldown elapses; the cumulative `FailureCount` and `LastError` are retained across recovery for operator diagnostics.
|
|
- `Historian.Aveva/HistorianDataSource.cs` — new `ConnectToAnyHealthyNode(type)` method iterates picker candidates, clones `HistorianConfiguration` per attempt with the candidate as `ServerName`, and returns the first successful `(Connection, Node)` tuple. `EnsureConnected` and `EnsureEventConnected` both call it. `HandleConnectionError` and `HandleEventConnectionError` now mark the active node failed in the picker before nulling. `_activeProcessNode` / `_activeEventNode` track the live node for the dashboard. Both silos (process + event) share a single picker instance so a node failure on one immediately benches it for the other.
|
|
- `Host/Status/StatusData.cs` — added `NodeCount`, `HealthyNodeCount`, `ActiveProcessNode`, `ActiveEventNode`, `Nodes` to `HistorianStatusInfo`.
|
|
- `Host/Status/StatusReportService.cs` — Historian panel renders `Process Conn: open (<node>)` badges and a cluster table (when `NodeCount > 1`) showing each node's state, cooldown expiry, failure count, and last error. Single-node deployments render a compact `Node: <hostname>` line.
|
|
- `Host/Status/HealthCheckService.cs` — new Rule 2b3: `Degraded` when `NodeCount > 1 && HealthyNodeCount < NodeCount`. Lets operators alert on a partially-failed cluster even while queries are still succeeding via the remaining nodes.
|
|
- `Host/Configuration/ConfigurationValidator.cs` — logs the effective node list and `FailureCooldownSeconds` at startup, validates that `FailureCooldownSeconds >= 0`, warns when `ServerName` is set alongside a non-empty `ServerNames`.
|
|
|
|
Tests:
|
|
- `HistorianClusterEndpointPickerTests.cs` — 19 unit tests covering config parsing, ordered iteration, cooldown expiry, zero-cooldown mode, mark-healthy clears, cumulative failure counting, unknown-node safety, concurrent writers (thread-safety smoke test).
|
|
- `HistorianClusterFailoverTests.cs` — 6 integration tests driving `HistorianDataSource` via a scripted `FakeHistorianConnectionFactory`: first-node-fails-picks-second, all-nodes-fail, second-call-skips-cooled-down-node, single-node-legacy-behavior, picker-order-respected, shared-picker-across-silos.
|
|
- Full plugin suite: 41/41 tests passing. Host suite: 446/447 (1 pre-existing flaky MxAccess monitor test passes on retry).
|
|
|
|
Live verification on instance1 (cluster = `["does-not-exist-historian.invalid", "localhost"]`, `FailureCooldownSeconds=30`):
|
|
|
|
**Failover cycle 1** (fresh picker state, both nodes healthy):
|
|
```
|
|
2026-04-13 11:27:25.381 [WRN] Historian node does-not-exist-historian.invalid failed during connect attempt; trying next candidate
|
|
2026-04-13 11:27:25.910 [INF] Historian SDK connection opened to localhost:32568
|
|
```
|
|
- historyread returned 1 value successfully (`Queries: 1 (Success: 1, Failure: 0)`).
|
|
- Dashboard: panel yellow, `Cluster: 1 of 2 nodes healthy`, bad node `cooldown` until `11:27:55Z`, `Process Conn: open (localhost)`.
|
|
|
|
**Cooldown expiry**:
|
|
- At 11:29 UTC, the cooldown window had elapsed. Panel back to green, both nodes healthy, but `does-not-exist-historian.invalid` retains `FailureCount=1` and `LastError` as history.
|
|
|
|
**Failover cycle 2** (service restart to drop persistent connection):
|
|
```
|
|
2026-04-13 14:00:39.352 [WRN] Historian node does-not-exist-historian.invalid failed during connect attempt; trying next candidate
|
|
2026-04-13 14:00:39.885 [INF] Historian SDK connection opened to localhost:32568
|
|
```
|
|
- historyread returned 1 value successfully on the second restart cycle — proves the picker re-admits a cooled-down node and the whole failover cycle repeats cleanly.
|
|
|
|
**Single-node restoration**:
|
|
- Changed instance1 back to `"ServerNames": []`, restarted. Dashboard renders `Node: localhost` (no cluster table), panel green, backward compat verified.
|
|
|
|
Final configuration: both instances running with empty `ServerNames` (single-node mode). `LmxOpcUa` PID 31064, `LmxOpcUa2` PID 15012.
|
|
|
|
Operator configuration shape:

```json
"Historian": {
  "Enabled": true,
  "ServerName": "localhost", // ignored when ServerNames is non-empty
  "ServerNames": ["historian-a", "historian-b"],
  "FailureCooldownSeconds": 60,
  ...
}
```
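
The picker behavior exercised above — skip a cooled-down node, re-admit it when the window elapses, keep `FailureCount` as history — can be sketched in a few lines. This is an illustrative Python model under assumed names, not the shipped C# `HistorianDataSource`/picker code:

```python
import time

class NodePicker:
    """Hypothetical sketch of a failover picker with per-node cooldown.
    Nodes are tried in configured order; a node that just failed is
    skipped until its cooldown window elapses, then re-admitted."""

    def __init__(self, nodes, cooldown_seconds, clock=time.monotonic):
        self.nodes = list(nodes)
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.failed_at = {}                      # node -> time of last failure
        self.failure_count = {n: 0 for n in self.nodes}

    def candidates(self):
        now = self.clock()
        eligible = [n for n in self.nodes
                    if now - self.failed_at.get(n, -self.cooldown) >= self.cooldown]
        # If every node is cooling down, fall back to trying all of them
        # rather than failing outright.
        return eligible or list(self.nodes)

    def mark_failure(self, node):
        self.failed_at[node] = self.clock()
        self.failure_count[node] += 1            # retained as history after expiry
```

A node that fails drops out of `candidates()` for `cooldown_seconds`, then reappears — matching the two live failover cycles above — while `failure_count` survives as history, matching the cooldown-expiry observation.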
## Galaxy Runtime Status Probes + Subtree Quality Invalidation

Updated: `2026-04-13 15:28-16:19 America/New_York`

Both instances updated with per-host Galaxy runtime status tracking ($WinPlatform + $AppEngine), proactive subtree quality invalidation when a host transitions to Stopped, and an OPC UA Read short-circuit so operators can no longer read stale-Good cached values from a dead runtime host.

This ships the feature described in the `runtimestatus.md` plan file. It addresses the production issue reported earlier: "when an AppEngine is set to scan off, LMX updates are received for every tag, causing OPC UA client freeze and sometimes not all OPC UA tags are set to bad quality."

Backups:

- `C:\publish\lmxopcua\backups\20260413-152824-instance1`
- `C:\publish\lmxopcua\backups\20260413-152824-instance2`
Deployed binary (both instances):

- `ZB.MOM.WW.LmxOpcUa.Host.exe` — commit `98ed6bd`
- Initial deploy at 15:28, plus two incremental deploys during verification: 15:52 (Read-handler patch), 16:06 (dispatch-thread deadlock fix)

Windows services:

- `LmxOpcUa` — Running, PID `29528`
- `LmxOpcUa2` — Running, PID `30684`
### Code changes — what shipped

**New config** — `MxAccessConfiguration`:

- `RuntimeStatusProbesEnabled: bool` (default `true`) — enables `<Host>.ScanState` probing for every deployed `$WinPlatform` and `$AppEngine`.
- `RuntimeStatusUnknownTimeoutSeconds: int` (default `15`) — applies only to the Unknown → Stopped transition; running hosts never time out because `ScanState` is delivered on-change only.

**New hierarchy columns** — `hierarchy.sql` and `GalaxyObjectInfo`:

- `CategoryId: int` — populated from `template_definition.category_id` (1 = $WinPlatform, 3 = $AppEngine).
- `HostedByGobjectId: int` — populated from `gobject.hosted_by_gobject_id` (the actual column name on this Galaxy schema; the plan document's guess of `host_gobject_id` was wrong). Walked upward to find each variable's nearest Platform/Engine ancestor.
**New domain types** — `Host/Domain/`:

- `GalaxyRuntimeState` enum (`Unknown` / `Running` / `Stopped`).
- `GalaxyRuntimeStatus` DTO with callback/state-change timestamps, `LastScanState`, `LastError`, and cumulative counters.

**New probe manager** — `Host/MxAccess/GalaxyRuntimeProbeManager.cs`:

- Pure manager, no SDK leakage. Advises `<Host>.ScanState` (via `AdviseSupervisory`) for every runtime host on `SyncAsync`.
- State predicate: `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else is Stopped.
- `GetSnapshot()` forces every entry to `Unknown` when the MxAccess transport is disconnected — prevents a misleading "every host stopped" display when the actual problem is the transport.
- `Tick()` only advances Unknown → Stopped on the configured timeout; Running hosts never time out (on-change delivery semantics).
- `IsHostStopped(gobjectId)` — used by the Read-path short-circuit; reads the underlying state directly (not the snapshot's force-unknown rewrite) so a transport outage doesn't double-flag reads.
- `Dispose()` unadvises every active probe before MxAccess teardown.
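
The per-host state machine those bullets describe can be modeled compactly. A minimal Python sketch, assuming the same predicate and timeout semantics (illustrative names, not the C# implementation):

```python
from enum import Enum

class RuntimeState(Enum):
    UNKNOWN = "Unknown"
    RUNNING = "Running"
    STOPPED = "Stopped"

class ProbeEntry:
    """Sketch of one host's probe state: a good-quality boolean True means
    Running, anything else means Stopped, and only Unknown times out."""

    def __init__(self, unknown_timeout_s, now=0.0):
        self.state = RuntimeState.UNKNOWN
        self.since = now
        self.timeout = unknown_timeout_s

    def on_scan_state(self, quality_good, value, now):
        # State predicate mirrors the document: good quality AND bool True.
        running = quality_good and value is True
        self.state = RuntimeState.RUNNING if running else RuntimeState.STOPPED
        self.since = now

    def tick(self, now):
        # Only Unknown -> Stopped advances on timeout; Running never times
        # out because ScanState is delivered on-change only.
        if self.state is RuntimeState.UNKNOWN and now - self.since >= self.timeout:
            self.state = RuntimeState.STOPPED
```
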
**New hosted-variables map** — `LmxNodeManager`:

- `_hostedVariables: Dictionary<int, List<BaseDataVariableState>>` — host gobject_id → list of every descendant variable, populated during `BuildAddressSpace` by walking each variable's `HostedByGobjectId` chain up to the nearest Platform/Engine. A variable hosted by an Engine inside a Platform appears in BOTH lists.
- `_hostIdsByTagRef: Dictionary<string, List<int>>` — reverse index used by the Read short-circuit, populated alongside `_hostedVariables`.
- Public `MarkHostVariablesBadQuality(int gobjectId)` — walks `_hostedVariables[gobjectId]`, sets `StatusCode = BadOutOfService` on each, and calls `ClearChangeMasks(ctx, false)` to push the change through the OPC UA publisher.
- Public `ClearHostVariablesBadQuality(int gobjectId)` — the inverse; resets to `Good` on recovery.
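
The dual registration (a variable appearing under both its Engine and its Platform) follows directly from walking the hosted-by chain. A hypothetical Python sketch of that walk, not the C# `BuildAddressSpace` code:

```python
def build_hosted_variables(variables, host_of, is_runtime_host):
    """Sketch of the hosted-variables walk (illustrative names).

    variables:       {var_name: owning_gobject_id}
    host_of:         {gobject_id: hosting gobject_id or None}
    is_runtime_host: set of gobject_ids that are $WinPlatform/$AppEngine
    """
    hosted = {}   # host gobject_id -> [var_name, ...]
    for var, gid in variables.items():
        node = gid
        while node is not None:
            if node in is_runtime_host:
                # Registered under EVERY runtime-host ancestor, so an
                # Engine-hosted variable also lands in the Platform list.
                hosted.setdefault(node, []).append(var)
            node = host_of.get(node)
    return hosted
```
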
**OPC UA Read short-circuit** — `LmxNodeManager.Read`:

- Before the normal `_mxAccessClient.ReadAsync(tagRef)` round-trip, check `IsTagUnderStoppedHost(tagRef)`. If true, return a `DataValue { StatusCode = BadOutOfService, Value = cachedVar?.Value }` directly. Covers both direct Read requests AND OPC UA monitored-item sampling, which both flow through this override.
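
In essence the short-circuit is a set-membership check ahead of the backend call. A small Python sketch with illustrative names (the status values match the codes logged later in this section):

```python
BAD_OUT_OF_SERVICE = 0x808D0000
GOOD = 0x00000000

def read_tag(tag, host_ids_by_tag, stopped_hosts, cached_values, backend_read):
    """Sketch of the Read short-circuit: if any host above the tag is
    Stopped, answer BadOutOfService with the cached value instead of
    making the backend round-trip."""
    if any(h in stopped_hosts for h in host_ids_by_tag.get(tag, [])):
        return (BAD_OUT_OF_SERVICE, cached_values.get(tag))
    return (GOOD, backend_read(tag))
```
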
**Deadlock fix — `_pendingHostStateChanges` queue**:

- The first draft invoked `MarkHostVariablesBadQuality` synchronously from the probe callback. MxAccess delivers `OnDataChange` on the STA thread; the callback took the node manager `Lock`. Meanwhile any worker thread inside `Read` could hold `Lock` and wait on a pending `ReadAsync` that needed the STA thread — a **classic STA deadlock** (the first real deploy hung in ~30s).
- Fix: probe transitions are enqueued on a `ConcurrentQueue<(int GobjectId, bool Stopped)>` and the dispatch thread drains the queue inside its existing 100ms `WaitOne` loop. The dispatch thread takes `Lock` naturally without STA involvement, so there is no cycle. Live-verified with the IDE OffScan/OnScan cycle after the fix.
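
The shape of the fix — callbacks only enqueue, one dispatch thread drains and locks — can be sketched as follows. Illustrative Python, not the C# `ConcurrentQueue`/`WaitOne` code:

```python
from collections import deque
import threading

class HostStateDispatcher:
    """Sketch of the deadlock fix: the probe callback never takes the
    lock; it only appends. A single dispatch thread drains the queue and
    takes the lock, so the callback thread is out of the lock cycle."""

    def __init__(self, on_stopped, on_running):
        self._queue = deque()
        self._lock = threading.Lock()     # stands in for the node-manager Lock
        self._on_stopped = on_stopped
        self._on_running = on_running

    def enqueue(self, gobject_id, stopped):
        # Called from the probe callback thread; lock-free.
        self._queue.append((gobject_id, stopped))

    def drain(self):
        # Called from the dispatch thread inside its existing poll loop.
        while self._queue:
            gobject_id, stopped = self._queue.popleft()
            with self._lock:
                (self._on_stopped if stopped else self._on_running)(gobject_id)
```
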
**Dashboard** — `Host/Status/`:

- New `RuntimeStatusInfo` DTO + a "Galaxy Runtime" panel between Galaxy Info and Historian. Shows total/running/stopped/unknown counts plus a per-host table with Name / Kind / State / Since / Last Error columns. Panel color: green (all Running), yellow (some Unknown, none Stopped), red (any Stopped), gray (MxAccess disconnected forces every row to Unknown).
- The Subscriptions panel gains a new `Probes: N (bridge-owned runtime status)` line when non-zero.
- `HealthCheckService` Rule 2e: `Degraded` when any host is Stopped, ordered after Rule 1 (MxAccess transport) to avoid double-messaging when the transport is the root cause.
### Tests

- **24** new `GalaxyRuntimeProbeManagerTests`: state transitions (Unknown/Running/Stopped/recovery), unknown-resolution timeout, transport gating, sync diff, dispose, callback exception safety, `IsHostStopped` for the Read-path short-circuit (Unknown/Running/Stopped/recovery/unknown-id/transport-disconnected-contract).
- Full Host suite: **471/471** tests passing. No regressions.
### Live end-to-end verification (today, against a real IDE OffScan action)

**Baseline** (before OffScan, dashboard at 15:44:00):

```
Galaxy Runtime: green, 2 of 2 hosts running
DevAppEngine $AppEngine Running 2026-04-13T19:29:12.9475357Z
DevPlatform $WinPlatform Running 2026-04-13T19:29:12.9345208Z
TestMachine_001.MachineID → Status 0x00000000 (Good), value "admin_test"
```
**After operator Set OffScan on DevAppEngine in the IDE** (log at 15:44:25):

```
15:44:25.554 Galaxy runtime DevAppEngine.ScanState transitioned Running → Stopped (ScanState = false (OffScan))
15:44:25.557 Marked 3971 variable(s) BadOutOfService for stopped host gobject_id=1043
```

Dashboard: red panel, `1 of 2 hosts running (1 stopped, 0 unknown)`. Health: `Degraded — Galaxy runtime has 1 of 2 host(s) stopped: DevAppEngine`. Critical path: 3ms from probe callback to subtree walk complete.
**Read during stop — found bug #1** (Read handler bypassed cached state):

- Initial deploy: `TestMachine_001.MachineID` still read `0x00000000` Good with a post-stop source time from MxAccess. This revealed that `LmxNodeManager.Read` calls `_mxAccessClient.ReadAsync()` directly and never consults the in-memory `BaseDataVariableState.StatusCode` we set during the walk.
- Fix: the `IsTagUnderStoppedHost` short-circuit in the Read override. After the patch: `[808D0000] BadOutOfService` on all three test tags.
**Read during stop — found bug #2** (deadlock):

- After shipping the Read patch, the service hung on the next OffScan. The HTTP listener accepted connections but never responded, and service shutdown stuck at STOP_PENDING for 15+ seconds until manually killed.
- Diagnosis: the probe callback fires `HandleProbeUpdate` → `MarkHostVariablesBadQuality` → acquires `Lock` on the STA thread. Meanwhile the dispatch thread can sit inside `Read` holding `Lock` and waiting for an STA-routed `ReadAsync`. Circular wait.
- Fix: enqueue probe transitions onto a `ConcurrentQueue` and drain them on the dispatch thread, where `Lock` acquisition is safe. The second deploy resolved the hang.
**A/B verification** (instance1 patched, instance2 not yet):

| Instance | `TestMachine_001.MachineID` |
|---|---|
| `LmxOpcUa` (patched) | `0x808D0000` BadOutOfService ✅ |
| `LmxOpcUa2` (old) | `0x00000000` Good, stale ❌ |

A clean A/B confirmed the Read patch is required; instance2 was subsequently updated to match.
**Recovery** (operator Set OnScan on DevAppEngine, log at 16:10:05):

```
16:10:05.129 Galaxy runtime DevAppEngine.ScanState transitioned → Running
16:10:05.130 Cleared bad-quality override on 3971 variable(s) for recovered host gobject_id=1043
```

Dashboard: back to green, `DevAppEngine` Running with new `Since = 20:10:05.129Z`. All three test tags back to `0x00000000` Good with fresh source timestamps. 1ms from probe callback to subtree clear.
### Client freeze observation — phase 2 decision gate

The original production issue had two symptoms: (1) incomplete quality flip and (2) OPC UA client freeze. The subtree walk + Read short-circuit fix (1) definitively. For (2), there is still a potential dispatch-queue flood: MxAccess fans out per-tag callbacks when a host stops, and the bridge does not currently drop them. We **deliberately did not** ship dispatch suppression in this pass, on the grounds that the subtree walk may coalesce notifications sufficiently at the SDK publisher level to resolve the freeze on its own. Verification against the live Galaxy with no OPC UA clients subscribed doesn't tell us one way or the other — the next subscribed-client test against a real stop will be the deciding measurement. If the client still freezes after the walk, phase 2 adds pre-dispatch filtering for tags under Stopped hosts.
### What's deferred

- **Synthetic OPC UA child nodes** (`$RuntimeState`, `$LastCallbackTime`, etc.) under each host object. The dashboard + health surface give operators visibility today; the OPC UA synthetic nodes are a follow-up.
- **Dispatch suppression** — gated on observing whether the subtree walk alone resolves the client freeze in production.
- **Documentation updates** — the `docs/` guides (`MxAccessBridge.md`, `StatusDashboard.md`, `Configuration.md`, `HistoricalDataAccess.md`) still describe the pre-runtime-status behavior. They need a consolidated doc pass covering this feature plus the historian cluster + health surface updates from earlier today.
## Stability Review Fixes 2026-04-14

Code changes only — **not yet deployed** to the instance1/instance2 services. Closes all four residual findings from `docs/stability-review-20260413.md`; the document was green on shipped features but flagged latent defects that degraded the stability guarantees the runtime-status feature relies on. Deploy procedure at the end of this section.
### Findings closed

**Finding 1 (High) — Probe rollback on subscribe failure.**

`GalaxyRuntimeProbeManager.SyncAsync` pre-populated `_byProbe` / `_probeByGobjectId` before awaiting `SubscribeAsync`. When the advise call threw, the catch block logged a warning but left the phantom entry in place; `Tick()` later transitioned it from Unknown to Stopped after `RuntimeStatusUnknownTimeoutSeconds`, firing `_onHostStopped` and walking the subtree of a host that was never actually advised. In a codebase where the same probe manager also drives the Read-path short-circuit and subtree quality invalidation (the 2026-04-13 feature), a false negative here fans out into hundreds of BadOutOfService flags on live variables. Fix: promote `toSubscribe` to `List<(int GobjectId, string Probe)>` so the catch path can reacquire `_lock` and remove the entry from both dictionaries. The rollback compares against the captured probe string before removing, so a concurrent resync cannot accidentally delete a legitimate re-add.
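
The rollback pattern — tentative registration, compare-before-remove on failure — can be sketched as follows. Python stand-in with assumed names, not the C# `SyncAsync`:

```python
class ProbeRegistry:
    """Sketch of the Finding 1 fix: register the probe tentatively, and
    roll back BOTH indexes if the subscribe call throws — comparing
    against the captured probe string so a concurrent re-add under a new
    probe name is not clobbered."""

    def __init__(self, subscribe):
        self.by_probe = {}           # probe string -> gobject id
        self.probe_by_gobject = {}   # gobject id -> probe string
        self._subscribe = subscribe

    def sync_one(self, gobject_id, probe):
        self.by_probe[probe] = gobject_id
        self.probe_by_gobject[gobject_id] = probe
        try:
            self._subscribe(probe)
        except Exception:
            # Rollback: remove only if this is still the entry we added.
            if self.probe_by_gobject.get(gobject_id) == probe:
                del self.probe_by_gobject[gobject_id]
            self.by_probe.pop(probe, None)
```
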
**Finding 2 (Medium) — Surface dashboard bind failure.**

`StatusWebServer.Start()` already returned `bool`, but `OpcUaService.Start()` ignored it, so a failed bind (port in use, permissions) was invisible at the service level. Fix: capture the return value; on `false`, log a Warning (`Status dashboard failed to bind on port {Port}; service continues without dashboard`), dispose the unstarted instance, and set a new `OpcUaService.DashboardStartFailed` property. Degraded mode — matching the established precedent for other optional startup subsystems (MxAccess connect, Galaxy DB connect, initial address space build).
**Finding 3 (Medium) — Bounded timeouts on sync-over-async.**

Seven sync-over-async `.GetAwaiter().GetResult()` sites in `LmxNodeManager` (rebuild probe sync, Read, Write, HistoryReadRaw/Processed/AtTime/Events) blocked the OPC UA stack thread without an outer bound. The inner `MxAccessClient.ReadAsync` / `WriteAsync` calls already apply a per-call `CancelAfter`, but `SubscribeAsync`, `SyncAsync`, and the historian reads did not — and the pattern itself is a stability risk regardless of inner behavior. Fix: a new `SyncOverAsync.WaitSync(task, timeout, operation)` helper plus two new config fields, `MxAccess.RequestTimeoutSeconds=30` and `Historian.RequestTimeoutSeconds=60`. Every sync-over-async site now wraps the task in `WaitSync`, catches `TimeoutException` explicitly, and maps it to `StatusCodes.BadTimeout` (or logs a warning and continues in the rebuild case — probe sync is advisory). `ConfigurationValidator` rejects `RequestTimeoutSeconds < 1` for both and warns when operators set the outer bound below the inner read/write/command timeout.
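
The `WaitSync` contract — bounded blocking wait, operation name in the timeout error, underlying work left to run — maps onto Python's `Future.result` like this. A sketch of the contract only, not the .NET helper:

```python
import concurrent.futures

def wait_sync(future, timeout_s, operation):
    """Sketch of a bounded sync-over-async wait: block at most timeout_s,
    raise a TimeoutError naming the operation on expiry. The underlying
    work is NOT cancelled — it runs to completion and is abandoned, as
    the risk notes in this document describe."""
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"{operation} exceeded {timeout_s}s") from None
```
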
**Finding 4 (Low) — Track fire-and-forget subscribes.**

Alarm auto-subscribe, subtree alarm auto-subscribe, and transferred-subscription restore all called `_mxAccessClient.SubscribeAsync(...).ContinueWith(..., OnlyOnFaulted)` with no tracking, so shutdown raced pending subscribes and ordering was impossible to reason about. Fix: a new `TrackBackgroundSubscribe(tag, context)` helper in `LmxNodeManager` that stashes the task in `_pendingBackgroundSubscribes` (a `ConcurrentDictionary<long, Task>` keyed by a monotonic `Interlocked.Increment` id) and adds a continuation that removes the entry and logs faults with the supplied context. `Dispose(bool)` drains the dictionary with `Task.WaitAll(snapshot, 5s)` after stopping the dispatch thread — bounded so shutdown cannot stall on a hung backend, and logged at Info so operators can see the drain count.
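
The tracking scheme reduces to a registry with monotonic ids, self-removal on completion, and a bounded drain. An illustrative Python sketch using threads in place of `Task`s:

```python
import itertools
import threading

class BackgroundSubscribeTracker:
    """Sketch of the Finding 4 fix: every fire-and-forget job is
    registered under a monotonic id, removes itself on completion, and
    shutdown drains whatever is still pending with a bounded wait."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}           # id -> threading.Thread

    def track(self, fn, *args):
        task_id = next(self._ids)
        def run():
            try:
                fn(*args)            # the real helper also logs faults here
            finally:
                self._pending.pop(task_id, None)   # self-removal on completion
        t = threading.Thread(target=run)
        self._pending[task_id] = t   # register BEFORE starting the thread
        t.start()
        return task_id

    def drain(self, timeout_s):
        # Bounded drain: shutdown continues even if something is hung.
        for t in list(self._pending.values()):
            t.join(timeout_s)
        return len(self._pending)    # leftover count, logged by the caller
```
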
### Code changes

- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` — `toSubscribe` carries the gobject id; the catch path rolls back both dictionaries under `_lock`, with a concurrent-overwrite guard.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs` — captures the `StatusWeb.Start()` return; `DashboardStartFailed` internal property; disposes the unstarted instance on failure.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Utilities/SyncOverAsync.cs` (new) — `WaitSync<T>(Task<T>, TimeSpan, string)` and a non-generic overload with inner-exception unwrap.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/MxAccessConfiguration.cs` — `RequestTimeoutSeconds: int = 30`.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/HistorianConfiguration.cs` — `RequestTimeoutSeconds: int = 60`.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs` — logs both new values, rejects `< 1`, warns on inner/outer misorder.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs` — constructor takes the two new timeout values (with defaults); seven sync-over-async call sites wrapped in `SyncOverAsync.WaitSync` + a `TimeoutException → BadTimeout` catch; `TrackBackgroundSubscribe` helper; `_pendingBackgroundSubscribes` dictionary + `_backgroundSubscribeCounter`; `DrainPendingBackgroundSubscribes()` in `Dispose`; three fire-and-forget sites replaced with helper calls.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxOpcUaServer.cs` — constructor plumbing for the two new timeouts.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/OpcUaServerHost.cs` — accepts `HistorianConfiguration`, threads both timeouts through to `LmxOpcUaServer`.
### Tests

- `tests/.../MxAccess/GalaxyRuntimeProbeManagerTests.cs` — 3 new tests: `Sync_SubscribeThrows_DoesNotLeavePhantomEntry`, `Sync_SubscribeThrows_TickDoesNotFireStopCallback`, `Sync_SubscribeSucceedsAfterRetry_AppearsInSnapshot`. They use the existing `FakeMxAccessClient.SubscribeException` hook — no helper changes needed.
- `tests/.../Status/StatusWebServerTests.cs` — 1 new test: `Start_WhenPortInUse_ReturnsFalse`. Grabs a port with a throwaway `HttpListener`, tries to start `StatusWebServer` on the same port, asserts `Start()` returns `false`.
- `tests/.../Wiring/OpcUaServiceDashboardFailureTests.cs` (new) — 1 test: `Start_DashboardPortInUse_ContinuesInDegradedMode`. Builds a full `OpcUaService` with `FakeMxProxy` + `FakeGalaxyRepository`, binds the dashboard port externally, starts the service, asserts `ServerHost != null`, `DashboardStartFailed == true`, `StatusWeb == null`.
- `tests/.../Utilities/SyncOverAsyncTests.cs` (new) — 7 tests covering the happy path, never-completing task → TimeoutException with operation name, faulted task → inner exception unwrap, null-task arg check.
- `tests/.../Configuration/ConfigurationLoadingTests.cs` — 3 new tests: `Validator_MxAccessRequestTimeoutZero_ReturnsFalse`, `Validator_HistorianRequestTimeoutZero_ReturnsFalse`, `Validator_DefaultRequestTimeouts_AreSensible`.

**Test results:** full suite **486/486** passing. The first run hit a single transient failure in `ChangeDetectionServiceTests.ChangedTimestamp_TriggersAgain` (a pre-existing timing-sensitive test — poll interval 1s with 500ms + 1500ms sleeps races under load); the test passes on retry and is unrelated to these changes. The 15 new tests added by this pass were all green on both runs.
### Documentation updates

- `docs/MxAccessBridge.md` — Runtime-status section gains a new point 5 documenting the subscribe-failure rollback; new "Request Timeout Safety Backstop" section describing the outer `RequestTimeoutSeconds` bound.
- `docs/HistoricalDataAccess.md` — Config class snippet and property table updated with `RequestTimeoutSeconds`.
- `docs/ServiceHosting.md` — Step 12 (startup sequence) documents the degraded-mode dashboard policy and the new `DashboardStartFailed` flag.
- `docs/Configuration.md` — `MxAccess.RequestTimeoutSeconds` (30s) and `Historian.RequestTimeoutSeconds` (60s) added to both the property tables and the `appsettings.json` full example.
- `docs/StatusDashboard.md` — New subsection "Dashboard start failures are non-fatal" with the log grep operators should use.
### Deploy plan (not yet executed)

This is a code-only change; the built binary has not been copied to `C:\publish\lmxopcua\instance1` / `instance2` yet. When deploying, follow the procedure from the 2026-04-13 runtime-status deploy (service_info.md:572-680):

1. Backup `C:\publish\lmxopcua\instance1` and `instance2` to `backups\20260414-<HHMMSS>-instance{1,2}`. Preserve each `appsettings.json`.
2. Build the Host project in Release and copy `ZB.MOM.WW.LmxOpcUa.Host.exe` (and any changed DLLs) to both instance roots. The Historian plugin layout at `<instance>\Historian\` is unchanged.
3. Restart the `LmxOpcUa` and `LmxOpcUa2` Windows services.
4. In the startup log for each instance, verify the new config echoes appear:
   - `MxAccess.RuntimeStatusProbesEnabled=..., RuntimeStatusUnknownTimeoutSeconds=15s, RequestTimeoutSeconds=30s`
   - `Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000, FailureCooldownSeconds=60, RequestTimeoutSeconds=60`
   - `=== Configuration Valid ===`
   - `LmxOpcUa service started successfully`
5. CLI smoke test on both endpoints (matches the 2026-03-25 baseline at service_info.md:370-376):
   - `opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa`
   - `opcuacli-dotnet.exe connect -u opc.tcp://localhost:4841/LmxOpcUa`
   - `opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa` → ServiceLevel=200 (primary)
   - `opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa` → ServiceLevel=150 (secondary)
6. Runtime-status regression check (the most sensitive thing these fixes could break): repeat the IDE OffScan / OnScan cycle documented at service_info.md:630-669. The dashboard at `http://localhost:8085/` must go red on OffScan and green on OnScan; `TestMachine_001.MachineID` must flip between `0x808D0000 BadOutOfService` and `0x00000000 Good` at each transition with the same sub-100ms latency as the original deploy.
7. Record PIDs and live verification results in a follow-up section of this file (`## Stability Review Fixes 2026-04-14 — Deploy`), matching the layout conventions from earlier entries.
### Finding 1 manual regression check

Before the regression tests landed, the only way to exercise the bug in production was to temporarily revoke the MxAccess user's probe subscription permission or point the probe manager at a non-existent host. After the fix, the same scenarios should leave `GetSnapshot()` empty (no phantom entries) and the dashboard Galaxy Runtime panel should read `0 of N hosts` rather than `0 running, N stopped`. The three new `GalaxyRuntimeProbeManagerTests` cover this deterministically via `FakeMxAccessClient.SubscribeException`, so a future regression is caught at CI time.
### Risk notes

- **Timeout floor discipline.** The two new `RequestTimeoutSeconds` values have conservative defaults (30s MxAccess, 60s Historian). Setting them too low would cause spurious `BadTimeout` errors on a slow-but-healthy backend. `ConfigurationValidator` rejects `< 1` and warns below inner timeouts so misconfiguration is visible at startup.
- **Abandoned tasks on timeout.** `SyncOverAsync.WaitSync` does not cancel the underlying task — it runs to completion on the thread pool and is abandoned. This is acceptable because the MxAccess / Historian clients are shared singletons whose background work does not capture request-scoped state.
- **Background subscribe drain window.** 5 seconds is enough for healthy subscribes to settle but not long enough to stall shutdown if MxAccess is hung. If the drain times out, shutdown continues — this is intentional.
- **Probe rollback concurrency.** The catch path reacquires `_lock` after the `await`. A concurrent `SyncAsync` may have re-added the same gobject under a new probe name; the code compares against the captured probe string before removing, so a legitimate re-add is not clobbered.
## Stability Review Fixes 2026-04-14 — Deploy

Updated: `2026-04-14 00:40-00:43 America/New_York`

Both instances redeployed with the stability-review fixes documented above. Closes all four findings from `docs/stability-review-20260413.md` on the live services.

Backups:

- `C:\publish\lmxopcua\backups\20260414-003948-instance1` — pre-deploy `ZB.MOM.WW.LmxOpcUa.Host.exe` (7,997,952 bytes) + `appsettings.json`
- `C:\publish\lmxopcua\backups\20260414-003948-instance2` — pre-deploy `ZB.MOM.WW.LmxOpcUa.Host.exe` (7,997,952 bytes) + `appsettings.json`

Configuration preserved:

- Neither `appsettings.json` was overwritten. The two new fields (`MxAccess.RequestTimeoutSeconds`, `Historian.RequestTimeoutSeconds`) inherit their defaults from the binary (30s and 60s respectively). Operators can opt into explicit values by editing `appsettings.json`; the defaults are logged at startup regardless.
Deployed binary (both instances):

- `ZB.MOM.WW.LmxOpcUa.Host.exe`
- Last write time: `2026-04-14 00:40:48 -04:00`
- Size: `7,986,688` (down 11,264 bytes from the previous build — three fire-and-forget `.ContinueWith` blocks were replaced with a single `TrackBackgroundSubscribe` helper)

Pre-deploy state note: both services were STOPPED when the deploy started (`sc.exe query` reported `WIN32_EXIT_CODE=1067`), but two host processes were still alive (`tasklist` showed PID 34828 holding instance1 and PID 27036 holding instance2). The zombies held open file handles on both exes, so the SCM's "STOPPED" state was misleading — the previous service processes were still running outside the SCM's control. The zombie processes were terminated with `taskkill //F` before copying the new binary. This was a one-shot clean-up; the new deploy does not require the same.
Windows services:

- `LmxOpcUa` — Running, PID `32884`
- `LmxOpcUa2` — Running, PID `40796`

Restart evidence (instance1 `logs/lmxopcua-20260414.log`):

```
2026-04-14 00:40:55.759 -04:00 [INF] MxAccess.RuntimeStatusProbesEnabled=true, RuntimeStatusUnknownTimeoutSeconds=15s, RequestTimeoutSeconds=30s
2026-04-14 00:40:55.791 -04:00 [INF] Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000, FailureCooldownSeconds=60, RequestTimeoutSeconds=60
2026-04-14 00:40:55.794 -04:00 [INF] === Configuration Valid ===
2026-04-14 00:41:02.406 -04:00 [INF] Historian plugin loaded from C:\publish\lmxopcua\instance1\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
2026-04-14 00:41:06.870 -04:00 [INF] LmxOpcUa service started successfully
```

Restart evidence (instance2 `logs/lmxopcua-20260414.log`):

```
2026-04-14 00:40:56.812 -04:00 [INF] MxAccess.RuntimeStatusProbesEnabled=true, RuntimeStatusUnknownTimeoutSeconds=15s, RequestTimeoutSeconds=30s
2026-04-14 00:40:56.847 -04:00 [INF] Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000, FailureCooldownSeconds=60, RequestTimeoutSeconds=60
2026-04-14 00:40:56.850 -04:00 [INF] === Configuration Valid ===
2026-04-14 00:41:07.805 -04:00 [INF] Historian plugin loaded from C:\publish\lmxopcua\instance2\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
2026-04-14 00:41:12.008 -04:00 [INF] LmxOpcUa service started successfully
```

The two new `RequestTimeoutSeconds` values are visible in both startup traces, confirming the new configuration plumbing reached `ConfigurationValidator`. Startup latency (config-valid → service-started): instance1 ~11.1s, instance2 ~15.2s — within the normal envelope for the Historian plugin load sequence.
CLI verification (via `dotnet run --project src/ZB.MOM.WW.LmxOpcUa.Client.CLI`):

```
connect opc.tcp://localhost:4840/LmxOpcUa → Server: LmxOpcUa, Security: None, Connection successful
connect opc.tcp://localhost:4841/LmxOpcUa → Server: LmxOpcUa2, Security: None, Connection successful
redundancy opc.tcp://localhost:4840/LmxOpcUa → Warm, ServiceLevel=200, urn:localhost:LmxOpcUa:instance1
redundancy opc.tcp://localhost:4841/LmxOpcUa → Warm, ServiceLevel=150, urn:localhost:LmxOpcUa:instance2
read opc.tcp://localhost:4840/LmxOpcUa -n 'ns=3;s=MESReceiver_001.MoveInPartNumbers'
→ Value: System.String[]
→ Status: 0x00000000
→ Source: 2026-04-14T04:43:46.2267096Z
```

The primary advertises ServiceLevel 200 and the secondary 150 — the redundancy baseline is preserved. End-to-end data flow is healthy: the Read on `MESReceiver_001.MoveInPartNumbers` returns Good quality with a fresh source timestamp, confirming MxAccess is connected and the address space is populated. Note that the namespace is now `ns=3`, not the `ns=1` listed in the 2026-03-25 baseline at the top of this file — the auth-consolidation deploy on 2026-03-28 moved the Galaxy namespace to `ns=3`, and that move has carried through every deploy since. The top-of-file `ns=1` CLI example should be treated as historical.
### CLI tooling note

The earlier service_info.md entry referenced `tools/opcuacli-dotnet/bin/Debug/net10.0/opcuacli-dotnet.exe`. That binary does not exist on the current checkout; the CLI lives at `src/ZB.MOM.WW.LmxOpcUa.Client.CLI/` and must be invoked via `dotnet run --project src/ZB.MOM.WW.LmxOpcUa.Client.CLI`. The README / `docs/Client.CLI.md` should be the source of truth going forward.
### Runtime-status regression check

**Not performed in this deploy.** The runtime-status subtree-walk / Read-short-circuit regression check from service_info.md:630-669 requires an operator to flip a `$AppEngine` OffScan in the AVEVA IDE and observe the dashboard + Read behavior, which needs a real operator session. The automated CLI smoke test above does not exercise the probe-manager callback path.

The code changes in this deploy are defensive and do not alter the runtime-status feature's control flow except in one place (subscribe rollback, which only triggers when `SubscribeAsync` throws). The 471/471 baseline on the probe manager tests plus the three new rollback regression tests give high confidence that the runtime-status behavior is preserved. If a human operator runs the IDE OffScan/OnScan cycle and observes an anomaly, the fix is most likely isolated to `GalaxyRuntimeProbeManager.SyncAsync` — see Finding 1 above — and can be reverted by restoring `C:\publish\lmxopcua\backups\20260414-003948-instance{1,2}\ZB.MOM.WW.LmxOpcUa.Host.exe`.
## Galaxy Platform Scope Filter

Updated: `2026-04-16 00:21-00:27 America/New_York`

Both instances updated with a new `GalaxyRepository.Scope` configuration flag that controls whether the OPC UA server loads the entire Galaxy or only objects hosted by the local platform. This reduces address space size, MxAccess subscription count, and memory footprint on multi-node Galaxy deployments.
Backups:

- `C:\publish\lmxopcua\backups\20260416-002120-instance1`
- `C:\publish\lmxopcua\backups\20260416-002120-instance2`

Configuration preserved:

- Both `appsettings.json` files updated with the new fields only (`Scope`, `PlatformName`). All existing settings preserved.

Deployed binary (both instances):

- `ZB.MOM.WW.LmxOpcUa.Host.exe`
- Last write time: `2026-04-16 00:23 -04:00`
- Size: `7,993,344`

Windows services:

- `LmxOpcUa` — Running, PID `15204`
- `LmxOpcUa2` — Running, PID `9544`
### Code changes

- `Configuration/GalaxyScope.cs` (new) — enum: `Galaxy` (default, all deployed objects), `LocalPlatform` (only objects hosted by the local platform's subtree).
- `Domain/PlatformInfo.cs` (new) — DTO mapping `platform.platform_gobject_id` to `platform.node_name`.
- `Configuration/GalaxyRepositoryConfiguration.cs` — added `Scope: GalaxyScope` (default `Galaxy`) and `PlatformName: string?` (optional override for `Environment.MachineName`).
- `GalaxyRepository/PlatformScopeFilter.cs` (new) — stateless C# filter applied after the existing SQL queries (preserves the `GR-006: const string, no dynamic SQL` convention). Algorithm: (1) resolve the local platform's `gobject_id` via a new `PlatformLookupSql` query against the `platform` table, (2) collect all AppEngine hosts under that platform, (3) include all objects hosted by any host in the set, (4) walk `ParentGobjectId` chains upward to retain ancestor areas for a connected browse tree.
- `GalaxyRepository/GalaxyRepositoryService.cs` — added the `PlatformLookupSql` const query, a `GetPlatformsAsync()` method, and post-query filtering in `GetHierarchyAsync`/`GetAttributesAsync` with a cached `_scopeFilteredGobjectIds` set for cross-method consistency.
- `Configuration/ConfigurationValidator.cs` — logs `Scope` and the effective `PlatformName` at startup.
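
The four-step algorithm can be sketched against a toy object graph. Hypothetical Python, using in-memory dictionaries where the real filter consumes the SQL query results:

```python
def scope_to_platform(objects, platform_gid):
    """Sketch of the LocalPlatform filter. objects maps
    gobject_id -> (hosted_by, parent); hosted_by drives steps 2-3 and
    parent drives the ancestor-area walk. Returns the retained id set."""
    # Step 2: hosts under the platform — the platform itself plus the
    # engines it hosts.
    hosts = {platform_gid}
    hosts |= {gid for gid, (hosted_by, _) in objects.items()
              if hosted_by == platform_gid}
    # Step 3: everything hosted by any host in the set.
    retained = set(hosts)
    retained |= {gid for gid, (hosted_by, _) in objects.items()
                 if hosted_by in hosts}
    # Step 4: walk parent chains upward so ancestor areas stay in the
    # browse tree and the result is connected.
    for gid in list(retained):
        parent = objects[gid][1]
        while parent is not None and parent not in retained:
            retained.add(parent)
            parent = objects[parent][1]
    return retained
```

Step 1 (resolving the platform's gobject_id by node name) is omitted here because it is a plain SQL lookup in the real code.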
### Configuration

New fields in the `GalaxyRepository` section of `appsettings.json`:

```json
"GalaxyRepository": {
  "Scope": "Galaxy",
  "PlatformName": null
}
```

- `Scope`: `"Galaxy"` (default) loads all deployed objects; `"LocalPlatform"` filters to the local platform only.
- `PlatformName`: when null, uses `Environment.MachineName`. Set explicitly to target a specific platform by hostname.

Both instances deployed with `"Scope": "Galaxy"` (full Galaxy, backward-compatible default).

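The default resolution described above can be sketched in Python (a hedged illustration; the real code binds `GalaxyRepositoryConfiguration` in the C# host, where the fallback is `Environment.MachineName` rather than `socket.gethostname()`):

```python
import socket

VALID_SCOPES = {"Galaxy", "LocalPlatform"}


def resolve_scope_settings(section):
    """Return (scope, effective_platform_name) from a GalaxyRepository
    config section dict, applying the documented defaults."""
    scope = section.get("Scope") or "Galaxy"  # default: full Galaxy
    if scope not in VALID_SCOPES:
        raise ValueError(f"Unknown GalaxyRepository.Scope: {scope!r}")
    # PlatformName null/absent -> fall back to the machine name
    # (Environment.MachineName in the C# host; gethostname() here).
    platform_name = section.get("PlatformName") or socket.gethostname()
    return scope, platform_name
```

Treating an absent `Scope` as `"Galaxy"` is what makes the new flag backward-compatible: existing config files without the field keep their current behavior.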
### Tests

8 new unit tests in `PlatformScopeFilterTests.cs`:
- Two-platform Galaxy filtering (platform A, platform B)
- Case-insensitive node name matching
- No matching platform → empty result
- Ancestor area inclusion for connected tree
- Area exclusion when no local descendants
- Attribute filtering by gobject_id set
- Original order preservation

Full suite: **494/494** tests passing (8 new, 0 regressions).

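The attribute-filtering and order-preservation cases above boil down to a set-membership pass over the attribute rows. A minimal Python sketch (hypothetical dict shapes; the real tests exercise the C# filter and the cached `_scopeFilteredGobjectIds` set):

```python
def filter_attributes(attributes, retained_gobject_ids):
    """Drop attributes whose owning object was filtered out of scope,
    preserving the original row order. Mirrors the post-query filtering
    keyed on the retained gobject_id set."""
    return [a for a in attributes if a["gobject_id"] in retained_gobject_ids]
```

Reusing the same retained-id set for both hierarchy and attribute results is what guarantees cross-method consistency: an object can never appear in the browse tree without its attributes, or vice versa.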
### Live verification

**LocalPlatform scope test** (instance1, temporarily set to `"Scope": "LocalPlatform"`):

```
Startup log:
GalaxyRepository.Scope="LocalPlatform", PlatformName=DESKTOP-6JL3KKO
GalaxyRepository.PlatformName not set — using Environment.MachineName 'DESKTOP-6JL3KKO'
GetHierarchyAsync returned 49 objects
GetPlatformsAsync returned 1 platform(s)
Scope filter targeting platform 'DESKTOP-6JL3KKO' (gobject_id=1042)
Scope filter retained 3 of 49 objects for platform 'DESKTOP-6JL3KKO'
GetAttributesAsync returned 4206 attributes (extended=true)
Scope filter retained 386 of 4206 attributes
Address space built: 2 objects, 386 variables, 386 tag references, 0 alarm tags, 2 runtime hosts

CLI browse ZB:
DEV → DevAppEngine, DevPlatform (only local platform subtree)
TestArea, TestArea2 → absent (filtered out)

CLI read DEV.ScanState:
Value: True, Status: 0x00000000 (Good)
```

**Galaxy scope comparison** (instance2, `"Scope": "Galaxy"`):

```
CLI browse ZB/DEV:
DevAppEngine, DevPlatform, TestArea, TestArea2 (full Galaxy)
```

**Galaxy scope restored** (instance1, set back to `"Scope": "Galaxy"`):

```
CLI browse ZB/DEV:
DevAppEngine, DevPlatform, TestArea, TestArea2 (full Galaxy restored)
```

**Redundancy baseline preserved** (both instances):

```
instance1 → Warm, ServiceLevel=200, urn:localhost:LmxOpcUa:instance1
instance2 → Warm, ServiceLevel=150, urn:localhost:LmxOpcUa:instance2
```

## Notes

The service deployment and restart succeeded. The live CLI checks confirm the endpoint is reachable and that the array node identifier has changed to the bracketless form. However, the array value on the live service still prints as blank even though the status is Good; if this environment is expected to have `MoveInPartNumbers` populated, the runtime data path needs follow-up investigation.