docs(lmxproxy): mark gap 1 and gap 2 as resolved with test verification
Gap 1: Active health probing verified — 60s recovery after platform restart.
Gap 2: Address-based subscription cleanup — no stale handles.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -65,34 +65,35 @@ Decisions made during implementation that differ from or extend the original plan
## Gap 1: No active connection health probing

**Status**: Resolved (2026-03-22, commit `a6c01d7`).

**Problem**: `MxAccessClient.IsConnected` checks `_connectionState == Connected && _connectionHandle > 0`. When the AVEVA platform (aaBootstrap) is killed or restarted, the MxAccess COM object and handle remain valid in memory — `IsConnected` stays `true`. The auto-reconnect monitor loop (`MonitorConnectionAsync`) only triggers when `IsConnected` is `false`, so it never attempts reconnection.
**Observed behavior** (tested 2026-03-22): After killing the aaBootstrap process, all reads returned null values with Bad quality indefinitely. The monitor loop kept seeing `IsConnected == true` and never reconnected.
**Impact**: After any platform disruption (AppEngine restart, aaBootstrap crash, platform redeploy), LmxProxy returns Bad quality on all reads/writes until the v2 service is manually restarted AND the platform objects are manually restarted. There is no automatic recovery.

**Fix implemented**: The monitor loop now actively probes the connection using `ProbeConnectionAsync`, which reads a configurable test tag and classifies the result as `Healthy`, `TransportFailure`, or `DataDegraded`.

- `TransportFailure` for N consecutive probes (default 3) → forced disconnect + full reconnect (new COM object, `Register`, `RecreateStoredSubscriptionsAsync`)
- `DataDegraded` → stay connected, back off probe interval to 30s, report degraded status (platform objects may be stopped)
- `Healthy` → reset counters, resume normal interval
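The probe-handling policy above can be sketched as a small state machine. This is an illustrative Python sketch, not the actual C# implementation: the names `ProbeResult`/`ProbeMonitor`, the returned action strings, and the counter-reset-on-reconnect detail are assumptions; only the three classifications and the thresholds come from the description above.

```python
from enum import Enum, auto

class ProbeResult(Enum):
    HEALTHY = auto()
    TRANSPORT_FAILURE = auto()   # probe read threw or timed out: connection itself is bad
    DATA_DEGRADED = auto()       # read succeeded but returned null value / Bad quality

class ProbeMonitor:
    """Sketch of the monitor-loop policy: N consecutive transport failures
    force a full reconnect, degraded data backs off the probe interval,
    a healthy probe resets everything."""

    def __init__(self, max_transport_failures=3,
                 normal_interval_ms=5000, degraded_interval_ms=30000):
        self.max_transport_failures = max_transport_failures
        self.normal_interval_ms = normal_interval_ms
        self.degraded_interval_ms = degraded_interval_ms
        self.consecutive_transport_failures = 0
        self.interval_ms = normal_interval_ms

    def on_probe(self, result):
        """Classify one probe result and return the action the loop should take."""
        if result is ProbeResult.TRANSPORT_FAILURE:
            self.consecutive_transport_failures += 1
            if self.consecutive_transport_failures >= self.max_transport_failures:
                self.consecutive_transport_failures = 0  # assumed reset after forcing reconnect
                return "reconnect"   # forced disconnect + new COM object + re-register
            return "retry"
        if result is ProbeResult.DATA_DEGRADED:
            # Stay connected; platform objects may be stopped. Back off probing.
            self.consecutive_transport_failures = 0
            self.interval_ms = self.degraded_interval_ms
            return "report_degraded"
        # HEALTHY: reset counters and resume the normal probe cadence.
        self.consecutive_transport_failures = 0
        self.interval_ms = self.normal_interval_ms
        return "ok"
```

The key design point is that only transport failures count toward the reconnect threshold; degraded data alone never tears down a working connection.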
**Verified** (tested 2026-03-22): Graceful platform stop via SMC → 4 failed probes → automatic reconnect → reads restored within ~60 seconds. All 17 integration tests pass after recovery. Subscribed clients receive `Bad_NotConnected` quality during outage, then Good quality resumes automatically.
**Configuration** (`appsettings.json` → `HealthCheck` section):
- `TestTagAddress`: Tag to probe (default `TestChildObject.TestBool`)
- `ProbeTimeoutMs`: Probe read timeout (default 5000ms)
- `MaxConsecutiveTransportFailures`: Failures before forced reconnect (default 3)
- `DegradedProbeIntervalMs`: Probe interval in degraded mode (default 30000ms)
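Assuming a standard `appsettings.json` layout, the section would look roughly like this (defaults shown; illustrative sketch, not copied from the repository):

```json
{
  "HealthCheck": {
    "TestTagAddress": "TestChildObject.TestBool",
    "ProbeTimeoutMs": 5000,
    "MaxConsecutiveTransportFailures": 3,
    "DegradedProbeIntervalMs": 30000
  }
}
```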
## Gap 2: Stale SubscriptionManager handles after reconnect
**Status**: Resolved (2026-03-22, commit `a6c01d7`).

**Problem**: `SubscriptionManager` stored `IAsyncDisposable` handles from `_scadaClient.SubscribeAsync()` in `_mxAccessHandles`. After a reconnect, `MxAccessClient.RecreateStoredSubscriptionsAsync()` recreated COM subscriptions internally but `SubscriptionManager._mxAccessHandles` still held stale handles. Additionally, a batch subscription stored the same handle for every address — disposing one address would dispose the entire batch.
**Impact**: When a client unsubscribed after a reconnect, `SubscriptionManager.UnsubscribeClient()` tried to dispose the stale handle, which called `MxAccessClient.UnsubscribeAsync()` with addresses that could map to different item handles in the new connection. The unsubscribe could fail silently or target the wrong handles.
**Fix implemented**: Removed `_mxAccessHandles` entirely. `SubscriptionManager` no longer tracks COM subscription handles. Ownership is cleanly split:
- `SubscriptionManager` owns client routing and ref-counting only
- `MxAccessClient` owns COM subscription lifecycle via `_storedSubscriptions` and `_addressToHandle`
- Unsubscribe uses `_scadaClient.UnsubscribeByAddressAsync(addresses)` — address-based, resolves to current handles regardless of reconnect history
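The ownership split can be illustrated with a minimal sketch. This is Python for illustration only; class names, the in-memory handle table, and the tag addresses are all hypothetical. The point it demonstrates is from the design above: the manager holds only ref-counts per address, the client resolves addresses to its *current* handles, so unsubscribes stay correct across any number of reconnects.

```python
class MxAccessClientSketch:
    """Stand-in for MxAccessClient: owns the COM subscription lifecycle.
    The address→handle table is rebuilt on every (re)connect, so resolving
    by address always yields the current handle."""

    def __init__(self):
        self._address_to_handle = {}
        self._next_handle = 1

    def subscribe(self, addresses):
        for a in addresses:
            self._address_to_handle[a] = self._next_handle
            self._next_handle += 1

    def reconnect(self):
        # Simulate recreating stored subscriptions: same addresses, new handles.
        for a in list(self._address_to_handle):
            self._address_to_handle[a] = self._next_handle
            self._next_handle += 1

    def unsubscribe_by_address(self, addresses):
        # Address-based: correct regardless of how many reconnects happened.
        for a in addresses:
            self._address_to_handle.pop(a, None)

class SubscriptionManagerSketch:
    """Owns client routing and ref-counting only; holds no COM handles."""

    def __init__(self, client):
        self._client = client
        self._refcount = {}   # address -> number of subscribed clients
        self._clients = {}    # client_id -> set of addresses

    def subscribe(self, client_id, addresses):
        new = [a for a in addresses if self._refcount.get(a, 0) == 0]
        for a in addresses:
            self._refcount[a] = self._refcount.get(a, 0) + 1
            self._clients.setdefault(client_id, set()).add(a)
        if new:
            self._client.subscribe(new)

    def unsubscribe_client(self, client_id):
        dead = []
        for a in self._clients.pop(client_id, set()):
            self._refcount[a] -= 1
            if self._refcount[a] == 0:
                del self._refcount[a]
                dead.append(a)
        if dead:
            self._client.unsubscribe_by_address(dead)
```

Because no handle ever crosses the manager/client boundary, a reconnect invalidates nothing on the manager side, and a single-address unsubscribe can never tear down a shared batch.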
## Gap 3: AVEVA objects don't auto-start after platform crash