docs(lmxproxy): document reconnection gaps from platform restart testing

Tested aaBootstrap kill on windev. Three gaps identified:
1. No active health probing (IsConnected stays true on dead connection)
2. Stale SubscriptionManager handles after reconnect cycle
3. AVEVA objects don't auto-start after platform crash (platform behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 06:19:30 -04:00
parent 5a9574fb95
commit 86a15c0a65
1 changed files with 45 additions and 0 deletions
--- a/lmxproxy/docs/deviations.md
+++ b/lmxproxy/docs/deviations.md
@@ -58,3 +58,48 @@ Decisions made during implementation that differ from or extend the original pla
 **Actual**: SubscriptionManager must also call `IScadaClient.SubscribeAsync()` to create the underlying COM subscriptions when a tag is first subscribed, and dispose them when the last client unsubscribes.
 **Why**: The Phase 2 implementation tracked client-to-tag routing in internal dictionaries but never called `MxAccessClient.SubscribeAsync()` to create the actual MxAccess COM subscriptions (`AddItem` + `AdviseSupervisory`). Without the COM subscription, `OnDataChange` never fired and no updates were delivered to clients. This caused the `Subscribe_ReceivesUpdates` integration test to receive 0 updates over 30 seconds.
 **How to apply**: `SubscriptionManager.Subscribe()` collects newly-seen tags (those without an existing `TagSubscription`) and calls `_scadaClient.SubscribeAsync()` for them, passing `OnTagValueChanged` as the callback. The returned `IAsyncDisposable` handles are tracked in `_mxAccessHandles` per address and disposed in `UnsubscribeClient()` when the last client for a tag leaves.
+
+---
+
+# Known Gaps
+
+## Gap 1: No active connection health probing
+
+**Status**: Open. Requires implementation.
+
+**Problem**: `MxAccessClient.IsConnected` checks `_connectionState == Connected && _connectionHandle > 0`. When the AVEVA platform (aaBootstrap) is killed or restarted, the MxAccess COM object and handle remain valid in memory — `IsConnected` stays `true`. The auto-reconnect monitor loop (`MonitorConnectionAsync`) only triggers when `IsConnected` is `false`, so it never attempts reconnection.
+
+**Observed behavior** (tested 2026-03-22): After killing the aaBootstrap process, all reads returned null values with Bad quality indefinitely. The monitor loop kept seeing `IsConnected == true` and never reconnected. Even restarting the v2 service didn't help until the platform objects were manually restarted via the System Management Console.
+
+**Impact**: After any platform disruption (AppEngine restart, aaBootstrap crash, platform redeploy), LmxProxy returns Bad quality on all reads/writes until the v2 service is manually restarted AND the platform objects are manually restarted. There is no automatic recovery.
+
+**Proposed fix**: The monitor loop should actively probe the connection by reading a test tag (e.g., `TestChildObject.TestBool` or a configurable health tag). If the read returns null value or Bad quality for N consecutive probes, the monitor should:
+1. Transition `_connectionState` to `Disconnected` or `Error` so that `IsConnected` returns `false`
+2. Tear down the stale COM object (`Unregister`, `ReleaseComObject`)
+3. Attempt full reconnect (`ConnectAsync` → creates new COM object → `Register` → `RecreateStoredSubscriptionsAsync`)
+
+This matches the `DetailedHealthCheckService` pattern that already reads a test tag — the same logic should be embedded in the monitor loop.
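+
+A minimal sketch of the consecutive-failure counting described above, assuming each probe of the test tag can be reduced to two booleans (a value was returned, and its quality was Good). `ProbeFailureTracker` and `RecordProbe` are hypothetical names, not part of the current code; the monitor loop would call `RecordProbe` after each read and run the teardown and reconnect steps when it returns `true`.
+
+```csharp
+using System;
+
+// Hypothetical helper: counts consecutive failed health probes.
+public sealed class ProbeFailureTracker
+{
+    private readonly int _maxConsecutiveFailures;
+    private int _consecutiveFailures;
+
+    public ProbeFailureTracker(int maxConsecutiveFailures)
+    {
+        if (maxConsecutiveFailures < 1)
+            throw new ArgumentOutOfRangeException(nameof(maxConsecutiveFailures));
+        _maxConsecutiveFailures = maxConsecutiveFailures;
+    }
+
+    // Returns true when the caller should tear down the COM object and reconnect.
+    public bool RecordProbe(bool hasValue, bool isGoodQuality)
+    {
+        if (hasValue && isGoodQuality)
+        {
+            // Any good probe resets the streak.
+            _consecutiveFailures = 0;
+            return false;
+        }
+
+        _consecutiveFailures++;
+        if (_consecutiveFailures < _maxConsecutiveFailures)
+            return false;
+
+        // Threshold reached: start a fresh count for the next connection.
+        _consecutiveFailures = 0;
+        return true;
+    }
+}
+```
+
+Resetting the counter when the threshold is reached means a reconnect that still yields Bad quality is retried only after another N failed probes, rather than on every probe.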
+
+**Configuration**: Reuse the existing `HealthCheck.TestTagAddress` setting in `appsettings.json` (currently used only by `DetailedHealthCheckService`) as the monitor loop's probe tag. Add `HealthCheck.MaxConsecutiveFailures` (default 3): the number of consecutive failed probes (null value or Bad quality) before triggering a reconnect.
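+
+As an illustration only, the two settings might look like this in `appsettings.json`, assuming the dotted names map to a nested `HealthCheck` section (the actual file layout is not shown here) and using the test tag named above as a placeholder value:
+
+```json
+{
+  "HealthCheck": {
+    "TestTagAddress": "TestChildObject.TestBool",
+    "MaxConsecutiveFailures": 3
+  }
+}
+```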
+
+## Gap 2: Stale SubscriptionManager handles after reconnect
+
+**Status**: Open. Minor: cleanup after a reconnect may fail silently or target the wrong handles.
+
+**Problem**: When `SubscriptionManager` creates MxAccess subscriptions via `_scadaClient.SubscribeAsync()`, it stores `IAsyncDisposable` handles in `_mxAccessHandles`. After a platform disconnect/reconnect cycle, `MxAccessClient.RecreateStoredSubscriptionsAsync()` recreates COM subscriptions from `_storedSubscriptions`, but `SubscriptionManager._mxAccessHandles` still holds the old (now-invalid) handles.
+
+**Impact**: When a client unsubscribes after a reconnect, `SubscriptionManager.UnsubscribeClient()` tries to dispose the stale handle, which calls `MxAccessClient.UnsubscribeAsync()` with addresses that may have different item handles in the new connection. The unsubscribe may fail silently or target wrong handles.
+
+**Proposed fix**: Either:
+- (a) Have `SubscriptionManager` listen for `ConnectionStateChanged` events and clear `_mxAccessHandles` on disconnect (the recreated subscriptions from `RecreateStoredSubscriptionsAsync` don't produce new SubscriptionManager handles), or
+- (b) Have `MxAccessClient` notify `SubscriptionManager` after reconnect so it can re-register its handles.
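+
+Option (a) could be sketched roughly as follows. This is not the existing `SubscriptionManager`: `MxAccessHandleRegistry`, the `ConnectionState` enum members, and the handler signature are assumptions made for illustration, and the real `ConnectionStateChanged` event may have a different shape.
+
+```csharp
+using System;
+using System.Collections.Concurrent;
+
+// Hypothetical stand-in for the client's connection state type.
+public enum ConnectionState { Disconnected, Connecting, Connected, Error }
+
+// Hypothetical reduction of SubscriptionManager to its handle bookkeeping.
+public sealed class MxAccessHandleRegistry
+{
+    private readonly ConcurrentDictionary<string, IAsyncDisposable> _mxAccessHandles =
+        new ConcurrentDictionary<string, IAsyncDisposable>();
+
+    public int Count => _mxAccessHandles.Count;
+
+    public void Track(string address, IAsyncDisposable handle) =>
+        _mxAccessHandles[address] = handle;
+
+    // Assumed handler shape; wire this to the client's state-change event.
+    public void OnConnectionStateChanged(object sender, ConnectionState newState)
+    {
+        if (newState == ConnectionState.Connected)
+            return;
+
+        // Drop the stale handles without disposing them: disposing would call
+        // UnsubscribeAsync against item handles from the dead connection.
+        _mxAccessHandles.Clear();
+    }
+}
+```
+
+This sketch only prevents stale disposals. It leaves open how a subscription recreated by `RecreateStoredSubscriptionsAsync` is removed once its last client unsubscribes, which option (b) would address more directly.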
+
+## Gap 3: AVEVA objects don't auto-start after platform crash
+
+**Status**: Documented. Platform behavior, not an LmxProxy issue.
+
+**Observed behavior** (tested 2026-03-22): After killing aaBootstrap, the service auto-restarted (via Windows SCM recovery or Watchdog) within seconds. However, the ArchestrA objects (TestChildObject) did not automatically start. MxAccess connected successfully (`Register()` returned a valid handle) but all tag reads returned null values with Bad quality for 40+ minutes. Objects only recovered after manual restart via the System Management Console (SMC).
+
+**Implication for LmxProxy**: Even with Gap 1 fixed (active probing + reconnect), reads will still return Bad quality until the platform objects are running. LmxProxy cannot fix this — it's a platform-level recovery issue. The health check should report this clearly: "MxAccess connected but tag quality is Bad — platform objects may need manual restart."
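+
+One way the health check could separate the two conditions, as a sketch only. `HealthMessages.Describe` is a hypothetical helper, and the Bad-quality message is adapted from the wording quoted above.
+
+```csharp
+// Hypothetical helper: maps connection and probe results to a status message.
+public static class HealthMessages
+{
+    public static string Describe(bool mxAccessConnected, bool probeQualityGood)
+    {
+        if (!mxAccessConnected)
+            return "MxAccess disconnected";
+
+        // Connected but Bad quality points at the platform, not LmxProxy.
+        if (!probeQualityGood)
+            return "MxAccess connected but tag quality is Bad; platform objects may need manual restart";
+
+        return "Healthy";
+    }
+}
+```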
+
+**Timeline**: After a graceful aaBootstrap restart from SMC, objects come back in ~5 minutes. After an aaBootstrap kill (crash), objects do not auto-recover and must be restarted manually via SMC.