
Joseph Doherty 86a15c0a65 docs(lmxproxy): document reconnection gaps from platform restart testing

Tested aaBootstrap kill on windev — three gaps identified:
1. No active health probing (IsConnected stays true on dead connection)
2. Stale SubscriptionManager handles after reconnect cycle
3. AVEVA objects don't auto-start after platform crash (platform behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-22 06:19:30 -04:00


LmxProxy v2 Rebuild — Deviations & Key Technical Decisions

Decisions made during implementation that differ from or extend the original plan.

1. Grpc.Tools downgraded to 2.68.1

Plan specified: Grpc.Tools 2.71.0

Actual: 2.68.1

Why: protoc.exe from 2.71.0 crashes with an access violation (exit code 0xC0000005) on windev (Windows 10, x64). Version 2.68.1 works reliably.

How to apply: If upgrading Grpc.Tools in the future, test protoc on windev first.

2. STA Dispatch Thread replaced with Task.Run

Plan specified: Dedicated STA thread with a BlockingCollection<Action> dispatch queue and an Application.DoEvents() message pump for all COM operations.

Actual: Task.Run on the thread pool (MTA) for all COM operations, matching the v1 pattern.

Why: Pumping messages with Application.DoEvents() between work items was insufficient. After a COM call such as AdviseSupervisory was dispatched, the thread blocked waiting for the next work item, and COM event callbacks (OnDataChange, OnWriteComplete) never fired because no message pump was running during the wait. MxAccess works from MTA threads because COM marshaling handles cross-apartment calls, and events fire on their own threads (OnWriteComplete is the exception; see decision 7).

How to apply: Do not reintroduce STA threading for MxAccess. The System.Windows.Forms reference was removed from the Host csproj.

3. TypedValue property-level `_setCase` tracking

Plan specified: GetValueCase() heuristic checking non-default values (e.g., if (BoolValue) return BoolValue).

Actual: Each property setter records _setCase = TypedValueCase.XxxValue, and GetValueCase() returns _setCase directly.

Why: protobuf-net code-first has no native oneof support. The heuristic approach can't distinguish "field not set" from "field set to default value" (e.g., BoolValue = false, DoubleValue = 0.0, Int32Value = 0). Since protobuf-net calls property setters during deserialization, tracking in the setter correctly identifies which field was deserialized.

How to apply: Always use GetValueCase() to determine which TypedValue field is set; never check for non-default values directly.
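The setter-tracking idea can be sketched as follows. This is a trimmed-down illustration, not the actual class: the real TypedValue presumably carries protobuf-net attributes and more fields, both omitted here, and the enum members shown are assumptions based on the property names mentioned above.

```csharp
using System;

// Hypothetical subset of the value cases; the real enum likely has more members.
public enum TypedValueCase { None, BoolValue, Int32Value, DoubleValue }

// Simplified stand-in for TypedValue (protobuf-net attributes omitted).
public sealed class TypedValue
{
    private TypedValueCase _setCase = TypedValueCase.None;
    private bool _boolValue;
    private int _int32Value;
    private double _doubleValue;

    // Each setter records which field was assigned, even when the value
    // equals the type's default (false, 0, 0.0).
    public bool BoolValue
    {
        get => _boolValue;
        set { _boolValue = value; _setCase = TypedValueCase.BoolValue; }
    }

    public int Int32Value
    {
        get => _int32Value;
        set { _int32Value = value; _setCase = TypedValueCase.Int32Value; }
    }

    public double DoubleValue
    {
        get => _doubleValue;
        set { _doubleValue = value; _setCase = TypedValueCase.DoubleValue; }
    }

    // Returns the recorded case instead of guessing from non-default values.
    public TypedValueCase GetValueCase() => _setCase;
}
```

With this shape, new TypedValue { BoolValue = false } reports BoolValue rather than None, which is the case the heuristic could not handle.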

4. API key sent via HTTP header (DelegatingHandler)

Plan specified: API key sent in ConnectRequest.ApiKey field (request body).

Actual: API key sent as x-api-key HTTP header on every gRPC request via ApiKeyDelegatingHandler, in addition to the request body.

Why: The Host's ApiKeyInterceptor validates the x-api-key gRPC metadata header before any RPC handler executes. protobuf-net.Grpc's CreateGrpcService<T>() doesn't expose per-call metadata, so the header must be added at the HTTP transport level. A DelegatingHandler wrapping the SocketsHttpHandler adds it to all outgoing requests.

How to apply: GrpcChannelFactory.CreateChannel() accepts an optional apiKey parameter. LmxProxyClient passes it during channel creation in ConnectAsync.
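A minimal sketch of such a handler is shown below. The class name and header name come from the text above; the constructor shape and the choice to overwrite any existing header are assumptions, and the actual implementation may differ.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch: adds the x-api-key header to every outgoing HTTP request,
// which gRPC surfaces to the server as request metadata.
public sealed class ApiKeyDelegatingHandler : DelegatingHandler
{
    private const string HeaderName = "x-api-key";
    private readonly string _apiKey;

    // Constructor shape is an assumption for illustration.
    public ApiKeyDelegatingHandler(string apiKey, HttpMessageHandler innerHandler)
        : base(innerHandler)
    {
        _apiKey = apiKey ?? throw new ArgumentNullException(nameof(apiKey));
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Replace any existing value so exactly one key is sent.
        request.Headers.Remove(HeaderName);
        request.Headers.TryAddWithoutValidation(HeaderName, _apiKey);
        return base.SendAsync(request, cancellationToken);
    }
}
```

The handler would wrap the transport handler, for example new ApiKeyDelegatingHandler(apiKey, new SocketsHttpHandler()), and the result would be supplied to the channel when it is created.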

5. v2 test deployment on port 50100

Plan specified: Port 50052 for v2 test deployment.

Actual: Port 50100.

Why: Ports 50049–50060 are used by MxAccess internal COM connections (established TCP pairs between the COM client and server). Port 50052 was occupied by an ephemeral MxAccess connection from the v1 service.

How to apply: When deploying alongside v1, use port 50100 or higher to stay clear of the ephemeral ports MxAccess uses.

6. CheckApiKey validates request body key

Plan specified: Not explicitly defined; the interceptor validates the header key.

Actual: CheckApiKey RPC validates the key from the request body (request.ApiKey) against ApiKeyService, not the header key.

Why: The x-api-key header always carries the caller's valid key (for interceptor auth). The CheckApiKey RPC is designed for clients to test whether a different key is valid, so it must check the body key independently.

How to apply: ScadaGrpcService receives ApiKeyService as an optional constructor parameter.

7. Write uses fire-and-forget (OnWriteComplete callback not delivered)

Plan specified: Wait for OnWriteComplete COM callback to confirm write success.

Actual: Write is confirmed by _lmxProxy.Write() returning without throwing. The OnWriteComplete callback is kept wired for diagnostic logging but never awaited.

Why: The MxAccess documentation (Write() Method, p.47) explicitly states: "Upon completion of the write, your program receives notification of the success/failure status through the OnWriteComplete() event" and "that item should not be taken off advise or removed from the internal tables until the OnWriteComplete() event is received." So OnWriteComplete should fire; the issue is COM event delivery, not MxAccess behavior.

The MxAccess sample applications are all Windows Forms apps with a UI message loop (Application.Run()). COM event callbacks are delivered via the Windows message pump. Our v2 Host runs as a headless Topshelf Windows service with no message loop. Write() is called from a thread pool thread (Task.Run), and the OnWriteComplete callback needs to be marshaled back to the calling apartment, which can't happen without a message pump. OnDataChange works because MxAccess fires it proactively on its own internal thread whenever data changes. OnWriteComplete is a response to a specific Write() call and appears to require message-pump-based marshaling to deliver.

Risk: For simple supervisory writes, fire-and-forget is safe: if Write() returns without a COM exception, MxAccess accepted the write. However, for secured writes (error 1012) or verified writes (error 1013), OnWriteComplete is the only way to learn that the write was rejected and must be retried with WriteSecured(). If secured/verified writes are ever needed, this must be revisited, either by running a message pump on a dedicated thread or by using a polling-based confirmation.

How to apply: Do not await OnWriteComplete for write confirmation. The Write() COM call succeeding (not throwing a COM exception) is the confirmation. Cleanup (UnAdvise + RemoveItem) happens immediately after the write in a finally block. Keep OnWriteComplete wired; if COM threading is ever fixed (e.g., dedicated STA thread with proper message pump), the callback could be re-enabled.
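The write path described above might look roughly like the sketch below. IMxProxy is a hypothetical stand-in for the MxAccess COM proxy; its method signatures, and the order of the AddItem and AdviseSupervisory calls before the write, are assumptions for illustration rather than the real interop surface.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the MxAccess COM proxy (signatures assumed).
public interface IMxProxy
{
    int AddItem(int connectionHandle, string address);
    void AdviseSupervisory(int connectionHandle, int itemHandle);
    void Write(int connectionHandle, int itemHandle, object value, int userId);
    void UnAdvise(int connectionHandle, int itemHandle);
    void RemoveItem(int connectionHandle, int itemHandle);
}

public static class WriteSketch
{
    // Runs the COM calls on a thread pool (MTA) thread. The returned task
    // completing without an exception is treated as write confirmation;
    // OnWriteComplete is never awaited.
    public static Task WriteAsync(
        IMxProxy proxy, int connectionHandle, string address, object value, int userId)
    {
        return Task.Run(() =>
        {
            int itemHandle = proxy.AddItem(connectionHandle, address);
            try
            {
                proxy.AdviseSupervisory(connectionHandle, itemHandle);
                proxy.Write(connectionHandle, itemHandle, value, userId);
            }
            finally
            {
                // Clean up immediately instead of waiting for the callback.
                proxy.UnAdvise(connectionHandle, itemHandle);
                proxy.RemoveItem(connectionHandle, itemHandle);
            }
        });
    }
}
```

A COM exception thrown by Write() propagates through the task to the caller, while cleanup still runs in the finally block.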

8. SubscriptionManager must create MxAccess COM subscriptions

Plan specified: SubscriptionManager manages per-client channels and routes updates from MxAccess.

Actual: SubscriptionManager must also call IScadaClient.SubscribeAsync() to create the underlying COM subscriptions when a tag is first subscribed, and dispose them when the last client unsubscribes.

Why: The Phase 2 implementation tracked client-to-tag routing in internal dictionaries but never called MxAccessClient.SubscribeAsync() to create the actual MxAccess COM subscriptions (AddItem + AdviseSupervisory). Without the COM subscription, OnDataChange never fired and no updates were delivered to clients. This caused the Subscribe_ReceivesUpdates integration test to receive 0 updates over 30 seconds.

How to apply: SubscriptionManager.Subscribe() collects newly seen tags (those without an existing TagSubscription) and calls _scadaClient.SubscribeAsync() for them, passing OnTagValueChanged as the callback. The returned IAsyncDisposable handles are tracked in _mxAccessHandles per address and disposed in UnsubscribeClient() when the last client for a tag leaves.
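The first-subscriber and last-unsubscriber bookkeeping could be sketched as below. The interface and class names carry a Sketch suffix to mark them as hypothetical; the SubscribeAsync signature, the callback shape, and the per-tag client sets are assumptions, and the sketch ignores races between concurrent subscribe and unsubscribe calls.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical reduction of IScadaClient to the one call used here.
public interface IScadaClientSketch
{
    Task<IAsyncDisposable> SubscribeAsync(
        IEnumerable<string> addresses, Action<string, object> onChange);
}

public sealed class SubscriptionManagerSketch
{
    private readonly IScadaClientSketch _scadaClient;
    private readonly object _lock = new object();
    private readonly Dictionary<string, HashSet<string>> _clientsByTag =
        new Dictionary<string, HashSet<string>>();
    private readonly Dictionary<string, IAsyncDisposable> _mxAccessHandles =
        new Dictionary<string, IAsyncDisposable>();

    public SubscriptionManagerSketch(IScadaClientSketch scadaClient)
    {
        _scadaClient = scadaClient;
    }

    public async Task SubscribeAsync(string clientId, IEnumerable<string> addresses)
    {
        var newTags = new List<string>();
        lock (_lock)
        {
            foreach (var address in addresses)
            {
                if (!_clientsByTag.TryGetValue(address, out var clients))
                {
                    clients = new HashSet<string>();
                    _clientsByTag[address] = clients;
                    newTags.Add(address); // first subscriber: COM subscription needed
                }
                clients.Add(clientId);
            }
        }

        // Create the underlying subscriptions outside the lock.
        foreach (var address in newTags)
        {
            var handle = await _scadaClient.SubscribeAsync(new[] { address }, OnTagValueChanged);
            lock (_lock) { _mxAccessHandles[address] = handle; }
        }
    }

    public async Task UnsubscribeClientAsync(string clientId)
    {
        var toDispose = new List<IAsyncDisposable>();
        lock (_lock)
        {
            var emptied = new List<string>();
            foreach (var pair in _clientsByTag)
            {
                if (pair.Value.Remove(clientId) && pair.Value.Count == 0)
                    emptied.Add(pair.Key); // last client for this tag left
            }
            foreach (var address in emptied)
            {
                _clientsByTag.Remove(address);
                if (_mxAccessHandles.TryGetValue(address, out var handle))
                {
                    _mxAccessHandles.Remove(address);
                    toDispose.Add(handle);
                }
            }
        }
        foreach (var handle in toDispose)
            await handle.DisposeAsync();
    }

    // Placeholder: the real manager routes the update to client channels.
    private void OnTagValueChanged(string address, object value) { }
}
```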

Known Gaps

Gap 1: No active connection health probing

Status: Open. Requires implementation.

Problem: MxAccessClient.IsConnected checks _connectionState == Connected && _connectionHandle > 0. When the AVEVA platform (aaBootstrap) is killed or restarted, the MxAccess COM object and handle remain valid in memory — IsConnected stays true. The auto-reconnect monitor loop (MonitorConnectionAsync) only triggers when IsConnected is false, so it never attempts reconnection.

Observed behavior (tested 2026-03-22): After killing the aaBootstrap process, all reads returned null values with Bad quality indefinitely. The monitor loop kept seeing IsConnected == true and never reconnected. Even restarting the v2 service didn't help until the platform objects were manually restarted via the System Management Console.

Impact: After any platform disruption (AppEngine restart, aaBootstrap crash, platform redeploy), LmxProxy returns Bad quality on all reads/writes until the v2 service is manually restarted AND the platform objects are manually restarted. There is no automatic recovery.

Proposed fix: The monitor loop should actively probe the connection by reading a test tag (e.g., TestChildObject.TestBool or a configurable health tag). If the read returns null value or Bad quality for N consecutive probes, the monitor should:

1. Transition to the Disconnected or Error state so that IsConnected becomes false
2. Tear down the stale COM object (Unregister, ReleaseComObject)
3. Attempt full reconnect (ConnectAsync → creates new COM object → Register → RecreateStoredSubscriptionsAsync)

This matches the DetailedHealthCheckService pattern that already reads a test tag — the same logic should be embedded in the monitor loop.

Configuration: Reuse the existing HealthCheck.TestTagAddress setting in appsettings.json (currently used only by DetailedHealthCheckService) as the probe tag for the monitor loop. Add HealthCheck.MaxConsecutiveFailures (default 3), the number of consecutive Bad probes before triggering reconnect.
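The consecutive-failure rule can be sketched as a small counter that the monitor loop would consult after each probe. ProbeFailureTracker is a hypothetical helper name, and resetting the counter once a reconnect is triggered is an assumption about the desired behavior.

```csharp
using System;

// Hypothetical helper: decides when N consecutive Bad probes warrant a reconnect.
public sealed class ProbeFailureTracker
{
    private readonly int _maxConsecutiveFailures;
    private int _consecutiveFailures;

    public ProbeFailureTracker(int maxConsecutiveFailures)
    {
        if (maxConsecutiveFailures < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConsecutiveFailures));
        _maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Records one probe result. Returns true when the failure threshold
    // has just been reached and a reconnect should be attempted.
    public bool Record(bool probeSucceeded)
    {
        if (probeSucceeded)
        {
            _consecutiveFailures = 0; // any good probe resets the streak
            return false;
        }

        _consecutiveFailures++;
        if (_consecutiveFailures < _maxConsecutiveFailures)
            return false;

        _consecutiveFailures = 0; // start counting afresh after triggering
        return true;
    }
}
```

In the monitor loop, a probe would count as failed when the test tag read returns a null value or Bad quality; when Record returns true, the loop would run the three reconnect steps listed above.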

Gap 2: Stale SubscriptionManager handles after reconnect

Status: Open. Minor — fails silently during cleanup.

Problem: When SubscriptionManager creates MxAccess subscriptions via _scadaClient.SubscribeAsync(), it stores IAsyncDisposable handles in _mxAccessHandles. After a platform disconnect/reconnect cycle, MxAccessClient.RecreateStoredSubscriptionsAsync() recreates COM subscriptions from _storedSubscriptions, but SubscriptionManager._mxAccessHandles still holds the old (now-invalid) handles.

Impact: When a client unsubscribes after a reconnect, SubscriptionManager.UnsubscribeClient() tries to dispose the stale handle, which calls MxAccessClient.UnsubscribeAsync() with addresses that may have different item handles in the new connection. The unsubscribe may fail silently or target wrong handles.

Proposed fix: Either:

(a) Have SubscriptionManager listen for ConnectionStateChanged events and clear _mxAccessHandles on disconnect (the recreated subscriptions from RecreateStoredSubscriptionsAsync don't produce new SubscriptionManager handles), or
(b) Have MxAccessClient notify SubscriptionManager after reconnect so it can re-register its handles.
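Option (a) might be sketched as follows. The ConnectionState enum members, the event handler signature, and the decision to drop stale handles without disposing them (since they refer to a connection that no longer exists) are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical connection states; the real enum may differ.
public enum ConnectionState { Disconnected, Connecting, Connected, Error }

// Sketch of the handle bookkeeping that SubscriptionManager would own.
public sealed class MxAccessHandleRegistrySketch
{
    private readonly object _lock = new object();
    private readonly Dictionary<string, IAsyncDisposable> _mxAccessHandles =
        new Dictionary<string, IAsyncDisposable>();

    public int Count
    {
        get { lock (_lock) { return _mxAccessHandles.Count; } }
    }

    public void Track(string address, IAsyncDisposable handle)
    {
        lock (_lock) { _mxAccessHandles[address] = handle; }
    }

    // Intended to be attached to a ConnectionStateChanged event
    // (handler signature assumed).
    public void OnConnectionStateChanged(object sender, ConnectionState newState)
    {
        if (newState == ConnectionState.Connected || newState == ConnectionState.Connecting)
            return;

        lock (_lock)
        {
            // Handles belong to the dead connection; forget them so a later
            // unsubscribe cannot target item handles from the new connection.
            _mxAccessHandles.Clear();
        }
    }
}
```

After a reconnect, an unsubscribe for a tag would then find no handle to dispose, so cleanup of the recreated COM subscription would have to be handled by MxAccessClient itself.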

Gap 3: AVEVA objects don't auto-start after platform crash

Status: Documented. Platform behavior, not an LmxProxy issue.

Observed behavior (tested 2026-03-22): After killing aaBootstrap, the service auto-restarted (via Windows SCM recovery or Watchdog) within seconds. However, the ArchestrA objects (TestChildObject) did not automatically start. MxAccess connected successfully (Register() returned a valid handle) but all tag reads returned null values with Bad quality for 40+ minutes. Objects only recovered after manual restart via the System Management Console (SMC).

Implication for LmxProxy: Even with Gap 1 fixed (active probing + reconnect), reads will still return Bad quality until the platform objects are running. LmxProxy cannot fix this — it's a platform-level recovery issue. The health check should report this clearly: "MxAccess connected but tag quality is Bad — platform objects may need manual restart."

Timeline: After a graceful aaBootstrap restart from SMC, objects come back in ~5 minutes. After an aaBootstrap kill (crash), objects do not auto-recover and must be restarted manually via SMC.
