scadalink-design/lmxproxy/docs/deviations.md
Joseph Doherty ec21a9a2a0 docs(lmxproxy): mark gap 1 and gap 2 as resolved with test verification
Gap 1: Active health probing verified — 60s recovery after platform restart.
Gap 2: Address-based subscription cleanup — no stale handles.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 07:10:38 -04:00

LmxProxy v2 Rebuild — Deviations & Key Technical Decisions

Decisions made during implementation that differ from or extend the original plan.

1. Grpc.Tools downgraded to 2.68.1

Plan specified: Grpc.Tools 2.71.0

Actual: 2.68.1

Why: protoc.exe from 2.71.0 crashes with an access violation (exit code 0xC0000005) on windev (Windows 10, x64). Version 2.68.1 works reliably.

How to apply: If upgrading Grpc.Tools in the future, test protoc on windev first.

2. STA Dispatch Thread replaced with Task.Run

Plan specified: Dedicated STA thread with a BlockingCollection<Action> dispatch queue and an Application.DoEvents() message pump for all COM operations.

Actual: Task.Run on the thread pool (MTA) for all COM operations, matching the v1 pattern.

Why: Pumping messages with Application.DoEvents() between work items was insufficient. After a COM call such as AdviseSupervisory was dispatched, the thread blocked waiting for the next work item, and COM event callbacks (OnDataChange, OnWriteComplete) never fired because no message pump was active during the wait. MxAccess works from MTA threads because COM marshaling handles cross-apartment calls, and events fire on their own threads.

How to apply: Do not reintroduce STA threading for MxAccess. The System.Windows.Forms reference was removed from the Host csproj.

3. TypedValue property-level _setCase tracking

Plan specified: GetValueCase() heuristic checking non-default values (e.g., if (BoolValue) return BoolValue).

Actual: Each property setter records _setCase = TypedValueCase.XxxValue, and GetValueCase() returns _setCase directly.

Why: protobuf-net code-first has no native oneof support. The heuristic approach can't distinguish "field not set" from "field set to default value" (e.g., BoolValue = false, DoubleValue = 0.0, Int32Value = 0). Since protobuf-net calls property setters during deserialization, tracking in the setter correctly identifies which field was deserialized.

How to apply: Always use GetValueCase() to determine which TypedValue field is set; never check for non-default values directly.
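The difference between the two approaches can be shown with a minimal, language-neutral sketch. It is written in Python for brevity and is not the Host's C# code; the class names HeuristicTypedValue and TrackedTypedValue are hypothetical, and only two fields are modeled.

```python
from enum import Enum


class TypedValueCase(Enum):
    NONE = 0
    BOOL_VALUE = 1
    INT32_VALUE = 2


class HeuristicTypedValue:
    """Planned approach (hypothetical model): infer the case from non-default values."""

    def __init__(self):
        self.bool_value = False
        self.int32_value = 0

    def get_value_case(self):
        if self.bool_value:
            return TypedValueCase.BOOL_VALUE
        if self.int32_value != 0:
            return TypedValueCase.INT32_VALUE
        return TypedValueCase.NONE


class TrackedTypedValue:
    """Actual approach (hypothetical model): each setter records the case it belongs to."""

    def __init__(self):
        self._bool_value = False
        self._int32_value = 0
        self._set_case = TypedValueCase.NONE

    @property
    def bool_value(self):
        return self._bool_value

    @bool_value.setter
    def bool_value(self, value):
        self._bool_value = value
        self._set_case = TypedValueCase.BOOL_VALUE

    @property
    def int32_value(self):
        return self._int32_value

    @int32_value.setter
    def int32_value(self, value):
        self._int32_value = value
        self._set_case = TypedValueCase.INT32_VALUE

    def get_value_case(self):
        return self._set_case


# Assigning a default value (False) is invisible to the heuristic
# but is recorded by the setter-tracking variant.
heuristic = HeuristicTypedValue()
heuristic.bool_value = False

tracked = TrackedTypedValue()
tracked.bool_value = False
```

Here heuristic.get_value_case() reports no field set, while tracked.get_value_case() reports the bool field, which is the behavior the deserializer relies on.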

4. API key sent via HTTP header (DelegatingHandler)

Plan specified: API key sent in ConnectRequest.ApiKey field (request body).

Actual: API key sent as x-api-key HTTP header on every gRPC request via ApiKeyDelegatingHandler, in addition to the request body.

Why: The Host's ApiKeyInterceptor validates the x-api-key gRPC metadata header before any RPC handler executes. protobuf-net.Grpc's CreateGrpcService<T>() doesn't expose per-call metadata, so the header must be added at the HTTP transport level. A DelegatingHandler wrapping the SocketsHttpHandler adds it to all outgoing requests.

How to apply: GrpcChannelFactory.CreateChannel() accepts an optional apiKey parameter. The LmxProxyClient passes it during channel creation in ConnectAsync.
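The transport-level idea can be sketched without any gRPC machinery. The following is a Python illustration of the delegating-handler pattern only, not the client's C# code; RecordingTransport, the dictionary request shape, and the request paths are hypothetical stand-ins.

```python
class RecordingTransport:
    """Hypothetical stand-in for the innermost HTTP handler; records what it is asked to send."""

    def __init__(self):
        self.sent = []

    def send(self, request):
        self.sent.append(request)
        return {"status": 200}


class ApiKeyHandler:
    """Wraps an inner handler and stamps x-api-key on every outgoing request."""

    def __init__(self, inner, api_key):
        self._inner = inner
        self._api_key = api_key

    def send(self, request):
        request["headers"]["x-api-key"] = self._api_key
        return self._inner.send(request)


def create_channel(transport, api_key=None):
    # Mirrors the optional apiKey parameter: without a key, the transport is used unwrapped.
    return ApiKeyHandler(transport, api_key) if api_key else transport


transport = RecordingTransport()
channel = create_channel(transport, api_key="example-key")

# Callers never touch headers; the wrapper adds the key to each call.
channel.send({"path": "/example/Read", "headers": {}})
channel.send({"path": "/example/Write", "headers": {}})
```

Because the key is attached below the service proxy, it reaches the interceptor on every RPC even though the generated service interface has no per-call metadata parameter.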

5. v2 test deployment on port 50100

Plan specified: Port 50052 for v2 test deployment.

Actual: Port 50100.

Why: Ports 50049-50060 are used by MxAccess internal COM connections (established TCP pairs between the COM client and server). Port 50052 was occupied by an ephemeral MxAccess connection from the v1 service.

How to apply: When deploying alongside v1, use port 50100 or higher to avoid the MxAccess ephemeral port range.

6. CheckApiKey validates request body key

Plan specified: Not explicitly defined; the interceptor validates the header key.

Actual: CheckApiKey RPC validates the key from the request body (request.ApiKey) against ApiKeyService, not the header key.

Why: The x-api-key header always carries the caller's valid key (for interceptor auth). The CheckApiKey RPC is designed for clients to test whether a different key is valid, so it must check the body key independently.

How to apply: ScadaGrpcService receives ApiKeyService as an optional constructor parameter.

7. Write uses fire-and-forget (OnWriteComplete callback not delivered)

Plan specified: Wait for OnWriteComplete COM callback to confirm write success.

Actual: Write is confirmed by _lmxProxy.Write() returning without throwing. The OnWriteComplete callback is kept wired for diagnostic logging but never awaited.

Why: The MxAccess documentation (Write() Method, p. 47) explicitly states: "Upon completion of the write, your program receives notification of the success/failure status through the OnWriteComplete() event" and "that item should not be taken off advise or removed from the internal tables until the OnWriteComplete() event is received." So OnWriteComplete should fire; the issue is COM event delivery, not MxAccess behavior.

The MxAccess sample applications are all Windows Forms apps with a UI message loop (Application.Run()). COM event callbacks are delivered via the Windows message pump. Our v2 Host runs as a headless Topshelf Windows service with no message loop. Write() is called from a thread pool thread (Task.Run), and the OnWriteComplete callback needs to be marshaled back to the calling apartment, which can't happen without a message pump.

OnDataChange works because MxAccess fires it proactively on its own internal thread whenever data changes. OnWriteComplete is a response to a specific Write() call and appears to require message-pump-based marshaling to deliver.

Risk: For simple supervisory writes, fire-and-forget is safe: if Write() returns without a COM exception, MxAccess accepted the write. However, for secured writes (error 1012) or verified writes (error 1013), OnWriteComplete is the only way to learn that the write was rejected and must be retried with WriteSecured(). If secured/verified writes are ever needed, this must be revisited, either by running a message pump on a dedicated thread or by using a polling-based confirmation.

How to apply: Do not await OnWriteComplete for write confirmation. The Write() COM call succeeding (not throwing a COM exception) is the confirmation. Cleanup (UnAdvise + RemoveItem) happens immediately after the write in a finally block. Keep OnWriteComplete wired: if COM threading is ever fixed (e.g., a dedicated STA thread with a proper message pump), the callback could be re-enabled.

8. SubscriptionManager must create MxAccess COM subscriptions

Plan specified: SubscriptionManager manages per-client channels and routes updates from MxAccess.

Actual: SubscriptionManager must also call IScadaClient.SubscribeAsync() to create the underlying COM subscriptions when a tag is first subscribed, and dispose them when the last client unsubscribes.

Why: The Phase 2 implementation tracked client-to-tag routing in internal dictionaries but never called MxAccessClient.SubscribeAsync() to create the actual MxAccess COM subscriptions (AddItem + AdviseSupervisory). Without the COM subscription, OnDataChange never fired and no updates were delivered to clients. This caused the Subscribe_ReceivesUpdates integration test to receive 0 updates over 30 seconds.

How to apply: SubscriptionManager.Subscribe() collects newly seen tags (those without an existing TagSubscription) and calls _scadaClient.SubscribeAsync() for them, passing OnTagValueChanged as the callback. The returned IAsyncDisposable handles are tracked in _mxAccessHandles per address and disposed in UnsubscribeClient() when the last client for a tag leaves.


Known Gaps

Gap 1: No active connection health probing

Status: Resolved (2026-03-22, commit a6c01d7).

Problem: MxAccessClient.IsConnected checks _connectionState == Connected && _connectionHandle > 0. When the AVEVA platform (aaBootstrap) is killed or restarted, the MxAccess COM object and handle remain valid in memory — IsConnected stays true. The auto-reconnect monitor loop (MonitorConnectionAsync) only triggers when IsConnected is false, so it never attempts reconnection.

Observed behavior (tested 2026-03-22): After killing the aaBootstrap process, all reads returned null values with Bad quality indefinitely. The monitor loop kept seeing IsConnected == true and never reconnected.

Fix implemented: The monitor loop now actively probes the connection using ProbeConnectionAsync, which reads a configurable test tag and classifies the result as Healthy, TransportFailure, or DataDegraded.

  • TransportFailure for N consecutive probes (default 3) → forced disconnect + full reconnect (new COM object, Register, RecreateStoredSubscriptionsAsync)
  • DataDegraded → stay connected, back off probe interval to 30s, report degraded status (platform objects may be stopped)
  • Healthy → reset counters, resume normal interval
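The three rules above amount to a small state machine. The sketch below models it in Python as an illustration only; it is not the MonitorConnectionAsync code. ProbePolicy and the action strings are hypothetical names, the normal probe interval is not stated in this document so it is passed in as a placeholder, and the choice to reset the failure streak on DataDegraded is an assumption drawn from the word "consecutive".

```python
from enum import Enum


class ProbeResult(Enum):
    HEALTHY = "Healthy"
    TRANSPORT_FAILURE = "TransportFailure"
    DATA_DEGRADED = "DataDegraded"


class ProbePolicy:
    """Hypothetical model: maps each probe result to (action, next probe interval in ms)."""

    def __init__(self, normal_interval_ms, max_transport_failures=3,
                 degraded_interval_ms=30000):
        self.normal_interval_ms = normal_interval_ms
        self.max_transport_failures = max_transport_failures
        self.degraded_interval_ms = degraded_interval_ms
        self.consecutive_failures = 0

    def on_probe(self, result):
        if result is ProbeResult.TRANSPORT_FAILURE:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_transport_failures:
                # Threshold reached: force a disconnect and full reconnect.
                self.consecutive_failures = 0
                return ("reconnect", self.normal_interval_ms)
            return ("wait", self.normal_interval_ms)

        # Assumption: any non-transport result breaks the failure streak.
        self.consecutive_failures = 0
        if result is ProbeResult.DATA_DEGRADED:
            # Stay connected, but probe less often and report degraded status.
            return ("report_degraded", self.degraded_interval_ms)
        return ("ok", self.normal_interval_ms)


# Placeholder normal interval; the real value is not given in this document.
policy = ProbePolicy(normal_interval_ms=5000)
sequence = [
    ProbeResult.TRANSPORT_FAILURE,
    ProbeResult.TRANSPORT_FAILURE,
    ProbeResult.TRANSPORT_FAILURE,
    ProbeResult.HEALTHY,
]
actions = [policy.on_probe(result)[0] for result in sequence]
```

With the default threshold of 3, the third consecutive transport failure yields the reconnect action, and the following healthy probe returns to the normal interval.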

Verified (tested 2026-03-22): Graceful platform stop via SMC → 4 failed probes → automatic reconnect → reads restored within ~60 seconds. All 17 integration tests pass after recovery. Subscribed clients receive Bad_NotConnected quality during outage, then Good quality resumes automatically.

Configuration (appsettings.json, HealthCheck section):

  • TestTagAddress: Tag to probe (default TestChildObject.TestBool)
  • ProbeTimeoutMs: Probe read timeout (default 5000ms)
  • MaxConsecutiveTransportFailures: Failures before forced reconnect (default 3)
  • DegradedProbeIntervalMs: Probe interval in degraded mode (default 30000ms)
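As a fragment, the settings above with their listed defaults might look like the following. The key names and values come from the list; placing HealthCheck at the top level of appsettings.json is an assumption, and the real file may nest it under another section.

```json
{
  "HealthCheck": {
    "TestTagAddress": "TestChildObject.TestBool",
    "ProbeTimeoutMs": 5000,
    "MaxConsecutiveTransportFailures": 3,
    "DegradedProbeIntervalMs": 30000
  }
}
```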

Gap 2: Stale SubscriptionManager handles after reconnect

Status: Resolved (2026-03-22, commit a6c01d7).

Problem: SubscriptionManager stored IAsyncDisposable handles from _scadaClient.SubscribeAsync() in _mxAccessHandles. After a reconnect, MxAccessClient.RecreateStoredSubscriptionsAsync() recreated COM subscriptions internally but SubscriptionManager._mxAccessHandles still held stale handles. Additionally, a batch subscription stored the same handle for every address — disposing one address would dispose the entire batch.

Fix implemented: Removed _mxAccessHandles entirely. SubscriptionManager no longer tracks COM subscription handles. Ownership is cleanly split:

  • SubscriptionManager owns client routing and ref-counting only
  • MxAccessClient owns COM subscription lifecycle via _storedSubscriptions and _addressToHandle
  • Unsubscribe uses _scadaClient.UnsubscribeByAddressAsync(addresses) — address-based, resolves to current handles regardless of reconnect history
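The ownership split can be sketched as follows. This is a synchronous Python model for illustration, not the C# implementation: FakeScadaClient is a hypothetical stand-in for the MxAccess-backed client, and the method names are Python-style approximations of SubscribeAsync and UnsubscribeByAddressAsync.

```python
class FakeScadaClient:
    """Hypothetical stand-in: owns the 'COM' subscriptions, keyed by address."""

    def __init__(self):
        self.active = set()

    def subscribe(self, addresses):
        self.active.update(addresses)

    def unsubscribe_by_address(self, addresses):
        # Address-based: no handle from a previous connection is needed.
        self.active.difference_update(addresses)


class SubscriptionManager:
    """Owns client routing and ref-counting only; holds no subscription handles."""

    def __init__(self, scada_client):
        self._scada = scada_client
        self._clients_by_tag = {}  # address -> set of client ids

    def subscribe(self, client_id, addresses):
        # Only tags nobody is subscribed to yet need a new underlying subscription.
        new_tags = [a for a in addresses if a not in self._clients_by_tag]
        for address in addresses:
            self._clients_by_tag.setdefault(address, set()).add(client_id)
        if new_tags:
            self._scada.subscribe(new_tags)

    def unsubscribe_client(self, client_id):
        orphaned = []
        for address, clients in list(self._clients_by_tag.items()):
            clients.discard(client_id)
            if not clients:
                # Last client left: the tag no longer needs a subscription.
                del self._clients_by_tag[address]
                orphaned.append(address)
        if orphaned:
            self._scada.unsubscribe_by_address(orphaned)


scada = FakeScadaClient()
manager = SubscriptionManager(scada)
manager.subscribe("client-a", ["Tag.A", "Tag.B"])
manager.subscribe("client-b", ["Tag.B"])

manager.unsubscribe_client("client-a")
after_first = set(scada.active)   # Tag.B is still held by client-b

manager.unsubscribe_client("client-b")
after_second = set(scada.active)  # nothing left subscribed
```

Because the manager passes only addresses, a reconnect that recreates subscriptions inside the client does not leave the manager holding anything stale, and removing one address cannot dispose a whole batch.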

Gap 3: AVEVA objects don't auto-start after platform crash

Status: Documented. Platform behavior, not an LmxProxy issue.

Observed behavior (tested 2026-03-22): After killing aaBootstrap, the service auto-restarted (via Windows SCM recovery or Watchdog) within seconds. However, the ArchestrA objects (TestChildObject) did not automatically start. MxAccess connected successfully (Register() returned a valid handle) but all tag reads returned null values with Bad quality for 40+ minutes. Objects only recovered after manual restart via the System Management Console (SMC).

Implication for LmxProxy: Even with Gap 1 fixed (active probing + reconnect), reads will still return Bad quality until the platform objects are running. LmxProxy cannot fix this — it's a platform-level recovery issue. The health check should report this clearly: "MxAccess connected but tag quality is Bad — platform objects may need manual restart."

Timeline: aaBootstrap restart from SMC (graceful) takes ~5 minutes for objects to come back. aaBootstrap kill (crash) requires manual object restart via SMC — objects do not auto-recover.