Joseph Doherty 9dccf8e72f deprecate(lmxproxy): move all LmxProxy code, tests, and docs to deprecated/
LmxProxy is no longer needed. Moved the entire lmxproxy/ workspace, DCL
adapter files, and related docs to deprecated/. Removed LmxProxy registration
from DataConnectionFactory, project reference from DCL, protocol option from
UI, and cleaned up all requirement docs.
2026-04-08 15:56:23 -04:00


LmxProxy v2 Rebuild — Deviations & Key Technical Decisions

Decisions made during implementation that differ from or extend the original plan.

1. Grpc.Tools downgraded to 2.68.1

Plan specified: Grpc.Tools 2.71.0.

Actual: 2.68.1.

Why: protoc.exe from 2.71.0 crashes with an access violation (exit code 0xC0000005) on windev (Windows 10, x64). The 2.68.1 version works reliably.

How to apply: If upgrading Grpc.Tools in the future, test protoc on windev first.
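
A minimal sketch of what the pin could look like, assuming the standard NuGet PackageReference format (the actual project file isn't shown here):

```xml
<!-- Pin Grpc.Tools to 2.68.1: protoc.exe from 2.71.0 crashes with 0xC0000005 on windev. -->
<!-- Re-test protoc on windev before lifting this pin. -->
<ItemGroup>
  <PackageReference Include="Grpc.Tools" Version="[2.68.1]" PrivateAssets="all" />
</ItemGroup>
```

The bracketed `[2.68.1]` is NuGet's exact-version syntax, which prevents silent floating upgrades from reintroducing the crash.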

2. STA threading — three iterations

Plan specified: Dedicated STA thread with BlockingCollection<Action> dispatch queue and Application.DoEvents() message pump.

Iteration 1 (failed): StaDispatchThread with BlockingCollection.Take() + Application.DoEvents(). Failed because Take() blocked the STA thread, preventing the message pump from running. COM callbacks never fired.

Iteration 2 (partial): Replaced with Task.Run on the thread pool (MTA). OnDataChange worked (MxAccess fires it on its own threads), but OnWriteComplete never fired (it needs message-pump-based marshaling). Writes used fire-and-forget as a workaround.

Iteration 3 (current): StaComThread with a Win32 GetMessage/DispatchMessage loop. Work is dispatched via PostThreadMessage(WM_APP), which wakes the message pump. COM callbacks (OnDataChange, OnWriteComplete) are delivered between work items via DispatchMessage. All COM objects are created and called on this single STA thread.

How to apply: All MxAccess COM calls must go through _staThread.RunAsync(). Never call COM objects directly from thread pool threads. See docs/sta_gap.md for the full design analysis.
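
The key property of iteration 3 can be illustrated with a Python analogy (not the real C#/Win32 code; class and method names here are invented). One dedicated thread services both posted work items and pending callbacks, and the loop never blocks in a way that starves callback delivery — the flaw that sank the BlockingCollection.Take() design:

```python
import queue
import threading

class StaLikeThread:
    """Analogy for StaComThread: a single dedicated thread runs both posted
    work items and 'COM callbacks' (modeled as a second inbox). In the real
    design, PostThreadMessage(WM_APP) wakes a GetMessage/DispatchMessage loop,
    which also delivers COM callbacks between work items."""

    def __init__(self):
        self._work = queue.Queue()       # posted work (PostThreadMessage analogue)
        self._callbacks = queue.Queue()  # pending callbacks (message-pump deliveries)
        self._done = threading.Event()
        self._thread = threading.Thread(target=self._pump, daemon=True)
        self._thread.start()

    def run_async(self, fn):
        """All 'COM' calls are funneled through here (cf. _staThread.RunAsync)."""
        result = queue.Queue(maxsize=1)
        self._work.put(lambda: result.put(fn()))
        return result  # caller blocks on .get() for the result

    def post_callback(self, fn):
        self._callbacks.put(fn)

    def _pump(self):
        while not self._done.is_set():
            # Deliver pending callbacks *between* work items -- this is the
            # step that a blocking BlockingCollection.Take() starved out.
            while not self._callbacks.empty():
                self._callbacks.get()()
            try:
                item = self._work.get(timeout=0.05)  # short wait keeps the pump live
            except queue.Empty:
                continue
            item()

    def stop(self):
        self._done.set()
        self._thread.join()
```

A caller dispatches work with `t.run_async(lambda: com_call()).get()`; callbacks posted from other threads are guaranteed to be serviced on the same thread between work items, which is what single-threaded-apartment COM requires.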

3. TypedValue property-level _setCase tracking

Plan specified: GetValueCase() heuristic checking non-default values (e.g., if (BoolValue) return BoolValue).

Actual: Each property setter records _setCase = TypedValueCase.XxxValue, and GetValueCase() returns _setCase directly.

Why: protobuf-net code-first has no native oneof support. The heuristic approach can't distinguish "field not set" from "field set to default value" (e.g., BoolValue = false, DoubleValue = 0.0, Int32Value = 0). Since protobuf-net calls property setters during deserialization, tracking in the setter correctly identifies which field was deserialized.

How to apply: Always use GetValueCase() to determine which TypedValue field is set; never check for non-default values directly.
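
A Python sketch of the setter-tracking idea (the real type is a C# class with protobuf-net attributes; names here are illustrative). Because deserialization goes through the setters, recording the case in the setter distinguishes "set to the default value" from "never set", which the non-default-value heuristic cannot:

```python
from enum import Enum, auto

class TypedValueCase(Enum):
    NONE = auto()
    BOOL_VALUE = auto()
    DOUBLE_VALUE = auto()

class TypedValue:
    """Each setter records which field was assigned; get_value_case() just
    returns that record instead of guessing from non-default values."""

    def __init__(self):
        self._set_case = TypedValueCase.NONE
        self._bool_value = False
        self._double_value = 0.0

    @property
    def bool_value(self):
        return self._bool_value

    @bool_value.setter
    def bool_value(self, v):
        self._bool_value = v
        self._set_case = TypedValueCase.BOOL_VALUE  # recorded even when v is False

    @property
    def double_value(self):
        return self._double_value

    @double_value.setter
    def double_value(self, v):
        self._double_value = v
        self._set_case = TypedValueCase.DOUBLE_VALUE

    def get_value_case(self):
        # Deserialization calls the setters, so this reliably reports the
        # field that was actually present on the wire.
        return self._set_case
```

Note that `tv.bool_value = False` yields BOOL_VALUE, while a fresh instance yields NONE — exactly the distinction the heuristic loses.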

4. API key sent via HTTP header (DelegatingHandler)

Plan specified: API key sent in the ConnectRequest.ApiKey field (request body).

Actual: API key sent as an x-api-key HTTP header on every gRPC request via ApiKeyDelegatingHandler, in addition to the request body.

Why: The Host's ApiKeyInterceptor validates the x-api-key gRPC metadata header before any RPC handler executes. protobuf-net.Grpc's CreateGrpcService<T>() doesn't expose per-call metadata, so the header must be added at the HTTP transport level. A DelegatingHandler wrapping the SocketsHttpHandler adds it to all outgoing requests.

How to apply: GrpcChannelFactory.CreateChannel() accepts an optional apiKey parameter. The LmxProxyClient passes it during channel creation in ConnectAsync.
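
The handler's job is simple enough to show in a Python analogy (the real code is a .NET DelegatingHandler wrapping SocketsHttpHandler; `request` here is a plain dict, and all names are illustrative): wrap the inner send and stamp the header on every outgoing request, so no call site can forget it.

```python
class ApiKeyHandler:
    """Analogy for ApiKeyDelegatingHandler: wraps an inner send function and
    adds the x-api-key header to every outgoing request at the transport
    level, below the gRPC client that can't set per-call metadata."""

    def __init__(self, inner_send, api_key):
        self._inner_send = inner_send
        self._api_key = api_key

    def send(self, request):
        headers = dict(request.get("headers", {}))
        headers["x-api-key"] = self._api_key  # validated by the interceptor before any RPC handler runs
        return self._inner_send({**request, "headers": headers})
```

Usage mirrors channel creation: build the handler once with the key, then route every request through it.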

5. v2 test deployment on port 50100

Plan specified: Port 50052 for the v2 test deployment.

Actual: Port 50100.

Why: Ports 50049-50060 are used by MxAccess internal COM connections (established TCP pairs between the COM client and server). Port 50052 was occupied by an ephemeral MxAccess connection from the v1 service.

How to apply: When deploying alongside v1, use ports above 50100 to avoid the MxAccess ephemeral port range.

6. CheckApiKey validates request body key

Plan specified: Not explicitly defined; the interceptor validates the header key.

Actual: The CheckApiKey RPC validates the key from the request body (request.ApiKey) against ApiKeyService, not the header key.

Why: The x-api-key header always carries the caller's valid key (for interceptor auth). The CheckApiKey RPC is designed for clients to test whether a different key is valid, so it must check the body key independently.

How to apply: ScadaGrpcService receives ApiKeyService as an optional constructor parameter.
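
The two-key flow can be sketched in a few lines of Python (names are illustrative, not the real API): the interceptor step gates on the header key, then the RPC reports on the body key, which may be a different key the caller wants to test.

```python
def handle_check_api_key(headers, body_api_key, valid_keys):
    """Sketch of the CheckApiKey flow: the caller authenticates with its own
    (valid) key in the x-api-key header, and asks whether some *other* key --
    the one in the request body -- is valid."""
    # Interceptor analogue: reject unauthenticated callers before the handler runs.
    if headers.get("x-api-key") not in valid_keys:
        raise PermissionError("missing or invalid x-api-key header")
    # RPC analogue: validate the body key independently of the header key.
    return {"is_valid": body_api_key in valid_keys}
```

This is why checking the header key inside the RPC would be useless: it is always valid by the time the handler runs.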

7. OnWriteComplete callback — resolved via STA message pump

Plan specified: Wait for the OnWriteComplete COM callback to confirm write success.

History: Initially implemented as fire-and-forget because OnWriteComplete never fired; the Host had no Windows message pump to deliver the COM callback. See docs/sta_gap.md for the full analysis.

Resolution: StaComThread (a dedicated STA thread with a Win32 GetMessage/DispatchMessage loop) was introduced, providing a proper message pump. All COM operations are now dispatched to this thread via PostThreadMessage(WM_APP). The message pump delivers OnWriteComplete callbacks between work items.

Current behavior: Write dispatches _lmxProxy.Write() on the STA thread, registers a TaskCompletionSource in _pendingWrites, then awaits the callback with a timeout. OnWriteComplete resolves or rejects the TCS with MxStatusMapper error details. If the callback doesn't arrive within the write timeout, the write falls back to success (a fire-and-forget safety net). Cleanup (UnAdvise + RemoveItem) happens on the STA thread after the callback or timeout.

How to apply: Writes now get real confirmation from MxAccess. Secured write (1012) and verified write (1013) rejections are surfaced as exceptions via OnWriteComplete. The timeout fallback ensures writes don't hang if the callback is delayed.
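
The register/await/fallback handshake can be sketched with asyncio futures standing in for TaskCompletionSource (a Python analogy; all names are illustrative, not the real C# API):

```python
import asyncio

class WriteCoordinator:
    """Sketch of the Write/OnWriteComplete handshake: register a future keyed
    by item handle, dispatch the COM write, await the callback with a timeout,
    and fall back to success if the callback never arrives (the documented
    safety net so writes can't hang)."""

    def __init__(self, timeout=5.0):
        self._pending_writes = {}  # handle -> future (TaskCompletionSource analogue)
        self._timeout = timeout

    async def write(self, handle, dispatch_com_write):
        fut = asyncio.get_running_loop().create_future()
        self._pending_writes[handle] = fut
        dispatch_com_write()  # in the real code: dispatched to the STA thread
        try:
            return await asyncio.wait_for(fut, self._timeout)
        except asyncio.TimeoutError:
            return "assumed-success"  # timeout fallback: don't block the caller forever
        finally:
            # Cleanup (cf. UnAdvise + RemoveItem) happens after callback or timeout.
            self._pending_writes.pop(handle, None)

    def on_write_complete(self, handle, ok, error=None):
        fut = self._pending_writes.get(handle)
        if fut is None or fut.done():
            return
        if ok:
            fut.set_result("confirmed")
        else:
            # e.g. secured (1012) / verified (1013) write rejections surface here
            fut.set_exception(RuntimeError(error))
```

A rejected write completes the future with an exception, which the awaiting caller observes as a failed write rather than a silent success.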

8. SubscriptionManager must create MxAccess COM subscriptions

Plan specified: SubscriptionManager manages per-client channels and routes updates from MxAccess.

Actual: SubscriptionManager must also call IScadaClient.SubscribeAsync() to create the underlying COM subscriptions when a tag is first subscribed, and dispose them when the last client unsubscribes.

Why: The Phase 2 implementation tracked client-to-tag routing in internal dictionaries but never called MxAccessClient.SubscribeAsync() to create the actual MxAccess COM subscriptions (AddItem + AdviseSupervisory). Without the COM subscription, OnDataChange never fired and no updates were delivered to clients. This caused the Subscribe_ReceivesUpdates integration test to receive 0 updates over 30 seconds.

How to apply: SubscriptionManager.SubscribeAsync() collects newly seen tags (those without an existing TagSubscription) and awaits _scadaClient.SubscribeAsync() for them, passing OnTagValueChanged as the callback. The await ensures the COM subscription is fully established before the channel reader is returned; this prevents a race where the initial OnDataChange (the first value delivery after AdviseSupervisory) fires before the gRPC stream handler starts reading. Previously this was fire-and-forget (_ = CreateMxAccessSubscriptionsAsync()), causing intermittent Subscribe_ReceivesUpdates test failures (0 updates in 30s).
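
The ref-counting and awaited-creation logic can be sketched in Python (an analogy for the C# SubscriptionManager; `scada_subscribe`/`scada_unsubscribe` stand in for _scadaClient.SubscribeAsync and UnsubscribeByAddressAsync, and all names are illustrative):

```python
import asyncio

class SubscriptionManagerSketch:
    """Sketch of the fix: ref-count tags across clients and *await* creation
    of the underlying subscription for newly seen tags before returning, so
    the first data callback can't race the stream reader."""

    def __init__(self, scada_subscribe, scada_unsubscribe):
        self._scada_subscribe = scada_subscribe
        self._scada_unsubscribe = scada_unsubscribe
        self._refcounts = {}  # tag address -> number of subscribed clients

    async def subscribe(self, addresses, on_tag_value_changed):
        new_tags = [a for a in addresses if self._refcounts.get(a, 0) == 0]
        for a in addresses:
            self._refcounts[a] = self._refcounts.get(a, 0) + 1
        if new_tags:
            # Awaited, not fire-and-forget: the underlying subscription must
            # be fully established before the caller starts reading updates.
            await self._scada_subscribe(new_tags, on_tag_value_changed)

    async def unsubscribe(self, addresses):
        dead = []
        for a in addresses:
            self._refcounts[a] -= 1
            if self._refcounts[a] == 0:
                del self._refcounts[a]
                dead.append(a)
        if dead:
            # Address-based (cf. UnsubscribeByAddressAsync): resolves to the
            # current handles regardless of reconnect history.
            await self._scada_unsubscribe(dead)
```

The manager only routes and ref-counts; handle lifetime stays inside the SCADA client, matching the ownership split described under Gap 2.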


Known Gaps

Gap 1: No active connection health probing

Status: Resolved (2026-03-22, commit a6c01d7).

Problem: MxAccessClient.IsConnected checks _connectionState == Connected && _connectionHandle > 0. When the AVEVA platform (aaBootstrap) is killed or restarted, the MxAccess COM object and handle remain valid in memory — IsConnected stays true. The auto-reconnect monitor loop (MonitorConnectionAsync) only triggers when IsConnected is false, so it never attempts reconnection.

Observed behavior (tested 2026-03-22): After killing the aaBootstrap process, all reads returned null values with Bad quality indefinitely. The monitor loop kept seeing IsConnected == true and never reconnected.

Fix implemented: The monitor loop now actively probes the connection using ProbeConnectionAsync, which reads a configurable test tag and classifies the result as Healthy, TransportFailure, or DataDegraded.

  • TransportFailure for N consecutive probes (default 3) → forced disconnect + full reconnect (new COM object, Register, RecreateStoredSubscriptionsAsync)
  • DataDegraded → stay connected, back off probe interval to 30s, report degraded status (platform objects may be stopped)
  • Healthy → reset counters, resume normal interval
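
The classification rules above amount to a small state machine, sketched here in Python (the normal probe interval is an assumption, as the doc only states the degraded one; names are illustrative, not the real C# API):

```python
class ProbeMonitor:
    """Sketch of the probe-result handling loop: 3 consecutive transport
    failures force a reconnect, DataDegraded backs the probe interval off to
    30s and reports degraded status, Healthy resets everything."""

    NORMAL_INTERVAL_MS = 5000      # assumed; the doc doesn't state the normal interval
    DEGRADED_INTERVAL_MS = 30000   # DegradedProbeIntervalMs default
    MAX_TRANSPORT_FAILURES = 3     # MaxConsecutiveTransportFailures default

    def __init__(self, force_reconnect):
        self._force_reconnect = force_reconnect  # new COM object + Register + recreate subscriptions
        self._failures = 0
        self.interval_ms = self.NORMAL_INTERVAL_MS
        self.status = "Healthy"

    def on_probe_result(self, result):
        if result == "TransportFailure":
            self._failures += 1
            if self._failures >= self.MAX_TRANSPORT_FAILURES:
                self._failures = 0
                self._force_reconnect()
        elif result == "DataDegraded":
            self._failures = 0
            self.status = "Degraded"  # platform objects may be stopped
            self.interval_ms = self.DEGRADED_INTERVAL_MS
        else:  # Healthy
            self._failures = 0
            self.status = "Healthy"
            self.interval_ms = self.NORMAL_INTERVAL_MS
```

Counting only *consecutive* transport failures keeps one transient probe timeout from tearing down a healthy connection.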

Verified (tested 2026-03-22): Graceful platform stop via SMC → 4 failed probes → automatic reconnect → reads restored within ~60 seconds. All 17 integration tests pass after recovery. Subscribed clients receive Bad_NotConnected quality during outage, then Good quality resumes automatically.

Configuration (appsettings.json, HealthCheck section):

  • TestTagAddress: Tag to probe (default TestChildObject.TestBool)
  • ProbeTimeoutMs: Probe read timeout (default 5000ms)
  • MaxConsecutiveTransportFailures: Failures before forced reconnect (default 3)
  • DegradedProbeIntervalMs: Probe interval in degraded mode (default 30000ms)

Gap 2: Stale SubscriptionManager handles after reconnect

Status: Resolved (2026-03-22, commit a6c01d7).

Problem: SubscriptionManager stored IAsyncDisposable handles from _scadaClient.SubscribeAsync() in _mxAccessHandles. After a reconnect, MxAccessClient.RecreateStoredSubscriptionsAsync() recreated COM subscriptions internally but SubscriptionManager._mxAccessHandles still held stale handles. Additionally, a batch subscription stored the same handle for every address — disposing one address would dispose the entire batch.

Fix implemented: Removed _mxAccessHandles entirely. SubscriptionManager no longer tracks COM subscription handles. Ownership is cleanly split:

  • SubscriptionManager owns client routing and ref-counting only
  • MxAccessClient owns COM subscription lifecycle via _storedSubscriptions and _addressToHandle
  • Unsubscribe uses _scadaClient.UnsubscribeByAddressAsync(addresses) — address-based, resolves to current handles regardless of reconnect history

Gap 3: AVEVA objects don't auto-start after platform crash

Status: Documented. Platform behavior, not an LmxProxy issue.

Observed behavior (tested 2026-03-22): After killing aaBootstrap, the service auto-restarted (via Windows SCM recovery or Watchdog) within seconds. However, the ArchestrA objects (TestChildObject) did not automatically start. MxAccess connected successfully (Register() returned a valid handle) but all tag reads returned null values with Bad quality for 40+ minutes. Objects only recovered after manual restart via the System Management Console (SMC).

Implication for LmxProxy: Even with Gap 1 fixed (active probing + reconnect), reads will still return Bad quality until the platform objects are running. LmxProxy cannot fix this — it's a platform-level recovery issue. The health check should report this clearly: "MxAccess connected but tag quality is Bad — platform objects may need manual restart."

Timeline: aaBootstrap restart from SMC (graceful) takes ~5 minutes for objects to come back. aaBootstrap kill (crash) requires manual object restart via SMC — objects do not auto-recover.