scadalink-design/lmxproxy/docs/deviations.md

# LmxProxy v2 Rebuild — Deviations & Key Technical Decisions

Decisions made during implementation that differ from or extend the original plan.

## 1. Grpc.Tools downgraded to 2.68.1

**Plan specified**: Grpc.Tools 2.71.0
**Actual**: 2.68.1
**Why**: protoc.exe from 2.71.0 crashes with access violation (exit code 0xC0000005) on windev (Windows 10, x64). The 2.68.1 version works reliably.
**How to apply**: If upgrading Grpc.Tools in the future, test protoc on windev first.

## 2. STA threading — three iterations

**Plan specified**: Dedicated STA thread with `BlockingCollection<Action>` dispatch queue and `Application.DoEvents()` message pump.
**Iteration 1 (failed)**: `StaDispatchThread` with `BlockingCollection.Take()` + `Application.DoEvents()`. Failed because `Take()` blocked the STA thread, preventing the message pump from running. COM callbacks never fired.
**Iteration 2 (partial)**: Replaced with `Task.Run` on thread pool (MTA). `OnDataChange` worked (MxAccess fires it on its own threads), but `OnWriteComplete` never fired (needs message-pump-based marshaling). Writes used fire-and-forget as a workaround.
**Iteration 3 (current)**: `StaComThread` with Win32 `GetMessage`/`DispatchMessage` loop. Work dispatched via `PostThreadMessage(WM_APP)` which wakes the message pump. COM callbacks (`OnDataChange`, `OnWriteComplete`) are delivered between work items via `DispatchMessage`. All COM objects created and called on this single STA thread.
**How to apply**: All MxAccess COM calls must go through `_staThread.RunAsync()`. Never call COM objects directly from thread pool threads. See `docs/sta_gap.md` for the full design analysis.

## 3. TypedValue property-level `_setCase` tracking

**Plan specified**: `GetValueCase()` heuristic checking non-default values (e.g., `if (BoolValue) return BoolValue`).
**Actual**: Each property setter records `_setCase = TypedValueCase.XxxValue`, and `GetValueCase()` returns `_setCase` directly.
**Why**: protobuf-net code-first has no native `oneof` support. The heuristic approach can't distinguish "field not set" from "field set to default value" (e.g., `BoolValue = false`, `DoubleValue = 0.0`, `Int32Value = 0`). Since protobuf-net calls property setters during deserialization, tracking in the setter correctly identifies which field was deserialized.
**How to apply**: Always use `GetValueCase()` to determine which TypedValue field is set, never check for non-default values directly.

## 4. API key sent via HTTP header (DelegatingHandler)

**Plan specified**: API key sent in `ConnectRequest.ApiKey` field (request body).
**Actual**: API key sent as `x-api-key` HTTP header on every gRPC request via `ApiKeyDelegatingHandler`, in addition to the request body.
**Why**: The Host's `ApiKeyInterceptor` validates the `x-api-key` gRPC metadata header before any RPC handler executes. protobuf-net.Grpc's `CreateGrpcService<T>()` doesn't expose per-call metadata, so the header must be added at the HTTP transport level. A `DelegatingHandler` wrapping the `SocketsHttpHandler` adds it to all outgoing requests.
**How to apply**: The `GrpcChannelFactory.CreateChannel()` accepts an optional `apiKey` parameter. The `LmxProxyClient` passes it during channel creation in `ConnectAsync`.

## 5. v2 test deployment on port 50100

**Plan specified**: Port 50052 for v2 test deployment.
**Actual**: Port 50100.
**Why**: Ports 50049–50060 are used by MxAccess internal COM connections (established TCP pairs between the COM client and server). Port 50052 was occupied by an ephemeral MxAccess connection from the v1 service.
**How to apply**: When deploying alongside v1, use ports above 50100 to avoid MxAccess ephemeral port range.

## 6. CheckApiKey validates request body key

**Plan specified**: Not explicitly defined — the interceptor validates the header key.
**Actual**: `CheckApiKey` RPC validates the key from the *request body* (`request.ApiKey`) against `ApiKeyService`, not the header key.
**Why**: The `x-api-key` header always carries the caller's valid key (for interceptor auth). The `CheckApiKey` RPC is designed for clients to test whether a *different* key is valid, so it must check the body key independently.
**How to apply**: `ScadaGrpcService` receives `ApiKeyService` as an optional constructor parameter.

## 7. OnWriteComplete callback — resolved via STA message pump

**Plan specified**: Wait for `OnWriteComplete` COM callback to confirm write success.
**History**: Initially implemented as fire-and-forget because `OnWriteComplete` never fired — the Host had no Windows message pump to deliver the COM callback. See `docs/sta_gap.md` for the full analysis.
**Resolution**: `StaComThread` (a dedicated STA thread with a Win32 `GetMessage`/`DispatchMessage` loop) was introduced, providing a proper message pump. All COM operations are now dispatched to this thread via `PostThreadMessage(WM_APP)`. The message pump delivers `OnWriteComplete` callbacks between work items.
**Current behavior**: Write dispatches `_lmxProxy.Write()` on the STA thread, registers a `TaskCompletionSource` in `_pendingWrites`, then awaits the callback with a timeout. `OnWriteComplete` resolves or rejects the TCS with `MxStatusMapper` error details. If the callback doesn't arrive within the write timeout, falls back to success (fire-and-forget safety net). Clean up (UnAdvise + RemoveItem) happens on the STA thread after the callback or timeout.
**How to apply**: Writes now get real confirmation from MxAccess. Secured write (1012) and verified write (1013) rejections are surfaced as exceptions via `OnWriteComplete`. The timeout fallback ensures writes don't hang if the callback is delayed.

## 8. SubscriptionManager must create MxAccess COM subscriptions

**Plan specified**: SubscriptionManager manages per-client channels and routes updates from MxAccess.
**Actual**: SubscriptionManager must also call `IScadaClient.SubscribeAsync()` to create the underlying COM subscriptions when a tag is first subscribed, and dispose them when the last client unsubscribes.
**Why**: The Phase 2 implementation tracked client-to-tag routing in internal dictionaries but never called `MxAccessClient.SubscribeAsync()` to create the actual MxAccess COM subscriptions (`AddItem` + `AdviseSupervisory`). Without the COM subscription, `OnDataChange` never fired and no updates were delivered to clients. This caused the `Subscribe_ReceivesUpdates` integration test to receive 0 updates over 30 seconds.
**How to apply**: `SubscriptionManager.SubscribeAsync()` collects newly-seen tags (those without an existing `TagSubscription`) and **awaits** `_scadaClient.SubscribeAsync()` for them, passing `OnTagValueChanged` as the callback. The await ensures the COM subscription is fully established before the channel reader is returned — this prevents a race where the initial `OnDataChange` (first value delivery after `AdviseSupervisory`) fires before the gRPC stream handler starts reading. Previously this was fire-and-forget (`_ = CreateMxAccessSubscriptionsAsync()`), causing intermittent `Subscribe_ReceivesUpdates` test failures (0 updates in 30s).

---

# Known Gaps

## Gap 1: No active connection health probing

**Status**: Resolved (2026-03-22, commit `a6c01d7`).

**Problem**: `MxAccessClient.IsConnected` checks `_connectionState == Connected && _connectionHandle > 0`. When the AVEVA platform (aaBootstrap) is killed or restarted, the MxAccess COM object and handle remain valid in memory — `IsConnected` stays `true`. The auto-reconnect monitor loop (`MonitorConnectionAsync`) only triggers when `IsConnected` is `false`, so it never attempts reconnection.

**Observed behavior** (tested 2026-03-22): After killing the aaBootstrap process, all reads returned null values with Bad quality indefinitely. The monitor loop kept seeing `IsConnected == true` and never reconnected.

**Fix implemented**: The monitor loop now actively probes the connection using `ProbeConnectionAsync`, which reads a configurable test tag and classifies the result as `Healthy`, `TransportFailure`, or `DataDegraded`.
- `TransportFailure` for N consecutive probes (default 3) → forced disconnect + full reconnect (new COM object, `Register`, `RecreateStoredSubscriptionsAsync`)
- `DataDegraded` → stay connected, back off probe interval to 30s, report degraded status (platform objects may be stopped)
- `Healthy` → reset counters, resume normal interval

**Verified** (tested 2026-03-22): Graceful platform stop via SMC → 4 failed probes → automatic reconnect → reads restored within ~60 seconds. All 17 integration tests pass after recovery. Subscribed clients receive `Bad_NotConnected` quality during outage, then Good quality resumes automatically.

**Configuration** (`appsettings.json` → `HealthCheck` section):
- `TestTagAddress`: Tag to probe (default `TestChildObject.TestBool`)
- `ProbeTimeoutMs`: Probe read timeout (default 5000ms)
- `MaxConsecutiveTransportFailures`: Failures before forced reconnect (default 3)
- `DegradedProbeIntervalMs`: Probe interval in degraded mode (default 30000ms)

## Gap 2: Stale SubscriptionManager handles after reconnect

**Status**: Resolved (2026-03-22, commit `a6c01d7`).

**Problem**: `SubscriptionManager` stored `IAsyncDisposable` handles from `_scadaClient.SubscribeAsync()` in `_mxAccessHandles`. After a reconnect, `MxAccessClient.RecreateStoredSubscriptionsAsync()` recreated COM subscriptions internally but `SubscriptionManager._mxAccessHandles` still held stale handles. Additionally, a batch subscription stored the same handle for every address — disposing one address would dispose the entire batch.

**Fix implemented**: Removed `_mxAccessHandles` entirely. `SubscriptionManager` no longer tracks COM subscription handles. Ownership is cleanly split:
- `SubscriptionManager` owns client routing and ref-counting only
- `MxAccessClient` owns COM subscription lifecycle via `_storedSubscriptions` and `_addressToHandle`
- Unsubscribe uses `_scadaClient.UnsubscribeByAddressAsync(addresses)` — address-based, resolves to current handles regardless of reconnect history

## Gap 3: AVEVA objects don't auto-start after platform crash

**Status**: Documented. Platform behavior, not an LmxProxy issue.

**Observed behavior** (tested 2026-03-22): After killing aaBootstrap, the service auto-restarted (via Windows SCM recovery or Watchdog) within seconds. However, the ArchestrA objects (TestChildObject) did not automatically start. MxAccess connected successfully (`Register()` returned a valid handle) but all tag reads returned null values with Bad quality for 40+ minutes. Objects only recovered after manual restart via the System Management Console (SMC).

**Implication for LmxProxy**: Even with Gap 1 fixed (active probing + reconnect), reads will still return Bad quality until the platform objects are running. LmxProxy cannot fix this — it's a platform-level recovery issue. The health check should report this clearly: "MxAccess connected but tag quality is Bad — platform objects may need manual restart."

**Timeline**: aaBootstrap restart from SMC (graceful) takes ~5 minutes for objects to come back. aaBootstrap kill (crash) requires manual object restart via SMC — objects do not auto-recover.