scadalink-design/lmxproxy/docs/plans/2026-03-21-lmxproxy-v2-rebuild-design.md
Joseph Doherty 4303f06fc3 docs(lmxproxy): add v2 rebuild design, 7-phase implementation plans, and execution prompt
Design doc covers architecture, v2 protocol (TypedValue/QualityCode), COM threading
model, session lifecycle, subscription semantics, error model, and guardrails.
Implementation plans are detailed enough for autonomous Claude Code execution.
Verified all dev tooling on windev (Grpc.Tools, protobuf-net.Grpc, Polly v8, xUnit).
2026-03-21 23:29:42 -04:00

# LmxProxy v2 Rebuild — Design Document
**Date**: 2026-03-21
**Status**: Approved
**Scope**: Complete rebuild of LmxProxy Host and Client with v2 protocol
## 1. Overview
Rebuild the LmxProxy gRPC proxy service from scratch, implementing the v2 protocol (TypedValue + QualityCode) as defined in `docs/lmxproxy_updates.md`. The existing code in `src/` is retained as reference only. No backward compatibility with v1.
## 2. Key Design Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| gRPC server for Host | Grpc.Core (C-core) | Only option for .NET Framework 4.8 server-side |
| Service hosting | Topshelf | Proven, already deployed, simple install/uninstall |
| Protocol version | v2 only, clean break | Small controlled client count, no value in v1 compat |
| Shared code between projects | None — fully independent | Different .NET runtimes (.NET Fx 4.8 vs .NET 10), wire compat is the contract |
| Client retry library | Polly v8+ | Building fresh on .NET 10, modern API |
| Testing strategy | Unit tests during implementation, integration tests after Client functional | Phased approach, real hardware validation on windev |
## 3. Architecture
### 3.1 Host (.NET Framework 4.8, x86)
```
Program.cs (Topshelf entry point)
└── LmxProxyService (lifecycle manager)
    ├── Configuration (appsettings.json binding + validation)
    ├── MxAccessClient (COM interop, STA dispatch thread)
    │   ├── Connection state machine
    │   ├── Read/Write with semaphore concurrency
    │   ├── Subscription storage for reconnect replay
    │   └── Auto-reconnect loop (5s interval)
    ├── SessionManager (ConcurrentDictionary, 5-min inactivity scavenging)
    ├── SubscriptionManager (per-client channels, shared MxAccess subscriptions)
    ├── ApiKeyService (JSON file, FileSystemWatcher hot-reload)
    ├── ScadaGrpcService (proto-generated, all 10 RPCs)
    │   └── ApiKeyInterceptor (x-api-key header enforcement)
    ├── PerformanceMetrics (per-op tracking, p95, 60s log)
    ├── HealthCheckService (basic + detailed with test tag)
    └── StatusWebServer (HTML dashboard, JSON status, health endpoint)
```
### 3.2 Client (.NET 10, AnyCPU)
```
ILmxProxyClient (public interface)
└── LmxProxyClient (partial class)
    ├── Connection (GrpcChannel, protobuf-net.Grpc, 30s keep-alive)
    ├── Read/Write/Subscribe operations
    ├── CodeFirstSubscription (IAsyncEnumerable streaming)
    ├── ClientMetrics (p95/p99, 1000-sample buffer)
    └── Disposal (session disconnect, channel cleanup)
LmxProxyClientBuilder (fluent builder, Polly v8 resilience pipeline)
ILmxProxyClientFactory + LmxProxyClientFactory (config-based creation)
ServiceCollectionExtensions (DI registrations)
StreamingExtensions (batched reads/writes, parallel processing)
Domain/
├── ScadaContracts.cs (IScadaService + all DataContract messages)
├── Quality.cs, QualityExtensions.cs
├── Vtq.cs
└── ConnectionState.cs
```
### 3.3 Wire Compatibility
The `.proto` file is the single source of truth for the wire format. Host generates server stubs from it. Client implements code-first contracts (`[DataContract]`/`[ServiceContract]`) that mirror the proto exactly — same field numbers, names, nesting, and streaming shapes. Cross-stack serialization tests verify compatibility.
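Why field numbers matter for wire compatibility can be shown with a hand-rolled encoder. This is a minimal sketch of protobuf's varint field encoding, written in Python purely as compact notation; the function names are illustrative and the field numbers are hypothetical, not taken from `scada.proto`.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_uint_field(field_number: int, value: int) -> bytes:
    """Encode one varint field: tag = (field_number << 3) | wire_type 0."""
    tag = (field_number << 3) | 0
    return encode_varint(tag) + encode_varint(value)


# Two independent stacks agree on the bytes only if they agree on the field number.
host_bytes = encode_uint_field(1, 150)      # hypothetical field number 1
client_bytes = encode_uint_field(1, 150)    # same number: identical bytes
mismatched = encode_uint_field(2, 150)      # different number: different bytes
```

Field names never appear in the binary encoding, so a mismatched field number on the code-first side would silently decode into the wrong member. That is the failure the cross-stack serialization tests are meant to catch.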
## 4. Protocol (v2)
### 4.1 TypedValue System
Protobuf `oneof` carrying native types:
| Case | Proto Type | .NET Type |
|------|-----------|-----------|
| bool_value | bool | bool |
| int32_value | int32 | int |
| int64_value | int64 | long |
| float_value | float | float |
| double_value | double | double |
| string_value | string | string |
| bytes_value | bytes | byte[] |
| datetime_value | int64 (UTC Ticks) | DateTime |
| array_value | ArrayValue | typed arrays |
Unset `oneof` = null. No string serialization heuristics.
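As an illustration of the table above, the following sketch models a `TypedValue` as a tagged union and shows the `datetime_value` conversion to UTC ticks. It is written in Python only as compact notation; the class shape and helper names are assumptions, not the generated proto types. Ticks follow the .NET definition (100 ns intervals since 0001-01-01T00:00:00).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Optional

# .NET DateTime.Ticks counts 100 ns intervals since this instant.
_DOTNET_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class TypedValue:
    """Hypothetical mirror of the proto oneof: at most one case is set."""
    case: Optional[str] = None  # e.g. "int32_value"; None means unset (null)
    value: Any = None

    @property
    def is_null(self) -> bool:
        return self.case is None


def to_utc_ticks(dt: datetime) -> int:
    """Convert a timezone-aware datetime to .NET-style UTC ticks."""
    delta = dt.astimezone(timezone.utc) - _DOTNET_EPOCH
    return (delta // timedelta(microseconds=1)) * 10


def from_datetime(dt: datetime) -> TypedValue:
    return TypedValue("datetime_value", to_utc_ticks(dt))


NULL = TypedValue()  # unset oneof
unix_epoch = from_datetime(datetime(1970, 1, 1, tzinfo=timezone.utc))
```

Because the case travels with the value, a receiver never has to guess whether `"1"` was meant as a string, an integer, or a boolean.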
### 4.2 COM Variant Coercion Table
| COM Variant Type | TypedValue Case | Notes |
|-----------------|-----------------|-------|
| VT_BOOL | bool_value | |
| VT_I2 (short) | int32_value | Widened |
| VT_I4 (int) | int32_value | |
| VT_I8 (long) | int64_value | |
| VT_UI2 (ushort) | int32_value | Widened |
| VT_UI4 (uint) | int64_value | Widened to avoid sign issues |
| VT_UI8 (ulong) | int64_value | Overflow risk; logged if value > long.MaxValue |
| VT_R4 (float) | float_value | |
| VT_R8 (double) | double_value | |
| VT_BSTR (string) | string_value | |
| VT_DATE (DateTime) | datetime_value | Converted to UTC Ticks |
| VT_DECIMAL | double_value | Precision loss logged |
| VT_CY (Currency) | double_value | |
| VT_NULL, VT_EMPTY, DBNull | unset oneof | Represents null |
| VT_ARRAY | array_value | Element type determines ArrayValue field |
| VT_UNKNOWN | string_value | ToString() fallback, logged as warning |
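The scalar rows of the coercion table can be sketched as a lookup plus a few special cases. This is an illustrative Python sketch, not the Host implementation: the function name, the logger name, and the string variant tags are hypothetical. How an out-of-range `VT_UI8` is stored is not stated in the table, so the two's-complement wrap below is an assumption (it mirrors an unchecked C# cast). `VT_DATE` and `VT_ARRAY` are deliberately left out.

```python
import logging
from decimal import Decimal
from typing import Any, Optional, Tuple

log = logging.getLogger("lmxproxy.coercion")  # hypothetical logger name
INT64_MAX = 2**63 - 1

# Variant type name -> TypedValue case, per the table (scalar rows only).
_SCALAR_CASES = {
    "VT_BOOL": "bool_value",
    "VT_I2": "int32_value", "VT_I4": "int32_value", "VT_UI2": "int32_value",
    "VT_I8": "int64_value", "VT_UI4": "int64_value", "VT_UI8": "int64_value",
    "VT_R4": "float_value", "VT_R8": "double_value",
    "VT_BSTR": "string_value",
    "VT_DECIMAL": "double_value", "VT_CY": "double_value",
}
_CASTS = {
    "bool_value": bool, "int32_value": int, "int64_value": int,
    "float_value": float, "double_value": float, "string_value": str,
}


def coerce(vt: str, raw: Any) -> Tuple[Optional[str], Any]:
    """Map (variant type name, raw value) to (TypedValue case, value)."""
    if vt in ("VT_NULL", "VT_EMPTY") or raw is None:
        return None, None  # unset oneof represents null
    if vt in ("VT_DATE", "VT_ARRAY"):
        raise NotImplementedError(f"{vt} is outside the scope of this sketch")
    case = _SCALAR_CASES.get(vt)
    if case is None:
        # VT_UNKNOWN and anything unmapped: string fallback, logged as warning.
        log.warning("no mapping for %s, falling back to string", vt)
        return "string_value", str(raw)
    if vt == "VT_UI8" and raw > INT64_MAX:
        log.warning("VT_UI8 value %d exceeds long.MaxValue", raw)
        raw -= 2**64  # assumption: wrap like an unchecked cast
    if vt == "VT_DECIMAL":
        log.info("VT_DECIMAL narrowed to double; precision may be lost")
    return case, _CASTS[case](raw)
```

Keeping the mapping in one table makes it straightforward to unit-test every row against the design document.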
### 4.3 QualityCode System
`status_code` (uint32, OPC UA-compatible) is canonical. `symbolic_name` is derived from a lookup table, never set independently.
The category is derived from the high-order bits of `status_code` (the top two bits in OPC UA):
- `0x00xxxxxx` = Good
- `0x40xxxxxx` = Uncertain
- `0x80xxxxxx` = Bad
Domain `Quality` enum uses byte values for the low-order byte, with extension methods `IsGood()`, `IsBad()`, `IsUncertain()`.
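The derivation rule can be sketched as follows. This is a hedged Python illustration: the function names are hypothetical, the lookup table holds only the three generic OPC UA codes, and the handling of the reserved `0b11` bit pattern and of codes missing from the table are assumptions.

```python
# Illustrative subset of a status_code -> symbolic_name lookup table.
_SYMBOLIC_NAMES = {
    0x00000000: "Good",
    0x40000000: "Uncertain",
    0x80000000: "Bad",
}


def category(status_code: int) -> str:
    """Classify a status code by its top two bits."""
    top_bits = (status_code >> 30) & 0b11
    # Assumption: the reserved pattern 0b11 is treated as Bad.
    return {0b00: "Good", 0b01: "Uncertain"}.get(top_bits, "Bad")


def symbolic_name(status_code: int) -> str:
    """Derive the name from status_code; it is never stored independently."""
    # Assumption: codes missing from the table fall back to the category name.
    return _SYMBOLIC_NAMES.get(status_code, category(status_code))


def is_good(status_code: int) -> bool:
    return category(status_code) == "Good"


def is_bad(status_code: int) -> bool:
    return category(status_code) == "Bad"
```

Because `symbolic_name` is a pure function of `status_code`, the two can never disagree on the wire.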
### 4.4 Error Model
| Error Type | Mechanism | Examples |
|-----------|-----------|----------|
| Infrastructure | gRPC StatusCode | Unauthenticated (bad API key), PermissionDenied (ReadOnly write), InvalidArgument (bad session), Unavailable (MxAccess down) |
| Business outcome | Payload `success`/`message` fields | Tag read failure, write type mismatch, batch partial failure, WriteBatchAndWait flag timeout |
| Subscription | gRPC StatusCode on stream | Unauthenticated (invalid session), Internal (unexpected error) |
## 5. COM Threading Model
MxAccess is an STA COM component. All COM operations execute on a **dedicated STA thread** with a `BlockingCollection<Action>` dispatch queue:
- `MxAccessClient` creates a single STA thread at construction
- All COM calls (connect, read, write, subscribe, disconnect) are dispatched to this thread via the queue
- Callers await a `TaskCompletionSource<T>` that the STA thread completes after the COM call
- The STA thread runs a message pump loop (`Application.Run` or manual `MSG` pump)
- On disposal, a sentinel is enqueued and the thread joins with a 10-second timeout
This replaces the fragile `Task.Run` + `SemaphoreSlim` pattern in the reference code.
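The dispatch pattern described above can be sketched in a language-neutral way. The Python below is an analogy only: `queue.Queue` stands in for `BlockingCollection<Action>`, `concurrent.futures.Future` for `TaskCompletionSource<T>`, and the class name is hypothetical. It does not run a Windows message pump, which the real STA thread needs.

```python
import queue
import threading
from concurrent.futures import Future
from typing import Any, Callable

_STOP = object()  # sentinel enqueued on disposal


class SingleThreadDispatcher:
    """Runs every submitted callable on one dedicated worker thread."""

    def __init__(self) -> None:
        self._queue: "queue.Queue[Any]" = queue.Queue()
        self._thread = threading.Thread(
            target=self._run, name="sta-dispatch", daemon=True
        )
        self._thread.start()

    def _run(self) -> None:
        while True:
            item = self._queue.get()  # blocks until work arrives
            if item is _STOP:
                return
            func, future = item
            try:
                future.set_result(func())
            except Exception as exc:  # surface failures to the caller
                future.set_exception(exc)

    def invoke(self, func: Callable[[], Any]) -> Future:
        """Queue func for the worker thread; the caller waits on the Future."""
        future: Future = Future()
        self._queue.put((func, future))
        return future

    def dispose(self, timeout: float = 10.0) -> bool:
        """Enqueue the sentinel and join; returns False if the join timed out."""
        self._queue.put(_STOP)
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

A caller would write something like `dispatcher.invoke(lambda: com_read(tag)).result()`, where `com_read` is a placeholder for the actual COM call. Since every call funnels through one queue, calls are serialized in submission order without any caller-side locking.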
## 6. Session Lifecycle
- Sessions created on `Connect` with GUID "N" format (32-char hex)
- Tracked in `ConcurrentDictionary<string, SessionInfo>`
- **Inactivity scavenging**: sessions not accessed for 5 minutes are automatically terminated. Client keep-alive pings (30s) keep legitimate sessions alive.
- On termination: subscriptions cleaned up, session removed from dictionary
- All sessions lost on service restart (in-memory only)
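The lifecycle rules above can be sketched with an injectable clock, which makes the scavenging logic easy to unit-test. This Python sketch is illustrative only: the class and method names are assumptions, thread safety (provided by `ConcurrentDictionary` in the design) is not modeled, and subscription cleanup is left to the caller.

```python
import time
import uuid
from typing import Callable, Dict, List


class SessionManager:
    """In-memory session table with inactivity scavenging."""

    def __init__(
        self,
        timeout_s: float = 300.0,  # 5-minute inactivity limit
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def connect(self) -> str:
        """Create a session; the id is 32 hex chars, like GUID "N" format."""
        session_id = uuid.uuid4().hex
        self._last_seen[session_id] = self._clock()
        return session_id

    def touch(self, session_id: str) -> bool:
        """Record activity; returns False for unknown or expired sessions."""
        if session_id not in self._last_seen:
            return False
        self._last_seen[session_id] = self._clock()
        return True

    def scavenge(self) -> List[str]:
        """Remove and return sessions idle for longer than the timeout."""
        now = self._clock()
        expired = [
            sid for sid, seen in self._last_seen.items()
            if now - seen > self._timeout_s
        ]
        for sid in expired:
            del self._last_seen[sid]
        return expired
```

With a 30-second keep-alive, a healthy client touches its session roughly ten times per timeout window, so only abandoned sessions reach the scavenger.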
## 7. Subscription Semantics
- **Shared MxAccess subscriptions**: first client to subscribe creates the underlying MxAccess subscription. Last to unsubscribe disposes it. Ref-counted.
- **Sampling rate**: when multiple clients subscribe to the same tag with different `sampling_ms`, the fastest (lowest non-zero) rate is used for the MxAccess subscription. All clients receive updates at this rate.
- **Per-client channels**: each client gets an independent `BoundedChannel<VtqMessage>` (capacity 1000, DropOldest). One slow consumer's drops do not affect other clients.
- **MxAccess disconnect**: all subscribed clients receive a bad-quality notification for all their subscribed tags.
- **Session termination**: all subscriptions for that session are cleaned up.
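The ref-counting, sampling-rate, and drop-oldest rules above can be sketched together. This Python sketch is illustrative only: class names are hypothetical, `collections.deque(maxlen=...)` stands in for a bounded channel in DropOldest mode, and returning 0 when every subscriber requested a zero rate is an assumption.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Optional


class ClientChannel:
    """Bounded per-client buffer that discards the oldest update when full."""

    def __init__(self, capacity: int = 1000) -> None:
        self._items: Deque[Any] = deque(maxlen=capacity)

    def publish(self, update: Any) -> None:
        self._items.append(update)  # a full deque drops from the left

    def drain(self) -> List[Any]:
        items = list(self._items)
        self._items.clear()
        return items


class SharedSubscriptions:
    """Tracks which clients want each tag and at what sampling rate."""

    def __init__(self) -> None:
        self._rates: Dict[str, Dict[str, int]] = {}  # tag -> {client: ms}

    def subscribe(self, tag: str, client_id: str, sampling_ms: int) -> bool:
        """Returns True if this is the first subscriber (create underlying sub)."""
        first = tag not in self._rates
        self._rates.setdefault(tag, {})[client_id] = sampling_ms
        return first

    def unsubscribe(self, tag: str, client_id: str) -> bool:
        """Returns True if the last subscriber left (dispose underlying sub)."""
        clients = self._rates.get(tag)
        if clients is None:
            return False
        clients.pop(client_id, None)
        if not clients:
            del self._rates[tag]
            return True
        return False

    def effective_rate(self, tag: str) -> Optional[int]:
        """Lowest non-zero sampling_ms across subscribers; None if unsubscribed."""
        clients = self._rates.get(tag)
        if clients is None:
            return None
        non_zero = [ms for ms in clients.values() if ms > 0]
        return min(non_zero) if non_zero else 0
```

Giving each client its own buffer is what isolates slow consumers: a full buffer loses only that client's oldest updates.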
## 8. Authentication
- `x-api-key` gRPC metadata header is the authoritative authentication mechanism
- `ConnectRequest.api_key` is accepted but the interceptor is the enforcement point
- API keys loaded from JSON file with FileSystemWatcher hot-reload (1-second debounce)
- Auto-generates default file with two random keys (ReadOnly + ReadWrite) if missing
- Write-protected RPCs: Write, WriteBatch, WriteBatchAndWait
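The interceptor's decision can be sketched as a pure function over the request metadata. This Python sketch is illustrative: the function and enum names are hypothetical, and the returned strings stand for the gRPC status codes named in the error model (Unauthenticated, PermissionDenied).

```python
from enum import Enum
from typing import Mapping, Optional


class Role(Enum):
    READ_ONLY = "ReadOnly"
    READ_WRITE = "ReadWrite"


# RPCs that require a ReadWrite key.
WRITE_RPCS = frozenset({"Write", "WriteBatch", "WriteBatchAndWait"})


def authorize(
    metadata: Mapping[str, str],
    method: str,
    keys: Mapping[str, Role],
) -> Optional[str]:
    """Return None to allow the call, else the status code to fail it with."""
    role = keys.get(metadata.get("x-api-key", ""))
    if role is None:
        return "UNAUTHENTICATED"  # missing or unknown API key
    if method in WRITE_RPCS and role is not Role.READ_WRITE:
        return "PERMISSION_DENIED"  # ReadOnly key on a write RPC
    return None


# Hypothetical key store; real keys come from the hot-reloaded JSON file.
KEYS = {"ro-key": Role.READ_ONLY, "rw-key": Role.READ_WRITE}
```

Keeping the check in one function over the header means `ConnectRequest.api_key` never needs to be consulted for enforcement.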
## 9. Phasing
| Phase | Scope | Depends On |
|-------|-------|------------|
| 1 | Protocol & Domain Types | — |
| 2 | Host Core (MxAccessClient, SessionManager, SubscriptionManager) | Phase 1 |
| 3 | Host gRPC Server, Security, Configuration, Service Hosting | Phase 2 |
| 4 | Host Health, Metrics, Status Server | Phase 3 |
| 5 | Client Core | Phase 1 |
| 6 | Client Extras (Builder, Factory, DI, Streaming) | Phase 5 |
| 7 | Integration Tests & Deployment | Phases 4 + 6 |
Phases 2-4 (Host) and 5-6 (Client) can proceed in parallel after Phase 1.
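The dependency table can be checked mechanically. The sketch below, using Python's standard `graphlib`, groups the phases into waves that could run concurrently; the dictionary simply transcribes the "Depends On" column, and the function name is illustrative.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Phase -> phases it depends on, transcribed from the table above.
DEPENDS_ON: Dict[int, Set[int]] = {
    1: set(),
    2: {1},
    3: {2},
    4: {3},
    5: {1},
    6: {5},
    7: {4, 6},
}


def waves(graph: Dict[int, Set[int]]) -> List[List[int]]:
    """Group nodes into waves; each wave only needs earlier waves to finish."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the table were circular
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


schedule = waves(DEPENDS_ON)
```

Under these dependencies the Host track (2, 3, 4) is one step longer than the Client track (5, 6), so Phase 7 is gated by Phase 4.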
## 10. Guardrails
1. **Proto is the source of truth** — any wire format question is resolved by reading `scada.proto`, not the code-first contracts.
2. **No v1 code in the new build** — reference only. Do not copy-paste and modify; write fresh.
3. **Cross-stack tests in Phase 1** — Host proto serialize → Client code-first deserialize (and vice versa) before any business logic.
4. **COM calls only on STA thread** — no `Task.Run` for COM operations. All go through the dispatch queue.
5. **status_code is canonical for quality** — `symbolic_name` is always derived, never independently set.
6. **Unit tests before integration** — every phase includes unit tests. Integration tests are Phase 7 only.
7. **Each phase must compile and pass tests** before the next phase begins.
8. **No string serialization heuristics** — v2 uses native TypedValue. No `double.TryParse` or `bool.TryParse` on values.
## 11. Resolved Conflicts
| Conflict | Resolution |
|----------|-----------|
| WriteBatchAndWait signature (MxAccessClient vs Protocol) | Follow Protocol spec: write items, poll flagTag for flagValue. IScadaClient interface matches protocol semantics. |
| Builder default port 5050 vs Host 50051 | Standardize builder default to 50051 |
| Auth in metadata vs payload | x-api-key header is authoritative; ConnectRequest.api_key accepted but interceptor enforces |
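The WriteBatchAndWait resolution (write the items, then poll `flagTag` until it equals `flagValue`) can be sketched as below. This is a hedged Python illustration: the function signature, the result type, and the timeout and poll-interval defaults are all hypothetical. In line with the error model, a flag timeout is reported in the payload rather than raised.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass
class WriteBatchAndWaitResult:
    success: bool
    message: str


def write_batch_and_wait(
    write: Callable[[str, Any], None],
    read: Callable[[str], Any],
    items: Mapping[str, Any],
    flag_tag: str,
    flag_value: Any,
    timeout_s: float = 5.0,         # hypothetical default
    poll_interval_s: float = 0.1,   # hypothetical default
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> WriteBatchAndWaitResult:
    """Write every item, then poll flag_tag until it equals flag_value."""
    for tag, value in items.items():
        write(tag, value)
    deadline = clock() + timeout_s
    while True:
        if read(flag_tag) == flag_value:
            return WriteBatchAndWaitResult(True, "")
        if clock() >= deadline:
            # Business outcome, not an infrastructure error: report in payload.
            return WriteBatchAndWaitResult(
                False, f"timed out waiting for {flag_tag}"
            )
        sleep(poll_interval_s)
```

Injecting `clock` and `sleep` keeps the polling loop testable without real delays.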
## 12. Reference Code
The existing code from `src/` is retained as `src-reference/` for consultation only:
- `src-reference/ZB.MOM.WW.LmxProxy.Host/` — v1 Host implementation
- `src-reference/ZB.MOM.WW.LmxProxy.Client/` — v1 Client implementation
Key reference files for COM interop patterns:
- `Implementation/MxAccessClient.Connection.cs` — COM object lifecycle
- `Implementation/MxAccessClient.EventHandlers.cs` — MxAccess callbacks
- `Implementation/MxAccessClient.Subscription.cs` — Advise/Unadvise patterns