docs: add LmxProxy requirements documentation with v2 protocol as authoritative design

Generate high-level requirements and 10 component documents derived from source code
and protocol specs. Uses lmxproxy_updates.md (v2 TypedValue/QualityCode) as the source
of truth, with v1 string-based encoding documented as legacy context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-03-21 22:38:11 -04:00
parent 970d0a5cb3
commit 683aea0fbe
12 changed files with 1702 additions and 0 deletions

@@ -0,0 +1,274 @@
# LmxProxy - High Level Requirements
## 1. System Purpose
LmxProxy is a gRPC proxy service that bridges SCADA clients to AVEVA System Platform (Wonderware) via the ArchestrA MXAccess COM API. It exists because MXAccess is a 32-bit COM component that requires co-location with System Platform on a Windows machine running .NET Framework 4.8. LmxProxy isolates this constraint behind a gRPC interface, allowing modern .NET clients to access System Platform data remotely over HTTP/2.
## 2. Architecture
### 2.1 Two-Project Structure
- **ZB.MOM.WW.LmxProxy.Host** — .NET Framework 4.8, x86-only Windows service. Hosts a gRPC server (Grpc.Core) fronting the MXAccess COM API. Runs on the same machine as AVEVA System Platform.
- **ZB.MOM.WW.LmxProxy.Client** — .NET 10, AnyCPU class library. Code-first gRPC client (protobuf-net.Grpc) consumed by ScadaLink's Data Connection Layer. Packaged as a NuGet library.
### 2.2 Dual gRPC Stacks
The two projects use different gRPC implementations that are wire-compatible:
- **Host**: Proto-file-generated code via `Grpc.Core` + `Grpc.Tools`. Uses the deprecated C-core gRPC library because the grpc-dotnet server (`Grpc.AspNetCore.Server`) requires ASP.NET Core and does not run on .NET Framework 4.8.
- **Client**: Code-first contracts via `protobuf-net.Grpc` with `[DataContract]`/`[ServiceContract]` attributes over `Grpc.Net.Client`.
Both target the same `scada.ScadaService` gRPC service definition.
### 2.3 Deployment Model
- The Host service runs on the AVEVA System Platform machine (or any machine with MXAccess access).
- Clients connect remotely over gRPC (HTTP/2) on a configurable port (default 50051).
- The Host runs as a Windows service managed by Topshelf.
## 3. Communication Protocol
### 3.1 Transport
- gRPC over HTTP/2.
- Default server port: 50051.
- Optional TLS with mutual TLS (mTLS) support.
### 3.2 RPCs
The service exposes 10 RPCs:
| RPC | Type | Description |
|-----|------|-------------|
| Connect | Unary | Establish session, returns session ID |
| Disconnect | Unary | Terminate session |
| GetConnectionState | Unary | Query MxAccess connection status |
| Read | Unary | Read single tag value |
| ReadBatch | Unary | Read multiple tag values |
| Write | Unary | Write single tag value |
| WriteBatch | Unary | Write multiple tag values |
| WriteBatchAndWait | Unary | Write values, poll flag tag until match or timeout |
| Subscribe | Server streaming | Stream tag value updates to client |
| CheckApiKey | Unary | Validate API key and return role |
### 3.3 Data Model (VTQ)
All tag values are represented as VTQ (Value, Timestamp, Quality) tuples:
- **Value**: `TypedValue` — a protobuf `oneof` carrying the value in its native type (bool, int32, int64, float, double, string, bytes, datetime, typed arrays). An unset `oneof` represents null.
- **Timestamp**: UTC `DateTime.Ticks` as `int64` (100-nanosecond intervals since 0001-01-01 00:00:00 UTC).
- **Quality**: `QualityCode` — a structured message with `uint32 status_code` (OPC UA-compatible) and `string symbolic_name`. Category derived from high bits: `0x00xxxxxx` = Good, `0x40xxxxxx` = Uncertain, `0x80xxxxxx` = Bad.
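
As a rough illustration of the timestamp and quality encodings above, the following Python sketch decodes both fields. The function names are hypothetical and not part of LmxProxy. The quality classification assumes the OPC UA convention that the two most significant bits of the status code carry the severity.

```python
from datetime import datetime, timedelta, timezone

# DateTime.Ticks counts 100-nanosecond intervals from 0001-01-01 00:00:00 UTC.
_TICKS_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)


def ticks_to_datetime(ticks: int) -> datetime:
    """Convert UTC ticks to a datetime; precision below 1 microsecond is dropped."""
    return _TICKS_EPOCH + timedelta(microseconds=ticks // 10)


def quality_category(status_code: int) -> str:
    """Classify a status code by its top two bits (assumed OPC UA severity layout)."""
    severity = (status_code >> 30) & 0b11
    return {0b00: "Good", 0b01: "Uncertain", 0b10: "Bad"}.get(severity, "Unknown")
```

For example, `quality_category(0x80320000)` yields `"Bad"` because the top bit is set.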
## 4. Session Lifecycle
- Clients call `Connect` with a client ID and optional API key to establish a session.
- The server returns a 32-character hex GUID as the session ID.
- All subsequent operations require the session ID for validation.
- Sessions persist until explicit `Disconnect` or server restart. There is no idle timeout.
- Session state is tracked in memory (not persisted). All sessions are lost on service restart.
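
The lifecycle above can be pictured with a small in-memory store. This is a sketch only: `SessionStore` and its methods are hypothetical names, and the real SessionManager may track more per session than a client ID.

```python
import uuid


class SessionStore:
    """In-memory session table; its contents are lost when the process exits."""

    def __init__(self) -> None:
        self._sessions: dict[str, str] = {}  # session ID -> client ID

    def connect(self, client_id: str) -> str:
        # uuid4().hex is a GUID rendered as 32 hex characters without dashes.
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = client_id
        return session_id

    def is_valid(self, session_id: str) -> bool:
        return session_id in self._sessions

    def disconnect(self, session_id: str) -> bool:
        """Remove the session; returns False if it was not known."""
        return self._sessions.pop(session_id, None) is not None
```

There is deliberately no expiry logic here, matching the statement that sessions have no idle timeout.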
## 5. Authentication & Authorization
### 5.1 API Key Authentication
- API keys are validated via the `x-api-key` gRPC metadata header.
- Keys are stored in a JSON file (`apikeys.json` by default) with hot-reload via FileSystemWatcher (1-second debounce).
- If no API key file exists, the service auto-generates a default file with two random keys (one ReadOnly, one ReadWrite).
- Authentication is enforced at the gRPC interceptor level before any service method executes.
### 5.2 Role-Based Authorization
Two roles with hierarchical permissions:
| Role | Read | Subscribe | Write |
|------|------|-----------|-------|
| ReadOnly | Yes | Yes | No |
| ReadWrite | Yes | Yes | Yes |
Write-protected methods: `Write`, `WriteBatch`, `WriteBatchAndWait`. A ReadOnly key attempting a write receives `StatusCode.PermissionDenied`.
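
A minimal sketch of the role check implied by the table, in Python for illustration. The function name is hypothetical, and denying unknown roles is an assumption rather than documented behavior.

```python
WRITE_METHODS = frozenset({"Write", "WriteBatch", "WriteBatchAndWait"})


def is_allowed(role: str, method: str) -> bool:
    """Return True if the role may call the given RPC method."""
    if role == "ReadWrite":
        return True
    if role == "ReadOnly":
        return method not in WRITE_METHODS
    return False  # assumption: roles outside the table are rejected
```

In the service, a `False` result for a write method corresponds to the `StatusCode.PermissionDenied` response described above.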
### 5.3 TLS/Security
- TLS is optional and is disabled in the default configuration (`Tls.Enabled` is `false`). Note that the config class itself initializes `Tls.Enabled` to `true`, so the two defaults disagree.
- Supports server TLS and mutual TLS (client certificate validation).
- Client CA certificate path configurable for mTLS.
- Certificate revocation checking is optional.
- Client library supports TLS 1.2 and TLS 1.3, custom CA trust stores, self-signed certificate allowance, and server name override.
## 6. Operations
### 6.1 Read
- Single tag read with configurable retry policy.
- Batch read with semaphore-controlled concurrency (default max 10 concurrent operations).
- Read timeout: 5 seconds (configurable).
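
Semaphore-controlled batch concurrency can be sketched as follows. This uses Python's `asyncio` purely for illustration; `read_batch` and `read_one` are hypothetical names, and the Host's actual implementation is not shown in this document.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def read_batch(
    tags: Iterable[str],
    read_one: Callable[[str], Awaitable[Any]],
    max_concurrent: int = 10,
) -> dict:
    """Read all tags, allowing at most max_concurrent reads in flight at once."""
    gate = asyncio.Semaphore(max_concurrent)

    async def guarded(tag: str):
        async with gate:  # waits here while max_concurrent reads are active
            return tag, await read_one(tag)

    pairs = await asyncio.gather(*(guarded(tag) for tag in tags))
    return dict(pairs)
```

The semaphore bounds load on the underlying connection without forcing the batch to run serially.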
### 6.2 Write
- Single tag write with retry policy. Values are sent as `TypedValue` (native protobuf types). Type mismatches between the value and the tag's expected type return a write failure.
- Batch write with semaphore-controlled concurrency.
- Write timeout: 5 seconds (configurable).
- WriteBatchAndWait: writes a batch, then polls the flag tag at a configurable interval until its value matches the expected flag value (type-aware comparison via `TypedValueEquals`) or a timeout expires. Default timeout: 5000ms, default poll interval: 100ms. Timeout is not an error — returns `flag_reached=false`.
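
The polling phase of WriteBatchAndWait might look like the sketch below. The names are hypothetical, and the strict type check in `typed_value_equals` is an assumption about what "type-aware comparison" means; the behavior of the real `TypedValueEquals` is defined by the protocol specification.

```python
import time
from typing import Any, Callable


def typed_value_equals(a: Any, b: Any) -> bool:
    """Equal only if both type and value match (assumed semantics)."""
    return type(a) is type(b) and a == b


def wait_for_flag(
    read_flag: Callable[[], Any],
    expected: Any,
    timeout_ms: int = 5000,
    poll_interval_ms: int = 100,
) -> bool:
    """Poll read_flag until it matches expected; False on timeout, never raises for it."""
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        if typed_value_equals(read_flag(), expected):
            return True
        if time.monotonic() >= deadline:
            return False  # maps to flag_reached=false
        time.sleep(poll_interval_ms / 1000)
```

Under this assumption a boolean `True` does not match an integer `1`, which is the kind of mismatch a type-aware comparison is meant to catch.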
### 6.3 Subscribe
- Server-streaming RPC. Client sends a list of tags and a sampling interval (in milliseconds).
- Server maintains a per-client bounded channel (default capacity 1000 messages).
- Updates are pushed as `VtqMessage` items on the stream.
- When the MxAccess connection drops, all subscribed clients receive a bad-quality notification.
- Subscriptions are cleaned up on client disconnect. When the last client unsubscribes from a tag, the underlying MxAccess subscription is disposed.
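
To illustrate the bounded per-client channel, here is a small sketch that drops the oldest item when full, which is the `DropOldest` mode listed as the default in the configuration table. The class name and the `dropped` counter are hypothetical.

```python
from collections import deque


class DropOldestChannel:
    """Bounded buffer that discards the oldest item when a new one arrives at capacity."""

    def __init__(self, capacity: int = 1000) -> None:
        self._items = deque(maxlen=capacity)
        self.dropped = 0  # illustrative counter, not a documented metric

    def publish(self, item) -> None:
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # deque will evict the leftmost item on append
        self._items.append(item)

    def drain(self) -> list:
        """Return and clear everything currently buffered, oldest first."""
        items = list(self._items)
        self._items.clear()
        return items
```

A slow consumer therefore sees the most recent updates rather than stalling the producer.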
## 7. Connection Resilience
### 7.1 Host Auto-Reconnect
- If the MxAccess connection is lost, the Host automatically attempts reconnection at a fixed interval (default 5 seconds).
- Stored subscriptions are recreated after a successful reconnect.
- Auto-reconnect is configurable (`Connection.AutoReconnect`, default true).
### 7.2 Client Keep-Alive
- The client sends a lightweight `GetConnectionState` ping every 30 seconds.
- On keep-alive failure, the client marks the connection as disconnected and cleans up subscriptions.
### 7.3 Client Retry Policy
- Polly-based exponential backoff retry.
- Default: 3 attempts with 1-second initial delay (1s → 2s → 4s).
- Transient errors retried: Unavailable, DeadlineExceeded, ResourceExhausted, Aborted.
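
The backoff schedule can be sketched without Polly as a hand-rolled loop. This is a stand-in for illustration, not the client's implementation: `RpcError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the sketch assumes "3 attempts" means three retries after the initial call, since three delays are listed.

```python
import time

TRANSIENT = frozenset({"Unavailable", "DeadlineExceeded", "ResourceExhausted", "Aborted"})


class RpcError(Exception):
    """Stand-in for a gRPC error carrying a status code name."""

    def __init__(self, status: str) -> None:
        super().__init__(status)
        self.status = status


def backoff_delays(retries: int = 3, initial: float = 1.0) -> list:
    """Delay before each retry, doubling every time: 1s, 2s, 4s by default."""
    return [initial * 2**i for i in range(retries)]


def call_with_retry(op, retries: int = 3, initial: float = 1.0, sleep=time.sleep):
    for delay in backoff_delays(retries, initial):
        try:
            return op()
        except RpcError as err:
            if err.status not in TRANSIENT:
                raise  # non-transient errors are not retried
            sleep(delay)
    return op()  # final attempt; any error propagates to the caller
```

Passing `sleep` as a parameter keeps the loop testable without real delays.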
## 8. Health Monitoring & Metrics
### 8.1 Health Checks
Two health check implementations:
- **Basic** (`HealthCheckService`): Checks MxAccess connection state, subscription stats, and operation success rate. Returns Degraded if success rate < 50% (with > 100 operations) or client count > 100.
- **Detailed** (`DetailedHealthCheckService`): Reads a test tag (`System.Heartbeat`). Returns Unhealthy if not connected, Degraded if test tag quality is not Good or timestamp is older than 5 minutes.
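
The thresholds above can be expressed as two small predicates. This is a sketch with hypothetical function names; it covers only the rules stated here and assumes a "Healthy" result when none of them trigger.

```python
from datetime import datetime, timedelta


def basic_is_degraded(total_ops: int, successful_ops: int, client_count: int) -> bool:
    """Degraded if success rate < 50% over more than 100 operations, or > 100 clients."""
    low_success = total_ops > 100 and successful_ops / total_ops < 0.5
    return low_success or client_count > 100


def detailed_status(
    connected: bool, quality_good: bool, tag_timestamp: datetime, now: datetime
) -> str:
    """Evaluate the test-tag check; 'Healthy' as the fallback is an assumption."""
    if not connected:
        return "Unhealthy"
    if not quality_good or now - tag_timestamp > timedelta(minutes=5):
        return "Degraded"
    return "Healthy"
```

The operation-count guard keeps a handful of early failures from marking a freshly started service as degraded.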
### 8.2 Performance Metrics
- Per-operation tracking: Read, ReadBatch, Write, WriteBatch.
- Metrics: total count, success count, success rate, average/min/max latency, 95th percentile latency.
- Rolling buffer of 1000 latency samples per operation for percentile calculation.
- Metrics reported to logs every 60 seconds.
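
A rolling latency window with a percentile query could be sketched as below. The class name is hypothetical, and the nearest-rank percentile method is an assumption; the document does not state which method the Host uses.

```python
import math
from collections import deque


class LatencyWindow:
    """Keeps the most recent `size` latency samples for percentile queries."""

    def __init__(self, size: int = 1000) -> None:
        self._samples = deque(maxlen=size)  # oldest sample is evicted when full

    def record(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)

    def percentile(self, pct: int):
        """Nearest-rank percentile, or None if no samples have been recorded."""
        if not self._samples:
            return None
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]
```

Sorting 1000 samples on each query is cheap enough for a report emitted once per minute.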
### 8.3 Status Web Server
- HTTP status server on port 8080 (configurable).
- Endpoints:
  - `GET /`: HTML dashboard with auto-refresh (30 seconds), color-coded status cards, operations table.
  - `GET /api/status`: JSON status report.
  - `GET /api/health`: Plain text `OK` (200) or `UNHEALTHY` (503).
### 8.4 Client Metrics
- Per-operation counts, error counts, and latency tracking (average, p95, p99).
- Rolling buffer of 1000 latency samples.
- Exposed via `ILmxProxyClient.GetMetrics()`.
## 9. Service Hosting
### 9.1 Topshelf Windows Service
- Service name: `ZB.MOM.WW.LmxProxy.Host`
- Display name: `SCADA Bridge LMX Proxy`
- Starts automatically on boot.
### 9.2 Service Recovery (Windows SCM)
| Failure | Restart Delay |
|---------|--------------|
| First | 1 minute |
| Second | 5 minutes |
| Subsequent | 10 minutes |

The failure count resets after 1 day.
### 9.3 Startup Sequence
1. Load configuration from `appsettings.json` + environment variables.
2. Configure Serilog (console + file sinks).
3. Validate configuration.
4. Check/generate TLS certificates (if TLS enabled).
5. Initialize services: PerformanceMetrics, ApiKeyService, MxAccessClient, SubscriptionManager, SessionManager, HealthCheckService, StatusReportService.
6. Connect to MxAccess synchronously (timeout: 30 seconds).
7. Start auto-reconnect monitor loop (if enabled).
8. Start gRPC server on configured port.
9. Start HTTP status web server.
### 9.4 Shutdown Sequence
1. Cancel reconnect monitor (5-second wait).
2. Graceful gRPC server shutdown (10-second timeout, then kill).
3. Stop status web server (5-second wait).
4. Dispose all components in reverse order.
5. Disconnect from MxAccess (10-second timeout).
## 10. Configuration
All configuration is via `appsettings.json` bound to `LmxProxyConfiguration`. Key settings:
| Section | Setting | Default |
|---------|---------|---------|
| Root | GrpcPort | 50051 |
| Root | ApiKeyConfigFile | `apikeys.json` |
| Connection | MonitorIntervalSeconds | 5 |
| Connection | ConnectionTimeoutSeconds | 30 |
| Connection | ReadTimeoutSeconds | 5 |
| Connection | WriteTimeoutSeconds | 5 |
| Connection | MaxConcurrentOperations | 10 |
| Connection | AutoReconnect | true |
| Subscription | ChannelCapacity | 1000 |
| Subscription | ChannelFullMode | DropOldest |
| Tls | Enabled | false |
| Tls | RequireClientCertificate | false |
| WebServer | Enabled | true |
| WebServer | Port | 8080 |
Configuration is validated at startup. Invalid values cause the service to fail to start.
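
The defaults in the table can be pictured as a nested structure with a fail-fast validation pass. The sketch below is in Python for illustration only. The nesting mirrors the Section/Setting columns, but the specific validation rules (port range, positive values) are assumptions; the actual rules live in the Host's configuration validator.

```python
DEFAULTS = {
    "GrpcPort": 50051,
    "ApiKeyConfigFile": "apikeys.json",
    "Connection": {
        "MonitorIntervalSeconds": 5,
        "ConnectionTimeoutSeconds": 30,
        "ReadTimeoutSeconds": 5,
        "WriteTimeoutSeconds": 5,
        "MaxConcurrentOperations": 10,
        "AutoReconnect": True,
    },
    "Subscription": {"ChannelCapacity": 1000, "ChannelFullMode": "DropOldest"},
    "Tls": {"Enabled": False, "RequireClientCertificate": False},
    "WebServer": {"Enabled": True, "Port": 8080},
}


def validate(cfg: dict) -> list:
    """Return a list of problems; an empty list means the config is acceptable."""
    errors = []
    for name, port in (("GrpcPort", cfg["GrpcPort"]), ("WebServer.Port", cfg["WebServer"]["Port"])):
        if not 1 <= port <= 65535:  # assumed rule
            errors.append(f"{name} must be between 1 and 65535")
    for key, value in cfg["Connection"].items():
        if key != "AutoReconnect" and value <= 0:  # assumed rule
            errors.append(f"Connection.{key} must be positive")
    if cfg["Subscription"]["ChannelCapacity"] <= 0:  # assumed rule
        errors.append("Subscription.ChannelCapacity must be positive")
    return errors
```

A host following the fail-fast behavior described above would refuse to start whenever `validate` returns a non-empty list.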
## 11. Logging
- Serilog with console and file sinks.
- File sink: `logs/lmxproxy-.txt`, daily rolling, 30 files retained.
- Default level: Information. Overrides: Microsoft=Warning, System=Warning, Grpc=Information.
- Enrichment: FromLogContext, WithMachineName, WithThreadId.
## 12. Constraints
- Host **must** target x86 and .NET Framework 4.8 (MXAccess is 32-bit COM).
- Host uses `Grpc.Core` (deprecated C-core library), required because the grpc-dotnet server (`Grpc.AspNetCore.Server`) does not run on .NET Framework 4.8.
- Client targets .NET 10 and runs in ScadaLink central/site clusters.
- MxAccess COM operations require STA thread context (wrapped in `Task.Run`).
- The solution file uses `.slnx` format.
## 13. Protocol
The protocol specification is defined in `lmxproxy_updates.md`, which is the authoritative source of truth. All RPC signatures, message schemas, and behavioral specifications are per that document.
### 13.1 Value System (TypedValue)
Values are transmitted in their native protobuf types via a `TypedValue` oneof: bool, int32, int64, float, double, string, bytes, datetime (int64 UTC Ticks), and typed arrays. An unset oneof represents null. No string serialization or parsing heuristics are used.
### 13.2 Quality System (QualityCode)
Quality is a structured `QualityCode` message with `uint32 status_code` (OPC UA-compatible) and `string symbolic_name`. Supports AVEVA-aligned quality sub-codes (e.g., `BadSensorFailure` = `0x806D0000`, `GoodLocalOverride` = `0x00D80000`, `BadWaitingForInitialData` = `0x80320000`). See Component-Protocol for the full quality code table.
### 13.3 Migration from V1
The current codebase implements the v1 protocol (string-encoded values, three-state string quality). The v2 protocol is a clean break — all clients and servers will be updated simultaneously. No backward compatibility layer. This is appropriate because LmxProxy is an internal protocol with a small, controlled client count.
## 14. Component List (10 Components)
| # | Component | Description |
|---|-----------|-------------|
| 1 | GrpcServer | gRPC service implementation, session validation, request routing |
| 2 | MxAccessClient | MXAccess COM interop wrapper, connection lifecycle, read/write/subscribe |
| 3 | SessionManager | Client session tracking and lifecycle |
| 4 | Security | API key authentication, role-based authorization, TLS management |
| 5 | SubscriptionManager | Tag subscription lifecycle, channel-based update delivery, backpressure |
| 6 | Configuration | appsettings.json structure, validation, options binding |
| 7 | HealthAndMetrics | Health checks, performance metrics, status web server |
| 8 | ServiceHost | Topshelf hosting, startup/shutdown, logging setup, service recovery |
| 9 | Client | LmxProxyClient library, builder, retry, streaming, DI integration |
| 10 | Protocol | gRPC protocol specification, proto definition, code-first contracts |