chore(lmxproxy): switch health probe tag to DevAppEngine.Scheduler.ScanTime, remove temp prompts

The built-in AppEngine tag is always present and updates continuously (~1s),
making it a more reliable probe than a user-deployed TestChildObject tag.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-03-22 07:18:39 -04:00
parent ec21a9a2a0
commit 2c99b370a0
6 changed files with 4 additions and 316 deletions


@@ -1,119 +0,0 @@
# LmxProxy v2 — Fix Failing Write Integration Tests
Run this prompt with Claude Code from the `lmxproxy/` directory.
## Prompt
You are debugging and fixing 2 failing integration tests for the LmxProxy v2 gRPC proxy service. The tests are `WriteAndReadBack` and `WriteBatchAndWait` in the integration test project.
### Context
Read these documents before starting:
1. `CLAUDE.md` — project-level instructions and architecture
2. `docs/deviations.md` — deviation #7 describes the failure (OnWriteComplete COM callback not firing)
3. `mxaccess_documentation.md` — the official MxAccess Toolkit reference. Search for "OnWriteComplete", "Write() method", and "AdviseSupervisory" sections.
Read these source files:
4. `src/ZB.MOM.WW.LmxProxy.Host/MxAccess/MxAccessClient.ReadWrite.cs` — current v2 write implementation
5. `src/ZB.MOM.WW.LmxProxy.Host/MxAccess/MxAccessClient.EventHandlers.cs` — OnWriteComplete callback handler
6. `src/ZB.MOM.WW.LmxProxy.Host/MxAccess/MxAccessClient.Connection.cs` — COM event wiring (`_lmxProxy.OnWriteComplete += OnWriteComplete`)
7. `src-reference/ZB.MOM.WW.LmxProxy.Host/Implementation/MxAccessClient.ReadWrite.cs` — v1 write implementation (for comparison)
8. `src-reference/ZB.MOM.WW.LmxProxy.Host/Implementation/MxAccessClient.EventHandlers.cs` — v1 OnWriteComplete handler
9. `tests/ZB.MOM.WW.LmxProxy.Client.IntegrationTests/WriteTests.cs` — failing WriteAndReadBack test
10. `tests/ZB.MOM.WW.LmxProxy.Client.IntegrationTests/WriteBatchAndWaitTests.cs` — failing WriteBatchAndWait test
### Problem Statement
The `OnWriteComplete` COM callback from MxAccess never fires within the write timeout (default 5s). The write operation follows this flow:
1. `AddItem()` — registers the tag with MxAccess
2. `AdviseSupervisory()` — establishes a supervisory connection for the tag
3. Store a `TaskCompletionSource<bool>` in `_pendingWrites[itemHandle]`
4. `Write()` — sends the write to MxAccess
5. Wait for `OnWriteComplete` callback to resolve the TCS — **this never fires, causing a timeout**
The v1 code used the same pattern and presumably worked, so the issue is either:
- (a) MxAccess completes the write synchronously and never fires `OnWriteComplete` for simple (non-secured, non-verified) writes. The documentation says: "The LMXProxyInterface triggers an event for OnWriteComplete when your program calls the Write() or WriteSecured() function." However, this may not be true in all configurations or for all attribute types.
- (b) The COM event subscription wiring (`_lmxProxy.OnWriteComplete += OnWriteComplete`) is correct syntactically but the callback doesn't fire because the thread that called `Write()` (a thread pool thread via `Task.Run`) isn't pumping COM messages. Note: STA threading was abandoned (see deviation #2 in `docs/deviations.md`) because it caused other issues with `OnDataChange` callbacks.
- (c) There's a difference in how v1 vs v2 initializes or interacts with the COM object that affects event delivery.
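The pending-write flow above (store a `TaskCompletionSource<bool>`, issue the write, wait for the callback or time out) can be sketched roughly as follows. This is a minimal illustration, not the actual `MxAccessClient` code: `PendingWriteTracker` and its method names are hypothetical, and the COM `Write()` call itself is left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-in for the pending-write bookkeeping described above.
// Names are illustrative and not taken from MxAccessClient.
public sealed class PendingWriteTracker
{
    private readonly ConcurrentDictionary<int, TaskCompletionSource<bool>> _pendingWrites =
        new ConcurrentDictionary<int, TaskCompletionSource<bool>>();

    // Flow step 3: register a TCS for the item handle before issuing the write.
    public Task<bool> Register(int itemHandle)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pendingWrites[itemHandle] = tcs;
        return tcs.Task;
    }

    // Called from the write-complete callback, if it ever arrives.
    public bool Complete(int itemHandle, bool success)
    {
        return _pendingWrites.TryRemove(itemHandle, out var tcs) && tcs.TrySetResult(success);
    }

    // Flow step 5: wait for the callback, but give up after the timeout.
    public async Task<bool> WaitAsync(int itemHandle, Task<bool> pending, TimeSpan timeout)
    {
        var finished = await Task.WhenAny(pending, Task.Delay(timeout)).ConfigureAwait(false);
        if (finished == pending)
            return await pending.ConfigureAwait(false);

        // No callback arrived: drop the stale entry and surface the timeout.
        _pendingWrites.TryRemove(itemHandle, out _);
        throw new TimeoutException(
            $"No write-complete callback for handle {itemHandle} within {timeout.TotalSeconds}s.");
    }
}
```

With this shape, a callback that never fires shows up as a `TimeoutException` from `WaitAsync`, which matches the symptom reported for the two failing tests.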
### Investigation Steps
These steps require SSH access to windev where the v2 Host is deployed.
**Step 1: Check if the write actually succeeds despite no callback.**
The `WriteAndReadBack` test writes a value and then reads it back. The test fails because `WriteAsync` throws a `TimeoutException` (OnWriteComplete never fires). But the write itself may have succeeded at the MxAccess level.
SSH to windev (`ssh windev`) and:
- Check the v2 Host service logs in `C:\publish-v2\logs\` for any write-related log entries
- Look for "Write failed", "WriteComplete", or "timeout" messages
**Step 2: Add a fire-and-forget write mode.**
If the write succeeds at the MxAccess level but `OnWriteComplete` never fires, the simplest fix is to bypass the callback wait. Modify `MxAccessClient.ReadWrite.cs`:
- After calling `_lmxProxy.Write()`, immediately resolve the TCS with success instead of waiting for the callback
- Keep the `OnWriteComplete` handler wired up — if it does fire, it can log the result for diagnostics but shouldn't block the write path
- Add a configuration option `WriteConfirmationMode` with values `FireAndForget` (default) and `WaitForCallback`, so the behavior can be switched if needed
The rationale: the MxAccess documentation's sample application (Chapter 6) uses `OnWriteComplete` to detect whether a *secured* or *verified* write is needed, then retries with `WriteSecured()`. For simple supervisory writes (which is what LmxProxy does), the Write() call itself is the confirmation — if it doesn't throw, the write was accepted.
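One possible shape for the `WriteConfirmationMode` switch is sketched below. Only the enum name and its two values come from this prompt; the `WriteConfirmation` helper and the `issueWrite` delegate (standing in for the `_lmxProxy.Write()` call) are assumptions made for illustration.

```csharp
using System;
using System.Threading.Tasks;

// Values named in Step 2. FireAndForget is listed first so it is the default.
public enum WriteConfirmationMode
{
    FireAndForget,
    WaitForCallback
}

// Hypothetical helper; not part of the existing Host code.
public static class WriteConfirmation
{
    // issueWrite stands in for the _lmxProxy.Write() call.
    public static Task<bool> WriteAsync(
        WriteConfirmationMode mode,
        Action issueWrite,
        TaskCompletionSource<bool> callbackTcs)
    {
        issueWrite(); // an exception here means the write was rejected

        if (mode == WriteConfirmationMode.FireAndForget)
        {
            // Accepting the call counts as success; a late callback is only logged.
            callbackTcs.TrySetResult(true);
        }

        // In WaitForCallback mode the task completes when the callback resolves the TCS.
        return callbackTcs.Task;
    }
}
```

Because `TrySetResult` is used, a callback that does arrive later cannot fault the already-completed task; it can simply be logged.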
**Step 3: Implement the fix.**
In `MxAccessClient.ReadWrite.cs`, modify `SetupWriteOperationAsync`:
```
// After _lmxProxy.Write():
// Immediately complete the write — OnWriteComplete may not fire for supervisory writes
tcs.TrySetResult(true);
```
Remove the `_pendingWrites[itemHandle] = tcs` tracking since it's no longer needed for the default path. Keep `OnWriteComplete` wired for logging/diagnostics.
Clean up `WaitForWriteCompletionAsync` — in fire-and-forget mode, the TCS is already completed so the await returns immediately. The cleanup (UnAdvise + RemoveItem) should still happen.
**Step 4: Consider an alternative approach — poll-based write confirmation.**
If fire-and-forget is too loose (we want to confirm the write took effect), consider a poll-based approach similar to `WriteBatchAndWaitAsync`: after writing, read the tag back and compare. This is more reliable than depending on a COM callback. However, this adds latency and may not be needed — the `WriteAndReadBack` test already does a read-back verification.
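A poll-based confirmation could look roughly like the sketch below. It is not the existing `WriteBatchAndWaitAsync` implementation: `PollConfirmation`, `WaitForValueAsync`, and the `readTag` delegate are hypothetical names, and the comparison assumes the read-back value can be checked with the type's default equality.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; not part of the existing Host code.
public static class PollConfirmation
{
    // readTag is an assumed delegate that reads the tag's current value.
    public static async Task<bool> WaitForValueAsync<T>(
        Func<Task<T>> readTag,
        T expected,
        TimeSpan timeout,
        TimeSpan pollInterval,
        CancellationToken ct = default)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            var current = await readTag().ConfigureAwait(false);
            if (EqualityComparer<T>.Default.Equals(current, expected))
                return true; // the write took effect

            // Stop if another poll would run past the deadline.
            if (DateTime.UtcNow + pollInterval > deadline)
                return false;

            await Task.Delay(pollInterval, ct).ConfigureAwait(false);
        }
    }
}
```

The cost is at least one extra read per write plus up to `timeout` of added latency on failure, which is the trade-off noted above.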
**Step 5: Build and deploy.**
After making changes:
```bash
ssh windev "cd C:\source\lmxproxy && git pull && dotnet build src\ZB.MOM.WW.LmxProxy.Host -c Release -r win-x86 --no-self-contained -o C:\publish-v2"
```
Restart the v2 service:
```bash
ssh windev "net stop LmxProxyV2 && net start LmxProxyV2"
```
**Step 6: Run the integration tests.**
From the Mac:
```bash
cd tests/ZB.MOM.WW.LmxProxy.Client.IntegrationTests
dotnet test --filter "WriteAndReadBack|WriteBatchAndWait" -v n
```
Both tests should pass. If they do, run the full integration test suite:
```bash
dotnet test -v n
```
**Step 7: Update deviations document.**
If the fix works, update `docs/deviations.md` deviation #7 to reflect the resolution. Change the status from pending to resolved and document what was done.
### Guardrails
1. **Do not reintroduce STA threading.** The MTA/Task.Run approach works for `OnDataChange` callbacks and subscriptions. Do not change the threading model.
2. **Do not modify the integration tests** unless they have a genuine bug (e.g., wrong assertion, wrong tag name). The tests define the expected behavior — fix the implementation to match.
3. **Do not modify the proto file or gRPC contracts.** This is a Host-side implementation fix only.
4. **Keep the OnWriteComplete handler wired.** Even if we don't wait on it, the callback provides diagnostic information (security errors, verified write requirements) that should be logged.
5. **Commit with message:** `fix(lmxproxy): resolve write timeout — bypass OnWriteComplete callback for supervisory writes`


@@ -1,98 +0,0 @@
# LmxProxy v2 Rebuild — Execution Prompt
Run this prompt with Claude Code from the `lmxproxy/` directory to execute all 7 phases of the rebuild autonomously.
## Prompt
You are executing a pre-approved implementation plan for rebuilding the LmxProxy gRPC proxy service. All design decisions have been made and documented. You do NOT need to ask for approval — execute each phase completely, then move to the next.
### Context
Read these documents in order before starting:
1. `docs/plans/2026-03-21-lmxproxy-v2-rebuild-design.md` — the approved design
2. `CLAUDE.md` — project-level instructions
3. `docs/requirements/HighLevelReqs.md` — high-level requirements
4. `docs/requirements/Component-*.md` — all component requirements (10 files)
5. `docs/lmxproxy_updates.md` — authoritative v2 protocol specification
### Execution Order
Execute phases in this exact order. Each phase has a detailed plan in `docs/plans/`:
1. **Phase 1**: `docs/plans/phase-1-protocol-domain-types.md`
2. **Phase 2**: `docs/plans/phase-2-host-core.md`
3. **Phase 3**: `docs/plans/phase-3-host-grpc-security-config.md`
4. **Phase 4**: `docs/plans/phase-4-host-health-metrics.md`
5. **Phase 5**: `docs/plans/phase-5-client-core.md`
6. **Phase 6**: `docs/plans/phase-6-client-extras.md`
7. **Phase 7**: `docs/plans/phase-7-integration-deployment.md`
### How to Execute Each Phase
For each phase:
1. Read the phase plan document completely before writing any code.
2. Read any referenced requirements documents for that phase.
3. Execute each step in the plan in order.
4. After all steps, run `dotnet build` and `dotnet test` to verify.
5. If build or tests fail, fix the issues before proceeding.
6. Commit the phase with message: `feat(lmxproxy): phase N — <description>`
7. Push to remote: `git push`
8. Move to the next phase.
### Guardrails (MUST follow)
1. **Proto is the source of truth** — any wire format question is resolved by reading `src/ZB.MOM.WW.LmxProxy.Host/Grpc/Protos/scada.proto`, not the code-first contracts.
2. **No v1 code in the new build** — the `src-reference/` directory is for reading only. Do not copy-paste and modify; write fresh code guided by the plan.
3. **Cross-stack tests in Phase 1** — Host proto serialize to Client code-first deserialize (and vice versa) must pass before any business logic.
4. **COM calls only on STA dispatch thread** — no `Task.Run` for COM operations. All go through the `StaDispatchThread` dispatch queue.
5. **status_code is canonical for quality**: `symbolic_name` is always derived from lookup, never independently set.
6. **Unit tests before integration** — every phase includes unit tests. Integration tests are Phase 7 only.
7. **Each phase must compile and pass tests** before the next phase begins. Do not skip failing tests.
8. **No string serialization heuristics** — v2 uses native TypedValue. No `double.TryParse` or `bool.TryParse` on values.
9. **Do not modify requirements or design docs** — if you find a conflict, follow the design doc's resolution (section 11).
10. **Do not ask for user approval** — all decisions are pre-approved in the design document.
### Error Recovery
- If a build fails, read the error messages carefully, fix the code, and rebuild.
- If a test fails, fix the implementation (not the test) unless the test has a clear bug.
- If a step in the plan is ambiguous, consult the requirements document for that component.
- If the requirements are ambiguous, consult the design document's resolution table (section 11).
- If you cannot resolve an issue after 3 attempts, skip that step, leave a `// TODO: <description>` comment, and continue.
### Phase 7 Special Instructions
Phase 7 requires SSH access to windev (10.100.0.48). See `windev.md` in the repo root for connection details:
- SSH: `ssh windev` (passwordless)
- Default shell: cmd.exe, use `powershell -Command` for PowerShell
- Git and .NET SDK 10 are installed
- The existing v1 LmxProxy service is at `C:\publish\` on port 50051
For Veeam backups, SSH to the Veeam server:
- SSH: `ssh dohertj2@10.100.0.30` (passwordless)
- Use `Add-PSSnapin VeeamPSSnapin` for Veeam PowerShell
### Commit Messages
Use this format for each phase commit:
- Phase 1: `feat(lmxproxy): phase 1 — v2 protocol types and domain model`
- Phase 2: `feat(lmxproxy): phase 2 — host core (MxAccessClient, SessionManager, SubscriptionManager)`
- Phase 3: `feat(lmxproxy): phase 3 — host gRPC server, security, configuration, service hosting`
- Phase 4: `feat(lmxproxy): phase 4 — host health monitoring, metrics, status web server`
- Phase 5: `feat(lmxproxy): phase 5 — client core (ILmxProxyClient, connection, read/write/subscribe)`
- Phase 6: `feat(lmxproxy): phase 6 — client extras (builder, factory, DI, streaming extensions)`
- Phase 7: `feat(lmxproxy): phase 7 — integration tests, deployment to windev, v1 cutover`
### After All Phases
When all 7 phases are complete:
1. Run `dotnet build ZB.MOM.WW.LmxProxy.slnx` to verify the full solution builds.
2. Run `dotnet test` to verify all unit tests pass.
3. Verify the integration tests passed in Phase 7.
4. Create a final commit if any cleanup was needed.
5. Push all changes.
6. Report: total files created, total tests, build status, integration test results.


@@ -1,95 +0,0 @@
# LmxProxy Requirements Documentation Prompt
Use this prompt with Claude Code to generate the requirements documentation for the LmxProxy project. Run from the `lmxproxy/` directory.
## Prompt
Create requirements documentation for the LmxProxy project in `docs/requirements/`. Follow the same structure used in the ScadaLink project (`docs/requirements/` in the parent repo) — a high-level requirements doc and per-component breakout documents.
### Context
LmxProxy is a gRPC proxy service that bridges SCADA clients to industrial automation systems (primarily AVEVA/Wonderware System Platform via ArchestrA.MXAccess). It consists of two projects:
1. **ZB.MOM.WW.LmxProxy.Host** — A .NET Framework 4.8 Windows service (Topshelf) that runs on the same machine as System Platform. It connects to MXAccess (COM interop, x86) and exposes a gRPC server for remote SCADA operations (read, write, subscribe, batch operations). It handles session management, API key authentication, TLS, health checks, performance metrics, and subscription management.
2. **ZB.MOM.WW.LmxProxy.Client** — A .NET 10 class library providing a typed gRPC client for consuming the LmxProxy service. It uses protobuf-net.Grpc (code-first, no .proto files). It includes connection management, retry policies, TLS support, streaming extensions, DI integration, and a builder pattern for configuration.
### What to Generate
**1. `docs/requirements/HighLevelReqs.md`** — High-level requirements covering:
- System purpose and architecture (proxy pattern, why it exists)
- Deployment model (runs on System Platform machine, clients connect remotely)
- Communication protocol (gRPC, HTTP/2, code-first and proto-based)
- Session lifecycle (connect, session ID, disconnect, no idle timeout)
- Authentication model (API key via metadata header, configurable enforcement)
- TLS/security model (optional TLS, mutual TLS support, certificate validation)
- Data model (VTQ — Value/Timestamp/Quality, OPC-style quality codes)
- Operations (read, read batch, write, write batch, write-and-wait, subscribe)
- Subscription model (server-streaming, tag-based, sampling interval)
- Health monitoring and metrics
- Service hosting (Topshelf Windows service, service recovery)
- Configuration (appsettings.json sections)
- Scale considerations
- Protocol versioning (v1 string-based, v2 OPC UA-aligned typed values)
**2. Component documents** — One `Component-<Name>.md` for each logical component:
- **Component-GrpcServer.md** — The gRPC service implementation (ScadaGrpcService). Session validation, request routing to MxAccessClient, subscription lifecycle, error handling, proto-based serialization.
- **Component-MxAccessClient.md** — The MXAccess COM interop wrapper. Connection lifecycle (Become/Stash-like state machine), tag registration, read/write operations, subscription via advise callbacks, event handling, x86/COM threading constraints. This is the core component.
- **Component-SessionManager.md** — Client session tracking, session creation/destruction, session-to-client mapping, concurrent session limits.
- **Component-Security.md** — API key authentication (ApiKeyService, ApiKeyInterceptor), key file management, role-based permissions (ReadOnly/ReadWrite), TLS certificate management.
- **Component-SubscriptionManager.md** — Tag subscription lifecycle, channel-based update delivery, sampling intervals, backpressure (channel full modes), subscription cleanup on disconnect.
- **Component-Configuration.md** — appsettings.json structure, configuration validation, TLS configuration, service recovery configuration, connection timeouts, retry policies.
- **Component-HealthAndMetrics.md** — Health check service (test tag reads, stale data detection), performance metrics (operation counts, latencies, percentiles), status web server (HTTP status endpoint).
- **Component-ServiceHost.md** — Topshelf service hosting, Program.cs entry point, Serilog logging setup, service install/uninstall, service recovery (Windows SCM restart policies).
- **Component-Client.md** — The LmxProxyClient library. Builder pattern, connection management, retry with Polly, keep-alive pings, streaming extensions, DI registration (ServiceCollectionExtensions), factory pattern, TLS configuration.
- **Component-Protocol.md** — The gRPC protocol specification. Proto definition, code-first contracts (IScadaService), message schemas, VTQ format, quality codes, v1 vs v2 differences.
### Document Structure (per component)
Each component doc must follow this structure exactly:
```
# Component: <Name>
## Purpose
<1-2 sentence description>
## Location
<Which project(s) and key files>
## Responsibilities
<Bulleted list of what this component does>
## <Detail sections>
<Numbered or named sections with specific design details>
## Dependencies
<What this component depends on>
## Interactions
<How this component interacts with others>
```
### Sources
Derive requirements from:
- The source code in `src/ZB.MOM.WW.LmxProxy.Host/` and `src/ZB.MOM.WW.LmxProxy.Client/`
- The protocol docs in `docs/lmxproxy_protocol.md` and `docs/lmxproxy_updates.md`
- The appsettings.json configuration files
### Rules
- Write requirements as design decisions, not aspirational statements. Describe what the system **does**, not what it **should** do.
- Include specific values from configuration (ports, timeouts, intervals, limits).
- Cross-reference between documents using component names.
- Keep the high-level doc focused on system-wide concerns; push implementation details to component docs.
- Do not invent features not present in the source code.


@@ -31,8 +31,8 @@ namespace ZB.MOM.WW.LmxProxy.Host.Configuration
     /// <summary>Health check / probe configuration.</summary>
     public class HealthCheckConfiguration
     {
-        /// <summary>Tag address to probe for connection liveness. Default: TestChildObject.TestBool.</summary>
-        public string TestTagAddress { get; set; } = "TestChildObject.TestBool";
+        /// <summary>Tag address to probe for connection liveness. Default: DevAppEngine.Scheduler.ScanTime.</summary>
+        public string TestTagAddress { get; set; } = "DevAppEngine.Scheduler.ScanTime";
         /// <summary>Probe timeout in milliseconds. Default: 5000.</summary>
         public int ProbeTimeoutMs { get; set; } = 5000;


@@ -20,7 +20,7 @@ namespace ZB.MOM.WW.LmxProxy.Host.Health
         public DetailedHealthCheckService(
             IScadaClient scadaClient,
-            string testTagAddress = "TestChildObject.TestBool")
+            string testTagAddress = "DevAppEngine.Scheduler.ScanTime")
         {
             _scadaClient = scadaClient;
             _testTagAddress = testTagAddress;


@@ -33,7 +33,7 @@
   },
   "HealthCheck": {
-    "TestTagAddress": "TestChildObject.TestBool",
+    "TestTagAddress": "DevAppEngine.Scheduler.ScanTime",
     "ProbeTimeoutMs": 5000,
     "MaxConsecutiveTransportFailures": 3,
     "DegradedProbeIntervalMs": 30000