Commit Graph

5 Commits

Author SHA1 Message Date
Joseph Doherty 53e3973209 Resolve Worker-001, Worker-002, Worker-003 code-review findings
Worker-001: WnWrapAlarmConsumer armed a System.Threading.Timer whose OnPoll
callback ran GetXmlCurrentAlarms2 on a thread-pool thread against the
Apartment-threaded wnwrap COM object, which can deadlock on cross-apartment
marshaling. Removed the pollTimer/pollIntervalMs fields, OnPoll, the
poll-interval constructor parameter, and the timer arm/disposal. Polls are
driven externally by the STA via StaRuntime.InvokeAsync(PollOnce).

Worker-002: RunHeartbeatLoopAsync delayed a full HeartbeatInterval before
the first heartbeat. Restructured so the first beat is sent immediately on
entering the loop and the delay applies only between subsequent beats.

Worker-003: ProcessCommandAsync silently returned without a reply when
_state was not a command-serving state after dispatch. Both drop sites now
log a WorkerCommandResultDropped diagnostic with correlation_id via
IWorkerLogger; _state is now volatile.

Three pre-existing tests that asserted strict frame ordering were updated to
tolerate an interleaved first heartbeat (Worker-002 consequence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:59:46 -04:00
Joseph Doherty a67a5a4857 fix(worker): wire alarm command handler and STA poll loop (Gap 1 + Gap 2)
Gap 1 — WorkerPipeSession now passes `eq => new AlarmCommandHandler(eq)` as
the alarmCommandHandlerFactory in all three places it constructs
MxAccessStaSession (two convenience constructors and InitializeMxAccessAsync).
Previously the parameterless MxAccessStaSession() set the factory to null,
so every SubscribeAlarms / AcknowledgeAlarm / QueryActiveAlarms command
returned "alarm consumer not configured" in a deployed worker.

  - Added internal `MxAccessStaSession(Func<MxAccessEventQueue, IAlarmCommandHandler>?)`
    constructor that builds all defaults but accepts a factory.
  - Added public `MxAccessStaSession(StaRuntime, factory, eventQueue, alarmFactory?)`
    4-arg overload to complete the constructor chain.

Gap 2 — WnWrapAlarmConsumer now disables its internal threadpool Timer
(pollIntervalMilliseconds=0 in the default constructor). MxAccessStaSession
starts a `RunAlarmPollLoopAsync` background task that sleeps off-STA then
calls `staRuntime.InvokeAsync(() => handler.PollOnce())` at 500ms intervals.
This satisfies the ThreadingModel=Apartment requirement of wwAlarmConsumerClass:
every GetXmlCurrentAlarms2 call now runs on the worker's STA.

  - Added `PollOnce()` to `IMxAccessAlarmConsumer`, `AlarmDispatcher`,
    `IAlarmCommandHandler`, and `AlarmCommandHandler`.
  - Poll loop cancelled and awaited before alarm handler disposal in both
    ShutdownGracefullyAsync and Dispose.

Tests: 4 new tests in MxAccessStaSessionTests verify that
  - SubscribeAlarms reaches the handler when the factory is wired (Gap 1)
  - SubscribeAlarms returns InvalidRequest without a factory (regression guard)
  - PollOnce is called on the STA thread within 3s (Gap 2)
  - The poll loop stops after Dispose (Gap 2 lifecycle)
All fake IMxAccessAlarmConsumer / IAlarmCommandHandler test implementations
updated with no-op PollOnce() to satisfy the new interface member.

Worker tests: 199 passed / 1 pre-existing failure / 4 skipped (was 195/1/4).
Server tests: 308 passed / 0 failures (unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 06:30:14 -04:00
Joseph Doherty a4ed605f74 A.3 (live smoke): full alarms-over-gateway pipeline verified end-to-end
Skip-gated AlarmsLiveSmokeTests.Alarms_full_pipeline_round_trip ran
against the dev rig with the flip script firing
TestMachine_001.TestAlarm001 every 10s. Verified:
  - Subscribe + 1st PollOnce yield real transition events
  - Field-by-field decode correct (provider, group, tag, severity,
    UTC timestamp, comment, type)
  - SnapshotActiveAlarms reflects current state
  - AcknowledgeByName(real identity) -> rc=0
  - Pipeline keeps streaming transitions on the 10s cadence post-ack

Three production quirks surfaced and were fixed in
WnWrapAlarmConsumer:

1. SetXmlAlarmQuery is mandatory for reads. Skipping it (per the
   earlier discovery-doc recommendation) makes the first
   GetXmlCurrentAlarms2 fail with E_FAIL. The doc's claim that the
   call is unnecessary because the round-trip echo is mangled was
   wrong — mangled echo or not, the call is required.

2. SetXmlAlarmQuery breaks AlarmAckByName on the same consumer
   instance (returns -55). Workaround: provision a parallel
   "ack-only" wnwrap consumer that runs Initialize → Register →
   Subscribe via the v1-prefixed methods, no SetXmlAlarmQuery.
   Production WnWrapAlarmConsumer now holds two COM clients;
   AcknowledgeByName always dispatches through the ack-only one.

3. AlarmAckByName has v2 (8-arg) and v1 (6-arg) overloads. The v2
   8-arg overload returns -55 on this AVEVA build (apparently a
   stub); the v1 6-arg overload works. Production now calls the
   6-arg overload, discarding the proto's operator_domain and
   operator_full_name fields. The proto contract keeps both for
   forward-compat if AVEVA fixes the v2 method.

Bonus finding (not fixed here): AlarmAckByGUID throws
NotImplementedException on wnwrap. Reference→GUID lookup that we
initially planned to plumb is therefore not viable; all acks must
go through AlarmAckByName. WorkerAlarmRpcDispatcher.AcknowledgeAsync
already routes references through the by-name path, so this only
affects the GUID-input branch (which the worker tries first if the
input parses as a GUID — that branch will surface
NotImplementedException as MxaccessFailure if a client supplies one).

Threading caveat: wnwrap is ThreadingModel=Apartment, so the
consumer's internal Timer (firing on threadpool threads) blocks on
cross-apartment marshaling without an STA message pump. The smoke
test sidesteps this with pollIntervalMilliseconds=0 (Timer disabled)
+ manual PollOnce calls from the test STA. Production hosting will
route polls through the worker's StaRuntime in a follow-up; PollOnce
is now public so the wire-up is straightforward.

Test counts after this slice:
  Worker: 195 pass / 4 skipped (live probes incl. new live smoke) /
          1 pre-existing structure-fail (untouched)
  Server: 308 pass / 0 fail
Solution builds clean.

docs/AlarmClientDiscovery.md "Live smoke-test discoveries" section
records all five findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:17:39 -04:00
Joseph Doherty 4e02927f01 A.3 (alarm-ack-by-name): public AcknowledgeAlarm now accepts Provider!Group.Tag references
Closes the gap where the public AcknowledgeAlarm RPC required canonical
GUIDs but OnAlarmTransitionEvent.AlarmFullReference is "Provider!Group.Tag".
Adds an AVEVA AlarmAckByName path that wraps wwAlarmConsumerClass.AlarmAckByName
so callers can ack with the natural reference.

Proto:
- New MxCommandKind.AcknowledgeAlarmByName (=29).
- New AcknowledgeAlarmByNameCommand(alarm_name, provider_name, group_name,
  comment, operator_user/node/domain/full_name) on MxCommand oneof.
- AcknowledgeAlarmReplyPayload (existing) carries the AVEVA native
  status; reused for the by-name path.

Worker:
- IMxAccessAlarmConsumer + WnWrapAlarmConsumer + AlarmDispatcher +
  AlarmCommandHandler all gain an AcknowledgeByName(name, provider,
  group, comment, operator-identity) overload that maps to
  wwAlarmConsumerClass.AlarmAckByName.
- MxAccessCommandExecutor: new switch arm routes
  MxCommandKind.AcknowledgeAlarmByName to the handler. Empty alarm_name
  yields InvalidRequest; handler exceptions surface as MxaccessFailure.

Gateway:
- WorkerAlarmRpcDispatcher.TryParseAlarmReference: parses
  "Provider!Group.Tag" with the convention that the FIRST '!' separates
  provider, the FIRST '.' after '!' separates group; tag may contain
  more dots.
- AcknowledgeAsync now branches: GUID input → AcknowledgeAlarm command
  (existing path); reference input → AcknowledgeAlarmByName command
  (new path); neither parses → InvalidRequest with a clear diagnostic.

Tests: 13 new unit tests cover each layer end-to-end:
- WorkerAlarmRpcDispatcher.TryParseAlarmReference (3 valid + 8 invalid
  forms) including the realistic 4-component "Galaxy!TestArea.
  TestMachine_001.TestAlarm001" reference.
- WorkerAlarmRpcDispatcher.AcknowledgeAsync routes references through
  AcknowledgeAlarmByName + propagates the full operator tuple.
- Executor switch arm carries the by-name tuple and rejects empty
  alarm_name.
- AlarmDispatcher.AcknowledgeByName forwards to consumer.
- Existing fakes extended for the new overload.

Counts: server 308/0, worker 195/3 skip / 1 pre-existing structure-fail
(untouched). Solution builds clean.

End-to-end alarms-over-gateway now serves the full lmxopcua flow:
client.AcknowledgeAlarm(reference="Galaxy!TestArea.TestMachine_001.TestAlarm001",
operator_user="alice") → gateway parses → IPC AcknowledgeAlarmByName →
worker AlarmAckByName → AVEVA history. The remaining piece for full
parity is a live dev-rig smoke test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:17:15 -04:00
Joseph Doherty f711a55be4 A.2: replace AlarmClientConsumer with wnwrap-based polling consumer
Switch the worker's alarm-consumer surface from `aaAlarmManagedClient.AlarmClient`
to `WNWRAPCONSUMERLib.wwAlarmConsumerClass` (CLSID 7AB52E5F-…) hosted by
`wnwrapConsumer.dll`. The new path returns alarm records as a BSTR XML
payload via `GetXmlCurrentAlarms2`, bypassing the FILETIME→DateTime
auto-marshaling that crashed `GetHighPriAlarm` with
ArgumentOutOfRangeException on every poll. Live captured 60/60 polls
clean against `\DESKTOP-6JL3KKO\Galaxy!DEV` while a System Platform
script flipped TestMachine_001.TestAlarm001 every 10s; the GUID,
priority, state (UNACK_ALM ↔ UNACK_RTN), and ASCII-formatted timestamps
arrived end-to-end.

Implementation:
- `Interop.WNWRAPCONSUMERLib.dll` generated via tlbimp, checked in under
  `lib/` so dev boxes don't need the SDK to build.
- New `WnWrapAlarmConsumer` (replaces `AlarmClientConsumer`): owns a
  500ms polling timer, parses `GetXmlCurrentAlarms2` output, diffs the
  snapshot keyed by alarm GUID, and raises one
  `MxAlarmTransitionEvent` per state change. Includes the
  Initialize→Register-before-Subscribe ordering fix found during
  Discovery probe runs.
- New library-agnostic types `MxAlarmSnapshotRecord` /
  `MxAlarmStateKind` / `MxAlarmTransitionEvent` so the proto-build
  path is testable without an AVEVA install.
- `AlarmRecordTransitionMapper` retired the COM-coupled
  `MapTransitionKind(eAlmTransitions)`; new pure helpers
  `ParseStateKind`, `MapTransition(prev, curr)`, and
  `ParseTransitionTimestampUtc` cover XML decode + state-delta logic.
- `IMxAccessAlarmConsumer` event surface changed from
  `EventHandler<AlarmRecord>` to `EventHandler<MxAlarmTransitionEvent>`
  and `SnapshotActiveAlarms()` returns `MxAlarmSnapshotRecord` —
  decoupling the interface from any specific COM library.
- Worker csproj drops `aaAlarmManagedClient` / `IAlarmMgrDataProvider`
  refs; adds `Interop.WNWRAPCONSUMERLib`.

Tests:
- 36 new unit tests (state-string mapping, prev/current → proto kind
  decision table, timestamp UTC reassembly, XML payload parser, 32-char
  hex GUID round-trip) covering everything that doesn't touch the live
  COM surface — all passing.
- Skip-gated `WnWrapConsumerProbeTests.ProbeWnWrapConsumer` archives
  the live capture flow for regression / future probes.

Docs:
- `docs/AlarmClientDiscovery.md` "Option A — captured" section records
  sample XML payloads, the mangled `SetXmlAlarmQuery` round-trip
  (prefer `Subscribe` for filtering), the `GetStatistics`
  AccessViolationException quirk, and the worker-integration outline.

Pre-existing failure noted (separate):
`MxAccessInteropReference_ExistsOnlyInWorkerProject` was already
failing on HEAD — the test project still references `ArchestrA.MxAccess`
for the Skip-gated discovery probes. Not regressed by this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 09:44:15 -04:00