A.3 (live smoke): full alarms-over-gateway pipeline verified end-to-end

Skip-gated AlarmsLiveSmokeTests.Alarms_full_pipeline_round_trip ran
against the dev rig with the flip script firing
TestMachine_001.TestAlarm001 every 10s. Verified:
  - Subscribe + 1st PollOnce yield real transition events
  - Field-by-field decode correct (provider, group, tag, severity,
    UTC timestamp, comment, type)
  - SnapshotActiveAlarms reflects current state
  - AcknowledgeByName(real identity) -> rc=0
  - Pipeline keeps streaming transitions on the 10s cadence post-ack

Three production quirks surfaced and were fixed in
WnWrapAlarmConsumer:

1. SetXmlAlarmQuery is mandatory for reads. Skipping it (per the
   earlier discovery-doc recommendation) makes the first
   GetXmlCurrentAlarms2 fail with E_FAIL. The doc's claim that the
   call is unnecessary because the round-trip echo is mangled was
   wrong — mangled echo or not, the call is required.

2. SetXmlAlarmQuery breaks AlarmAckByName on the same consumer
   instance (returns -55). Workaround: provision a parallel
   "ack-only" wnwrap consumer that runs Initialize → Register →
   Subscribe via the v1-prefixed methods, no SetXmlAlarmQuery.
   Production WnWrapAlarmConsumer now holds two COM clients;
   AcknowledgeByName always dispatches through the ack-only one.

3. AlarmAckByName has v2 (8-arg) and v1 (6-arg) overloads. The v2
   8-arg overload returns -55 on this AVEVA build (apparently a
   stub); the v1 6-arg overload works. Production now calls the
   6-arg overload, discarding the proto's operator_domain and
   operator_full_name fields. The proto contract keeps both for
   forward-compat if AVEVA fixes the v2 method.

Bonus finding (not fixed here): AlarmAckByGUID throws
NotImplementedException on wnwrap. The reference→GUID lookup that we
initially planned to plumb is therefore not viable; all acks must
go through AlarmAckByName. WorkerAlarmRpcDispatcher.AcknowledgeAsync
already routes references through the by-name path, so this only
affects the GUID-input branch, which the worker tries first if the
input parses as a GUID. If a client supplies a GUID, that branch
surfaces NotImplementedException as MxaccessFailure.

Threading caveat: wnwrap is ThreadingModel=Apartment, so the
consumer's internal Timer (firing on threadpool threads) blocks on
cross-apartment marshaling without an STA message pump. The smoke
test sidesteps this with pollIntervalMilliseconds=0 (Timer disabled)
+ manual PollOnce calls from the test STA. Production hosting will
route polls through the worker's StaRuntime in a follow-up; PollOnce
is now public so the wire-up is straightforward.

Test counts after this slice:
  Worker: 195 pass / 4 skipped (live probes incl. new live smoke) /
          1 pre-existing structure-fail (untouched)
  Server: 308 pass / 0 fail
Solution builds clean.

docs/AlarmClientDiscovery.md "Live smoke-test discoveries" section
records all five findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-01 12:17:39 -04:00
parent 4e02927f01
commit a4ed605f74
3 changed files with 537 additions and 24 deletions
@@ -688,3 +688,105 @@ alarm-consumer surface unblocks A.2 fully. Outline:
These findings retire the open follow-up probes from the
"polling-vs-pump" debate above — `wwAlarmConsumerClass` plus
poll-on-timer is the implementation.
## Live smoke-test discoveries — 2026-05-01
The Skip-gated `AlarmsLiveSmokeTests.Alarms_full_pipeline_round_trip`
ran the full
`WnWrapAlarmConsumer` + `AlarmDispatcher` + `MxAccessAlarmEventSink`
pipeline against the dev rig with the flip script running. End-to-end
verified: 6 real transitions captured on the 10s cadence, ack-by-name
returned rc=0, pipeline stayed healthy through 5 more transitions
afterwards. Three production-relevant quirks surfaced and were fixed
in the consumer:
### 1. `SetXmlAlarmQuery` is mandatory for reads despite the mangled echo
Without `SetXmlAlarmQuery`, the first `GetXmlCurrentAlarms2` call
fails with `E_FAIL` (HRESULT `0x80004005`). The discovery doc above
flagged the round-trip echo as mangled and recommended skipping the
call; that recommendation is **wrong**. The echo *is* mangled (AVEVA
parses NODE/PROVIDER/ALARM_STATE/DISPLAY_MODE incorrectly), but the
call itself is required: it appears to act as a subscription enabler
for reads. Setting the actual filter through the Subscribe call does
not remove the need for `SetXmlAlarmQuery`.
`WnWrapAlarmConsumer.ComposeXmlAlarmQuery(subscription)` decomposes
the canonical `\\<machine>\Galaxy!<area>` form into the XML's
NODE/PROVIDER/GROUP fields. Mangled or not, the call enables reads.
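The decomposition described above can be sketched as follows. This is an illustration in Python, not the production `ComposeXmlAlarmQuery`. The function name, the regular expression, and the `DEVRIG` machine name are assumptions made for the example, and the sketch stops at the three extracted fields because the XML layout itself is not shown here.

```python
import re
from typing import NamedTuple


class AlarmQueryParts(NamedTuple):
    node: str      # machine name from the \\<machine> prefix
    provider: str  # text between the single backslash and '!'
    group: str     # text after '!' (the area)


# Hypothetical pattern for the canonical \\<machine>\<provider>!<area> form.
_CANONICAL = re.compile(
    r"^\\\\(?P<node>[^\\]+)\\(?P<provider>[^!\\]+)!(?P<group>.+)$"
)


def split_subscription(expression: str) -> AlarmQueryParts:
    """Split a canonical subscription expression into NODE/PROVIDER/GROUP."""
    match = _CANONICAL.match(expression)
    if match is None:
        raise ValueError(f"not in canonical form: {expression!r}")
    return AlarmQueryParts(match["node"], match["provider"], match["group"])


# Example with a made-up machine name:
parts = split_subscription(r"\\DEVRIG\Galaxy!TestArea")
```

Rejecting non-canonical input up front keeps a malformed expression from reaching the COM call, where the failure would be harder to diagnose.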
### 2. Two consumers required: read-side vs. ack-side
`SetXmlAlarmQuery` enables reads but **breaks `AlarmAckByName` on
the same consumer instance**. With `SetXmlAlarmQuery` applied,
`AlarmAckByName` returns -55 even with a valid name, provider, group,
and operator. Without it, `AlarmAckByName` succeeds with rc=0.
The production consumer therefore provisions **two** wnwrap COM
instances:
- Primary consumer (`client`): runs full lifecycle including
`SetXmlAlarmQuery` for `GetXmlCurrentAlarms2` polls.
- Ack-only consumer (`ackClient`): runs Initialize → Register →
Subscribe via the v1-prefixed methods, **no SetXmlAlarmQuery**.
All `AcknowledgeByName` calls dispatch through this instance.
Both consumers subscribe to the same expression. Disposal cleans up
both via a shared `ReleaseConsumerCom` helper.
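A minimal sketch of the two-instance arrangement, in Python for illustration only. `FakeWnWrapClient` is a hypothetical stand-in that reproduces the behaviour observed on the dev rig; it is not the wnwrap COM API, and the method names are assumptions.

```python
class FakeWnWrapClient:
    """Hypothetical stand-in mimicking the observed quirk (not the real COM API)."""

    def __init__(self):
        self.xml_query_set = False
        self.released = False

    def set_xml_alarm_query(self, xml):
        self.xml_query_set = True

    def alarm_ack_by_name(self, name, provider, group, comment, operator, node):
        # Observed quirk: acks return -55 once the XML query has been applied.
        return -55 if self.xml_query_set else 0

    def release(self):
        self.released = True


class DualClientConsumer:
    """Holds one read-side client and one ack-only client."""

    def __init__(self, client_factory, xml_query):
        self.client = client_factory()       # read side: gets the XML query
        self.client.set_xml_alarm_query(xml_query)
        self.ack_client = client_factory()   # ack side: never sees the XML query

    def acknowledge_by_name(self, *args):
        # Acks always go through the instance without the XML query.
        return self.ack_client.alarm_ack_by_name(*args)

    def dispose(self):
        # One shared release path for both instances.
        for com_client in (self.client, self.ack_client):
            com_client.release()
```

Keeping the ack-only client behind a single `acknowledge_by_name` entry point means callers cannot accidentally ack through the read-side instance.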
### 3. `AlarmAckByName` v2 8-arg vs. v1 6-arg
`wwAlarmConsumerClass` exposes two `AlarmAckByName` overloads:
- `IwwAlarmConsumer2` v2: 8 args (`name, provider, group, comment,
oprName, node, domainName, oprFullName`).
- `IwwAlarmConsumer` v1: 6 args (no domain, no full-name).
The v2 8-arg method returns -55 on this AVEVA build regardless of
operator-identity inputs, which suggests it is a stub. The v1 6-arg
method works. Production `WnWrapAlarmConsumer.AcknowledgeByName` calls
the 6-arg overload and discards the proto's `domain` + `full_name`
fields. The proto contract keeps both fields for forward compatibility
in case AVEVA fixes the v2 method later.
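The narrowing from the 8-field request to the 6-arg call might look like the sketch below (Python, illustrative only). The `AckByNameRequest` type and its field names are hypothetical, and the v1 argument order is assumed to be the first six v2 arguments in the same order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AckByNameRequest:
    """Hypothetical request shape carrying all eight v2 fields."""
    name: str
    provider: str
    group: str
    comment: str
    operator_name: str
    node: str
    operator_domain: str = ""     # kept for forward compatibility, unused by v1
    operator_full_name: str = ""  # kept for forward compatibility, unused by v1


def to_v1_args(request: AckByNameRequest) -> tuple:
    """Return the six arguments for the v1 overload, dropping domain and full name.

    Assumption: v1 takes the first six v2 arguments in the same order.
    """
    return (
        request.name,
        request.provider,
        request.group,
        request.comment,
        request.operator_name,
        request.node,
    )
```

Because the request type still carries the two extra fields, switching back to the 8-arg call later would only touch this one mapping function.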
### 4. `AlarmAckByGUID` is not implemented
The v2 `AlarmAckByGUID(VBGUID, …)` throws `NotImplementedException`
(COM `E_NOTIMPL`) on `wwAlarmConsumerClass` against this AVEVA
build. The reference→GUID lookup that we initially planned to wire
through `AlarmAckByGUID` is therefore not viable on wnwrap; all acks
must go through `AlarmAckByName`.
The proto `AcknowledgeAlarmCommand` (GUID-based) and the worker's
`MxAccessCommandExecutor.ExecuteAcknowledgeAlarm` switch arm remain
in the codebase to preserve the forward-compatible shape, but the
gateway-side `WorkerAlarmRpcDispatcher.AcknowledgeAsync` now routes
through `AcknowledgeAlarmByName` whenever the public RPC supplies a
recognizable `Provider!Group.Tag` reference.
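The GUID-versus-reference decision can be sketched as below (Python, illustrative only). The function name is hypothetical, and the split rule is an assumption: the group is taken to be the segment between `!` and the first dot, with everything after that dot treated as the tag.

```python
import uuid


def classify_ack_target(target: str) -> tuple:
    """Classify an ack input as a GUID or a Provider!Group.Tag reference."""
    try:
        # GUID-shaped input would take the (currently unimplemented) GUID path.
        return ("guid", str(uuid.UUID(target)))
    except ValueError:
        pass

    provider, bang, rest = target.partition("!")
    group, dot, tag = rest.partition(".")
    if not (bang and dot and provider and group and tag):
        raise ValueError(f"unrecognized alarm reference: {target!r}")
    return ("by_name", provider, group, tag)


# The reference seen in the capture summary classifies as a by-name ack:
route = classify_ack_target("Galaxy!TestArea.TestMachine_001.TestAlarm001")
```

Classifying the input before any COM call makes it possible to reject GUID input with a clear error instead of surfacing `NotImplementedException` from the worker.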
### 5. STA / threading — production fix needed
The wnwrap COM component is `ThreadingModel=Apartment`. The consumer's
internal `Timer` fires on threadpool threads and would block forever
on cross-apartment marshaling unless the host STA pumps Win32
messages. The smoke test sidesteps this by setting
`pollIntervalMilliseconds=0` (Timer disabled) and driving `PollOnce`
manually from the test's STA. Production hosting will route polls
through the worker's `StaRuntime` in a follow-up. The consumer's
`PollOnce` is `public` and idempotent, so the wire-up is mechanical.
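The idea of routing every poll through one owning thread can be sketched as below. This is a Python illustration of thread affinity only: it has no COM apartments or Win32 message pump, and `SingleThreadRuntime` is a hypothetical stand-in, not the worker's `StaRuntime`.

```python
import queue
import threading
from concurrent.futures import Future


class SingleThreadRuntime:
    """Runs every submitted callable on one dedicated thread (an STA stand-in)."""

    def __init__(self):
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._work.get()
            if item is None:  # shutdown sentinel
                return
            future, func = item
            try:
                future.set_result(func())
            except Exception as exc:
                future.set_exception(exc)

    def invoke(self, func):
        """Queue func for the owning thread and block until it completes."""
        future = Future()
        self._work.put((future, func))
        return future.result()

    def shutdown(self):
        self._work.put(None)
        self._thread.join()
```

A timer callback on any thread would then call something like `runtime.invoke(consumer.poll_once)` (a hypothetical method name), so the poll always executes on the thread that owns the COM objects.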
### Capture summary
```
Transition: kind=Clear ref='Galaxy!TestArea.TestMachine_001.TestAlarm001' …
Transition: kind=Raise ref='Galaxy!TestArea.TestMachine_001.TestAlarm001' …
SnapshotActiveAlarms count=1
active: ref='Galaxy!TestArea.TestMachine_001.TestAlarm001' state=Active
AcknowledgeByName(real identity) -> rc=0
Post-ack transition: kind=Clear …
+1: kind=Raise … (10s after ack)
+2: kind=Clear … (20s)
+3: kind=Raise … (30s)
+4: kind=Clear … (40s)
```
10s cadence held throughout; full proto fields populated correctly;
ack registered server-side without errors.