docs(audit): apply per-cluster judgment fixes across living docs

Resolve audit findings: correct WorkerEnvelope proto/route/metric/session
facts; rewrite auth (ZB.MOM.WW.Auth migration), dashboard (ZB.MOM.WW.Theme),
and StyleGuide (foreign-project copy-paste); document alarm subsystem, Ldap
options, and gateway alarm broker; fix client CLI flags and package paths.
Joseph Doherty
2026-06-03 16:01:28 -04:00
parent f84e0c3474
commit e541339c07
29 changed files with 1102 additions and 432 deletions
+187 -38
@@ -67,9 +67,17 @@ list.
## What this means
The architecture comment on
`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/AlarmClientConsumer.cs` (PR A.5) is
**wrong against this deployed assembly**:
> **Historical note (current as built).** This discovery record predates the
> as-built alarm path. The `AlarmClientConsumer.cs` file referenced below was
> retired; the production consumer is
> `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs` (driven by the
> `wwAlarmConsumerClass` COM surface — see [Option A](#option-a--captured-2026-05-01)
> below). The current public RPC surface and broker architecture are summarized
> in [Current alarm path (as built)](#current-alarm-path-as-built) at the end of
> this document; the sections in between are kept as a discovery record.
The architecture comment on the (now-retired) `AlarmClientConsumer.cs` (PR A.5)
was **wrong against this deployed assembly**:
> "The AVEVA alarm-manager surface (`IAlarmMgrDataProvider`) exposes
> the events we need as plain .NET events — no Windows message pump
@@ -601,8 +609,14 @@ returned to normal but is unacknowledged — i.e., visible in the
"current alarms" list because the operator hasn't acked it yet) and
`UNACK_ALM` (the alarm is currently active and unacknowledged).
The other states from `eAlmState` (`ACK_RTN`, `ACK_ALM`) would
appear when an ack is performed — `wwAlarmConsumerClass.AlarmAckByGUID`
is the method to call.
appear when an ack is performed.
> **Forward reference / superseded:** an earlier draft named
> `wwAlarmConsumerClass.AlarmAckByGUID` as the ack method. That call turned out
> to be **`E_NOTIMPL`** on this AVEVA build (see
> [`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented)
> below). The as-built ack path is the v1 6-arg `AlarmAckByName` on a dedicated
> ack-only consumer instance. Do not wire acks through `AlarmAckByGUID`.
### `GetStatistics` AV — unrelated quirk
@@ -638,20 +652,25 @@ alarm-consumer surface unblocks A.2 fully. Outline:
payload; diff against the previous snapshot (keyed by
`GUID`); emit `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
events for added/changed/removed records.
- `AlarmAckByGUID(VBGUID, comment, oprName, node, domain,
fullName)` for client-driven acknowledgements (matches
PR A.5's `AlarmAckCommand` payload).
- Client-driven acknowledgements. (This draft named `AlarmAckByGUID` and a
`AlarmAckCommand` payload; as built the ack proto is
`AcknowledgeAlarmCommand` / `AcknowledgeAlarmByNameCommand`, the consumer
interface method is `AcknowledgeByGuid` / `AcknowledgeByName`, and the GUID
path is `E_NOTIMPL` so only the by-name path runs — see
[`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented).)
- Lifecycle teardown: `DeregisterConsumer` +
`UninitializeConsumer` + `Marshal.FinalReleaseComObject`.
3. **Conversion layer:** map XML record fields to
`MxAlarmConditionRecord` proto:
- `GUID` → `condition_id` (canonicalize the no-dashes hex
to a UUID string).
- `STATE` enum → `inAlarm` + `acked` booleans
(`UNACK_ALM` → in_alarm=true, acked=false;
`UNACK_RTN` → in_alarm=false, acked=false;
`ACK_ALM` → in_alarm=true, acked=true;
`ACK_RTN` → in_alarm=false, acked=true).
3. **Conversion layer:** map XML record fields to the alarm proto:
- `GUID` and `PROVIDER_NAME!GROUP.TAGNAME` → `alarm_full_reference` (there is
no `condition_id` field; the public RPC and worker carry the reference as
`alarm_full_reference`, either a canonical GUID or `Provider!Group.Tag`).
- `STATE` → `AlarmConditionState` on `ActiveAlarmSnapshot.current_state`
(this draft used `inAlarm` + `acked` booleans, which the proto does not
have). As built, the snapshot state collapses to three values:
`UNACK_ALM` → `Active`; `ACK_ALM` → `ActiveAcked`; `UNACK_RTN` and
`ACK_RTN` both → `Inactive` (a returned-to-normal alarm is no longer
"active"). For the live `transition` feed the `STATE` instead drives an
`AlarmTransitionKind` (`Raise` / `Acknowledge` / `Clear`).
- `DATE + TIME + GMTOFFSET + DSTADJUST` → reassemble UTC
timestamp; matches the worker's existing `Timestamp`
wire format.
@@ -663,10 +682,14 @@ alarm-consumer surface unblocks A.2 fully. Outline:
`aaAlarmManagedClient`, also true here). The existing
`AlarmClientConsumer` skips Initialize entirely; the new
`WnWrapAlarmConsumer` includes it from day one.
5. **Test reuse:** PR A.5's snapshot/ack contract tests can
stay — they don't touch the underlying COM API. Add a new
integration test against the wnwrap surface (live-AVEVA-only,
Skip-gated like the probe).
5. **Test reuse:** the snapshot/ack contract tests stayed — they don't touch
the underlying COM API. As built, the alarm tests live under
`src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/` (`AlarmDispatcherTests`,
`AlarmRecordTransitionMapperTests`, `AlarmCommandHandlerTests`,
`AlarmCommandExecutorTests`, `WnWrapAlarmConsumerXmlTests`), with the
live-AVEVA-only round-trip in
`src/ZB.MOM.WW.MxGateway.Worker.Tests/Probes/AlarmsLiveSmokeTests.cs`
(Skip-gated like the probe).
### Settled API-ordering and surface knowledge
@@ -752,26 +775,47 @@ AVEVA fixes the v2 method later.
The v2 `AlarmAckByGUID(VBGUID, …)` throws `NotImplementedException`
(COM `E_NOTIMPL`) on `wwAlarmConsumerClass` against this AVEVA
build. The reference→GUID lookup that we initially planned to wire
through `AlarmAckByGUID` is therefore not viable on wnwrap; all acks
must go through `AlarmAckByName`.
through `AlarmAckByGUID` is therefore not viable on wnwrap; only the
by-name path actually succeeds.
The proto `AcknowledgeAlarmCommand` (GUID-based) and the worker's
`MxAccessCommandExecutor.ExecuteAcknowledgeAlarm` switch arm remain
in the codebase for the forward-compat shape, but the gateway-side
`WorkerAlarmRpcDispatcher.AcknowledgeAsync` now always routes through
`AcknowledgeAlarmByName` when the public RPC supplies a recognizable
`Provider!Group.Tag` reference.
**Routing as built (and the GUID hazard).** The gateway-side router is
`GatewayAlarmMonitor.BuildAcknowledgeCommand` (there is no
`WorkerAlarmRpcDispatcher` type). Routing is **conditional on the reference
shape**, not unconditional:
### 5. STA / threading — production fix needed
- A reference that `Guid.TryParse` accepts is built into
`MxCommandKind.AcknowledgeAlarm` / `AcknowledgeAlarmCommand` — the **GUID
path**, which the worker dispatches to `AlarmAckByGUID`.
- A `Provider!Group.Tag` reference (parsed by
`GatewayAlarmMonitor.TryParseAlarmReference`) is built into
`MxCommandKind.AcknowledgeAlarmByName` / `AcknowledgeAlarmByNameCommand` — the
by-name path, which is the only one that succeeds on this build.
- Anything else fails with an `alarm_full_reference` parse error before any
worker call.
The wnwrap COM is `ThreadingModel=Apartment`. The consumer's
internal `Timer` fires on threadpool threads and would block forever
on cross-apartment marshaling unless the host STA pumps Win32
messages. The smoke test sidesteps this by setting
`pollIntervalMilliseconds=0` (Timer disabled) and driving `PollOnce`
manually from the test's STA. Production hosting will route polls
through the worker's `StaRuntime` in a follow-up — the consumer's
`PollOnce` is `public` and idempotent so the wire-up is mechanical.
The GUID arm is **still dispatched unguarded**: the proto
`AcknowledgeAlarmCommand` and the worker's
`MxAccessCommandExecutor.ExecuteAcknowledgeAlarm` switch arm remain in the
codebase for forward compatibility, and `BuildAcknowledgeCommand` routes a
GUID-shaped reference straight to them. On the deployed wnwrap build that path
hits the `E_NOTIMPL` `AlarmAckByGUID` and surfaces a `COMException` rather than
acknowledging. **Practical guidance:** acknowledge with the
`Provider!Group.Tag` reference (the same form the transition feed emits in
`alarm_full_reference`), not a raw GUID, until the GUID arm is either guarded or
AVEVA implements `AlarmAckByGUID`.
### 5. STA / threading
The wnwrap COM is `ThreadingModel=Apartment`, so every consumer call
(`Subscribe`, `PollOnce`, the `AcknowledgeBy*` methods) must run on the STA that
created the COM instance. As built, `WnWrapAlarmConsumer` owns **no internal
timer and takes no `pollIntervalMilliseconds` parameter** — an earlier draft
described a self-driven `Timer` that would have blocked on cross-apartment
marshaling, but that design was dropped. Instead, `PollOnce()` is a `public`,
idempotent method the host drives on the worker's STA (via
`StaRuntime.InvokeAsync(() => consumer.PollOnce())`); the poll cadence lives in
the host, not the consumer. Each `PollOnce` reads `GetXmlCurrentAlarms2`, diffs
against the previous snapshot, and emits transition events.
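
The poll-and-diff step can be illustrated with a minimal sketch. It is written in Python purely for illustration (the worker itself is C#), and the function name, the dictionary shape, and the returned tuple are assumptions, not the `WnWrapAlarmConsumer` API. It shows only the diff keyed by `GUID`; mapping each difference to a transition kind is left out.

```python
def diff_snapshots(previous, current):
    """Compare two alarm snapshots keyed by GUID.

    Both arguments are dicts of GUID -> record (hypothetical shape).
    Returns (added, changed, removed) as sorted lists of GUIDs.
    """
    added = sorted(g for g in current if g not in previous)
    changed = sorted(
        g for g in current if g in previous and current[g] != previous[g]
    )
    removed = sorted(g for g in previous if g not in current)
    return added, changed, removed


# One poll: the host would call this on the STA after reading the XML payload.
previous = {"g1": {"STATE": "UNACK_ALM"}, "g2": {"STATE": "UNACK_ALM"}}
current = {"g2": {"STATE": "ACK_ALM"}, "g3": {"STATE": "UNACK_ALM"}}
added, changed, removed = diff_snapshots(previous, current)
```

Because the diff depends only on the two snapshots, calling it twice with the same input yields no new differences on the second call, which is consistent with `PollOnce` being described as idempotent.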
### Capture summary
@@ -790,3 +834,108 @@ Post-ack transition: kind=Clear …
10s cadence held throughout; full proto fields populated correctly;
ack registered server-side without errors.
## Current alarm path (as built)
The sections above are a discovery record. This section summarizes the path that
actually ships, grounded in the current code. For the proto shapes see
[Contracts](./Contracts.md#alarm-rpcs-and-messages); for the server handlers see
[gRPC](./Grpc.md); for configuration see
[Gateway Configuration](./GatewayConfiguration.md#alarm-options).
### Public RPCs and configuration
Alarms are exposed through three **session-less** RPCs on `MxAccessGateway`:
`AcknowledgeAlarm`, `StreamAlarms`, and `QueryActiveAlarms`. No client opens a
worker session to use them. They are gated by `MxGateway:Alarms:*`:
- `MxGateway:Alarms:Enabled` (default `false`) turns the whole subsystem on.
- `MxGateway:Alarms:SubscriptionExpression` is the canonical
`\\<machine>\Galaxy!<area>` subscription; when empty, the monitor falls back
to `\\<MachineName>\Galaxy!<DefaultArea>` from `MxGateway:Alarms:DefaultArea`.
Setting `Enabled` to `true` while both are empty faults the monitor with a configuration diagnostic.
- `MxGateway:Alarms:ReconcileIntervalSeconds` (default 30, floored at 5) sets the
reconcile cadence below.
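
The fallback and floor rules above can be sketched as follows. This is an illustrative Python sketch, not the gateway's C# code; the function names and the exception type are assumptions, and only the empty-string case is treated as "empty".

```python
RECONCILE_FLOOR_SECONDS = 5  # floor stated in the text above


def resolve_subscription(expression, default_area, machine_name):
    """Pick the subscription expression, falling back to the default area.

    Raises ValueError when neither value is configured, standing in for the
    configuration diagnostic that faults the monitor.
    """
    if expression:
        return expression
    if default_area:
        # Builds \\<MachineName>\Galaxy!<DefaultArea>
        return "\\\\" + machine_name + "\\Galaxy!" + default_area
    raise ValueError(
        "alarms enabled but SubscriptionExpression and DefaultArea are both empty"
    )


def effective_reconcile_interval(configured_seconds):
    """Apply the 5-second floor to ReconcileIntervalSeconds."""
    return max(RECONCILE_FLOOR_SECONDS, configured_seconds)
```

For example, `resolve_subscription("", "Area1", "HOST")` yields `\\HOST\Galaxy!Area1`, and a configured interval of 2 seconds is raised to 5.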
### The always-on `GatewayAlarmMonitor` broker
`GatewayAlarmMonitor` (`src/ZB.MOM.WW.MxGateway.Server/Alarms/GatewayAlarmMonitor.cs`)
is registered by `AddGatewayAlarms` as a singleton, as the `IGatewayAlarmService`,
and as a hosted `BackgroundService`. When `Enabled`, it:
1. Opens **one** gateway-managed worker session dedicated to alarms (client name
`gateway-alarm-monitor`, backend `Galaxy`), after a brief startup grace so
worker launching and orphan cleanup settle.
2. Subscribes that session to the resolved subscription expression and feeds an
in-process active-alarm cache (`Dictionary<reference, ActiveAlarmSnapshot>`)
from the session's transition events.
3. Fans the feed out to **any number** of `StreamAlarms` subscribers — clients
never open their own session. The session is transparently re-opened with a
5-second backoff if the worker faults.
### `AlarmFeedMessage` stream protocol
`StreamAsync` (behind `StreamAlarms`) emits, in order:
1. one `AlarmFeedMessage { active_alarm }` per currently-cached alarm matching
the optional `alarm_filter_prefix`,
2. a single `AlarmFeedMessage { snapshot_complete = true }` sentinel,
3. then one `AlarmFeedMessage { transition }` per live change.
The subscriber is registered under the monitor lock **before** the snapshot is
taken, so no transition can slip between the snapshot and the live tail.
`QueryActiveAlarms` reuses the same cache but emits only the `active_alarm`
snapshots and completes — no sentinel, no transitions.
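
The ordering guarantee can be sketched in a few lines. This is a simplified Python model, not the `GatewayAlarmMonitor` implementation: the class, method names, and dict-shaped messages are assumptions. It registers the subscriber and copies the cache under one lock, so a transition published concurrently lands either in the snapshot or in the live queue, never in neither. Whether live transitions are also filtered by prefix is not modeled here.

```python
import queue
import threading


class AlarmBroker:
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}        # reference -> active-alarm snapshot
        self._subscribers = []  # one live queue per stream subscriber

    def publish(self, reference, transition, snapshot):
        """Update the cache and fan the transition out to every subscriber."""
        with self._lock:
            if snapshot is None:
                self._cache.pop(reference, None)
            else:
                self._cache[reference] = snapshot
            for live in self._subscribers:
                live.put({"transition": transition})

    def subscribe(self, prefix=""):
        """Return (initial messages, live queue) for one stream subscriber."""
        live = queue.Queue()
        with self._lock:
            # Register first, then snapshot, both under the same lock.
            self._subscribers.append(live)
            initial = [
                {"active_alarm": snap}
                for ref, snap in self._cache.items()
                if ref.startswith(prefix)
            ]
        initial.append({"snapshot_complete": True})
        return initial, live
```

A client would send every message in `initial`, then forward messages from `live` as they arrive.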
### Reconcile loop
A `PeriodicTimer` runs `ReconcileAsync` every
`max(5, ReconcileIntervalSeconds)` seconds. It pulls the worker's authoritative
active-alarm snapshot and replaces the cache, broadcasting a synthetic `Clear`
transition for any cached alarm the snapshot no longer contains and a synthetic
`Raise` for any alarm the snapshot adds. This catches transitions the live
poll-and-diff feed missed (e.g. across a transport blip). A failed reconcile
pass logs at Debug and keeps the current cache.
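
A reconcile pass of this kind reduces to a set difference between the cached references and the authoritative ones. The sketch below is an illustration in Python under that assumption; the function name and the tuple-shaped transitions are hypothetical and do not mirror `ReconcileAsync`.

```python
def reconcile(cache, authoritative):
    """Replace `cache` in place with `authoritative`.

    Both are dicts of alarm reference -> snapshot. Returns the synthetic
    transitions to broadcast: Clear for references that disappeared,
    Raise for references that appeared.
    """
    transitions = []
    for ref in sorted(cache.keys() - authoritative.keys()):
        transitions.append(("Clear", ref))
    for ref in sorted(authoritative.keys() - cache.keys()):
        transitions.append(("Raise", ref))
    cache.clear()
    cache.update(authoritative)
    return transitions
```

References present on both sides produce no synthetic transition in this sketch; only the cache entry is refreshed.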
### Subscriber backpressure
Each subscriber gets a bounded channel of **2048** messages
(`SubscriberQueueCapacity`). When `Broadcast` cannot write to a subscriber (its
channel is full), that subscriber is **completed with an error and dropped** —
the error message tells the client to reconnect to re-snapshot. Backpressure
from one slow consumer never blocks the broker or other subscribers.
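
The drop-on-full policy can be modeled with a non-blocking write to a bounded queue. The following Python sketch is illustrative only: the `Subscriber` class, the `broadcast` signature, and the error text are assumptions rather than the gateway's C# types.

```python
import queue

SUBSCRIBER_QUEUE_CAPACITY = 2048  # capacity stated in the text above


class Subscriber:
    def __init__(self, capacity=SUBSCRIBER_QUEUE_CAPACITY):
        self.queue = queue.Queue(maxsize=capacity)
        self.error = None  # set when the subscriber is dropped


def broadcast(subscribers, message):
    """Offer `message` to every subscriber without blocking.

    A subscriber whose queue is full is marked with an error and left out
    of the returned list of surviving subscribers.
    """
    survivors = []
    for sub in subscribers:
        try:
            sub.queue.put_nowait(message)
            survivors.append(sub)
        except queue.Full:
            sub.error = "alarm stream fell behind; reconnect to re-snapshot"
    return survivors
```

The write never waits, so the cost of a slow consumer is paid only by that consumer, which must reconnect and take a fresh snapshot.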
### Snapshot state collapse
`ActiveAlarmSnapshot.current_state` carries only three `AlarmConditionState`
values, so the four AVEVA `STATE`s collapse: `UNACK_ALM` → `Active`,
`ACK_ALM` → `ActiveAcked`, and both `UNACK_RTN` and `ACK_RTN` → `Inactive`
(`AlarmDispatcher`). A returned-to-normal alarm is reported as `Inactive` in a
snapshot even though it is still listed because it is unacknowledged. The live
`transition` feed instead reports `AlarmTransitionKind` (`Raise` / `Acknowledge`
/ `Clear`).
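
As a compact restatement, the collapse is a four-to-three lookup. The table below is a Python illustration of the mapping described above; the dictionary and function names are hypothetical, and the values are plain strings standing in for the `AlarmConditionState` enum.

```python
# AVEVA STATE -> snapshot state, as described in the text above.
SNAPSHOT_STATE = {
    "UNACK_ALM": "Active",
    "ACK_ALM": "ActiveAcked",
    "UNACK_RTN": "Inactive",
    "ACK_RTN": "Inactive",
}


def snapshot_state(aveva_state):
    """Look up the snapshot state; an unknown STATE raises KeyError."""
    return SNAPSHOT_STATE[aveva_state]
```

The mapping is lossy: a consumer that sees `Inactive` cannot tell from the snapshot alone whether the alarm was acknowledged.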
### `alarm_full_reference` parse contract
`AcknowledgeAlarm` accepts either form in `alarm_full_reference`
(`GatewayAlarmMonitor.BuildAcknowledgeCommand`):
- a canonical GUID (`Guid.TryParse`) → GUID ack path
(`AcknowledgeAlarmCommand`), which on the deployed wnwrap build hits the
`E_NOTIMPL` `AlarmAckByGUID` — see
[`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented);
- a `Provider!Group.Tag` reference (`TryParseAlarmReference`: first `!` splits
provider from `Group.Tag`, the first `.` after the `!` splits group from tag)
→ by-name ack path (`AcknowledgeAlarmByNameCommand`), the path that works;
- anything else → a parse error before any worker call.
The transition feed emits the `Provider!Group.Tag` form in
`alarm_full_reference`, so echoing that value back into `AcknowledgeAlarm` takes
the working by-name path.
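
The parse contract can be sketched as a small classifier. This is a Python approximation, not `BuildAcknowledgeCommand`: `uuid.UUID` stands in for `Guid.TryParse` (the two do not accept exactly the same set of strings), the function names are hypothetical, and rejecting empty provider, group, or tag parts is an assumption.

```python
import uuid


def parse_alarm_reference(reference):
    """Split 'Provider!Group.Tag' into (provider, group, tag), or return None.

    The first '!' separates the provider; the first '.' after it separates
    the group from the tag, so the tag itself may contain further dots.
    """
    bang = reference.find("!")
    if bang <= 0:
        return None
    provider, rest = reference[:bang], reference[bang + 1:]
    dot = rest.find(".")
    if dot <= 0 or dot == len(rest) - 1:
        return None
    return provider, rest[:dot], rest[dot + 1:]


def route_acknowledge(reference):
    """Return (command kind, payload) or raise ValueError on a parse error."""
    try:
        uuid.UUID(reference)  # approximation of Guid.TryParse
        return ("AcknowledgeAlarm", reference)
    except ValueError:
        pass
    parts = parse_alarm_reference(reference)
    if parts is not None:
        return ("AcknowledgeAlarmByName", parts)
    raise ValueError("alarm_full_reference is neither a GUID nor Provider!Group.Tag")
```

In this sketch `route_acknowledge("Galaxy!Area1.Tank.Level.HiHi")` selects the by-name command with the tag `Tank.Level.HiHi`, while a GUID-shaped string selects the GUID command that fails on the deployed build.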
### Reserved / unused
`AlarmTransitionKind.RETRIGGER` is defined in the proto but is **not currently
produced** — the transition mapper emits only `Raise` / `Acknowledge` / `Clear`.
It is reserved for a future "re-raise from a previously cleared condition"
distinction.