docs(audit): apply per-cluster judgment fixes across living docs
Resolve audit findings: correct WorkerEnvelope proto/route/metric/session facts; rewrite auth (ZB.MOM.WW.Auth migration), dashboard (ZB.MOM.WW.Theme), and StyleGuide (foreign-project copy-paste); document alarm subsystem, Ldap options, and gateway alarm broker; fix client CLI flags and package paths.
+187
-38
@@ -67,9 +67,17 @@ list.
## What this means

The architecture comment on
`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/AlarmClientConsumer.cs` (PR A.5) is
**wrong against this deployed assembly**:
> **Historical note (current as built).** This discovery record predates the
> as-built alarm path. The `AlarmClientConsumer.cs` file referenced below was
> retired; the production consumer is
> `src/ZB.MOM.WW.MxGateway.Worker/MxAccess/WnWrapAlarmConsumer.cs` (driven by the
> `wwAlarmConsumerClass` COM surface — see [Option A](#option-a--captured-2026-05-01)
> below). The current public RPC surface and broker architecture are summarized
> in [Current alarm path (as built)](#current-alarm-path-as-built) at the end of
> this document; the sections in between are kept as a discovery record.

The architecture comment on the (now-retired) `AlarmClientConsumer.cs` (PR A.5)
was **wrong against this deployed assembly**:

> "The AVEVA alarm-manager surface (`IAlarmMgrDataProvider`) exposes
> the events we need as plain .NET events — no Windows message pump
@@ -601,8 +609,14 @@ returned to normal but is unacknowledged — i.e., visible in the
"current alarms" list because the operator hasn't acked it yet) and
`UNACK_ALM` (the alarm is currently active and unacknowledged).
The other states from `eAlmState` (`ACK_RTN`, `ACK_ALM`) would
appear when an ack is performed — `wwAlarmConsumerClass.AlarmAckByGUID`
is the method to call.
appear when an ack is performed.

> **Forward reference / superseded:** an earlier draft named
> `wwAlarmConsumerClass.AlarmAckByGUID` as the ack method. That call turned out
> to be **`E_NOTIMPL`** on this AVEVA build (see
> [`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented)
> below). The as-built ack path is the v1 6-arg `AlarmAckByName` on a dedicated
> ack-only consumer instance. Do not wire acks through `AlarmAckByGUID`.

### `GetStatistics` AV — unrelated quirk

@@ -638,20 +652,25 @@ alarm-consumer surface unblocks A.2 fully. Outline:
payload; diff against the previous snapshot (keyed by
`GUID`); emit `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
events for added/changed/removed records.
- `AlarmAckByGUID(VBGUID, comment, oprName, node, domain,
fullName)` for client-driven acknowledgements (matches
PR A.5's `AlarmAckCommand` payload).
- Client-driven acknowledgements. (This draft named `AlarmAckByGUID` and a
`AlarmAckCommand` payload; as built the ack proto is
`AcknowledgeAlarmCommand` / `AcknowledgeAlarmByNameCommand`, the consumer
interface method is `AcknowledgeByGuid` / `AcknowledgeByName`, and the GUID
path is `E_NOTIMPL` so only the by-name path runs — see
[`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented).)
- Lifecycle teardown: `DeregisterConsumer` +
`UninitializeConsumer` + `Marshal.FinalReleaseComObject`.
3. **Conversion layer:** map XML record fields to
`MxAlarmConditionRecord` proto:
- `GUID` → `condition_id` (canonicalize the no-dashes hex
to a UUID string).
- `STATE` enum → `inAlarm` + `acked` booleans
(`UNACK_ALM` → in_alarm=true, acked=false;
`UNACK_RTN` → in_alarm=false, acked=false;
`ACK_ALM` → in_alarm=true, acked=true;
`ACK_RTN` → in_alarm=false, acked=true).
3. **Conversion layer:** map XML record fields to the alarm proto:
- `GUID` and `PROVIDER_NAME!GROUP.TAGNAME` → `alarm_full_reference` (there is
no `condition_id` field; the public RPC and worker carry the reference as
`alarm_full_reference`, either a canonical GUID or `Provider!Group.Tag`).
- `STATE` → `AlarmConditionState` on `ActiveAlarmSnapshot.current_state`
(this draft used `inAlarm` + `acked` booleans, which the proto does not
have). As built, the snapshot state collapses to three values:
`UNACK_ALM` → `Active`; `ACK_ALM` → `ActiveAcked`; `UNACK_RTN` and
`ACK_RTN` both → `Inactive` (a returned-to-normal alarm is no longer
"active"). For the live `transition` feed the `STATE` instead drives an
`AlarmTransitionKind` (`Raise` / `Acknowledge` / `Clear`).
- `DATE + TIME + GMTOFFSET + DSTADJUST` → reassemble UTC
timestamp; matches the worker's existing `Timestamp`
wire format.
@@ -663,10 +682,14 @@ alarm-consumer surface unblocks A.2 fully. Outline:
`aaAlarmManagedClient`, also true here). The existing
`AlarmClientConsumer` skips Initialize entirely; the new
`WnWrapAlarmConsumer` includes it from day one.
5. **Test reuse:** PR A.5's snapshot/ack contract tests can
stay — they don't touch the underlying COM API. Add a new
integration test against the wnwrap surface (live-AVEVA-only,
Skip-gated like the probe).
5. **Test reuse:** the snapshot/ack contract tests stayed — they don't touch
the underlying COM API. As built, the alarm tests live under
`src/ZB.MOM.WW.MxGateway.Worker.Tests/MxAccess/` (`AlarmDispatcherTests`,
`AlarmRecordTransitionMapperTests`, `AlarmCommandHandlerTests`,
`AlarmCommandExecutorTests`, `WnWrapAlarmConsumerXmlTests`), with the
live-AVEVA-only round-trip in
`src/ZB.MOM.WW.MxGateway.Worker.Tests/Probes/AlarmsLiveSmokeTests.cs`
(Skip-gated like the probe).

### Settled API-ordering and surface knowledge

@@ -752,26 +775,47 @@ AVEVA fixes the v2 method later.
The v2 `AlarmAckByGUID(VBGUID, …)` throws `NotImplementedException`
(COM `E_NOTIMPL`) on `wwAlarmConsumerClass` against this AVEVA
build. The reference→GUID lookup that we initially planned to wire
through `AlarmAckByGUID` is therefore not viable on wnwrap; all acks
must go through `AlarmAckByName`.
through `AlarmAckByGUID` is therefore not viable on wnwrap; only the
by-name path actually succeeds.

The proto `AcknowledgeAlarmCommand` (GUID-based) and the worker's
`MxAccessCommandExecutor.ExecuteAcknowledgeAlarm` switch arm remain
in the codebase for the forward-compat shape, but the gateway-side
`WorkerAlarmRpcDispatcher.AcknowledgeAsync` now always routes through
`AcknowledgeAlarmByName` when the public RPC supplies a recognizable
`Provider!Group.Tag` reference.
**Routing as built (and the GUID hazard).** The gateway-side router is
`GatewayAlarmMonitor.BuildAcknowledgeCommand` (there is no
`WorkerAlarmRpcDispatcher` type). Routing is **conditional on the reference
shape**, not unconditional:

### 5. STA / threading — production fix needed
- A reference that `Guid.TryParse` accepts is built into
`MxCommandKind.AcknowledgeAlarm` / `AcknowledgeAlarmCommand` — the **GUID
path**, which the worker dispatches to `AlarmAckByGUID`.
- A `Provider!Group.Tag` reference (parsed by
`GatewayAlarmMonitor.TryParseAlarmReference`) is built into
`MxCommandKind.AcknowledgeAlarmByName` / `AcknowledgeAlarmByNameCommand` — the
by-name path, which is the only one that succeeds on this build.
- Anything else fails with an `alarm_full_reference` parse error before any
worker call.

The wnwrap COM is `ThreadingModel=Apartment`. The consumer's
internal `Timer` fires on threadpool threads and would block forever
on cross-apartment marshaling unless the host STA pumps Win32
messages. The smoke test sidesteps this by setting
`pollIntervalMilliseconds=0` (Timer disabled) and driving `PollOnce`
manually from the test's STA. Production hosting will route polls
through the worker's `StaRuntime` in a follow-up — the consumer's
`PollOnce` is `public` and idempotent so the wire-up is mechanical.
The GUID arm is **still dispatched unguarded**: the proto
`AcknowledgeAlarmCommand` and the worker's
`MxAccessCommandExecutor.ExecuteAcknowledgeAlarm` switch arm remain in the
codebase for forward compatibility, and `BuildAcknowledgeCommand` routes a
GUID-shaped reference straight to them. On the deployed wnwrap build that path
hits the `E_NOTIMPL` `AlarmAckByGUID` and surfaces a `COMException` rather than
acknowledging. **Practical guidance:** acknowledge with the
`Provider!Group.Tag` reference (the same form the transition feed emits in
`alarm_full_reference`), not a raw GUID, until the GUID arm is either guarded or
AVEVA implements `AlarmAckByGUID`.

### 5. STA / threading

The wnwrap COM is `ThreadingModel=Apartment`, so every consumer call
(`Subscribe`, `PollOnce`, the `AcknowledgeBy*` methods) must run on the STA that
created the COM instance. As built, `WnWrapAlarmConsumer` owns **no internal
timer and takes no `pollIntervalMilliseconds` parameter** — an earlier draft
described a self-driven `Timer` that would have blocked on cross-apartment
marshaling, but that design was dropped. Instead `PollOnce()` is a `public`,
idempotent method the host drives on the worker's STA (via
`StaRuntime.InvokeAsync(() => consumer.PollOnce())`); the poll cadence lives in
the host, not the consumer. Each `PollOnce` reads `GetXmlCurrentAlarms2`, diffs
against the previous snapshot, and emits transition events.
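
The "one owning thread, host-driven polling" shape described above can be illustrated with a small language-neutral sketch. This is an analogy in Python, not the worker's C# `StaRuntime`: Python has no COM apartments, and `FakeConsumer`, `SingleThreadRuntime`, and `invoke` are hypothetical names invented for the illustration. The only point it makes is that the consumer is created on, and only ever called from, one dedicated thread, while the caller decides when each poll happens.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeConsumer:
    """Hypothetical stand-in for a thread-affine (apartment-style) consumer."""

    def __init__(self):
        # Remember which thread created the instance.
        self.owner_thread = threading.get_ident()
        self.polls = 0

    def poll_once(self):
        # Refuse calls from any thread other than the creating one.
        if threading.get_ident() != self.owner_thread:
            raise RuntimeError("called off the owning thread")
        self.polls += 1
        return self.polls


class SingleThreadRuntime:
    """Hypothetical analogue of an STA runtime: one worker thread, queued work."""

    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=1)

    def invoke(self, fn):
        # Queue the callable onto the single owning thread; returns a Future.
        return self._executor.submit(fn)

    def shutdown(self):
        self._executor.shutdown(wait=True)


runtime = SingleThreadRuntime()
# Create the consumer on the owning thread, then drive every poll through it.
consumer = runtime.invoke(FakeConsumer).result()
results = [runtime.invoke(consumer.poll_once).result() for _ in range(3)]
runtime.shutdown()
```

The cadence here is simply the caller's loop; a real host would presumably use a timer of its own and marshal each tick onto the owning thread in the same way.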

### Capture summary

@@ -790,3 +834,108 @@ Post-ack transition: kind=Clear …
10s cadence held throughout; full proto fields populated correctly;
ack registered server-side without errors.

## Current alarm path (as built)

The sections above are a discovery record. This section summarizes the path that
actually ships, grounded in the current code. For the proto shapes see
[Contracts](./Contracts.md#alarm-rpcs-and-messages); for the server handlers see
[gRPC](./Grpc.md); for configuration see
[Gateway Configuration](./GatewayConfiguration.md#alarm-options).

### Public RPCs and configuration

Alarms are exposed through three **session-less** RPCs on `MxAccessGateway`:
`AcknowledgeAlarm`, `StreamAlarms`, and `QueryActiveAlarms`. No client opens a
worker session to use them. They are gated by `MxGateway:Alarms:*`:

- `MxGateway:Alarms:Enabled` (default `false`) turns the whole subsystem on.
- `MxGateway:Alarms:SubscriptionExpression` is the canonical
`\\<machine>\Galaxy!<area>` subscription; when empty, the monitor falls back
to `\\<MachineName>\Galaxy!<DefaultArea>` from `MxGateway:Alarms:DefaultArea`.
If the subsystem is enabled while both are empty, the monitor faults with a
configuration diagnostic.
- `MxGateway:Alarms:ReconcileIntervalSeconds` (default 30, floored at 5) sets the
reconcile cadence below.
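
The fallback and floor rules in the list above can be sketched as two small functions. This is a hedged Python illustration, not the gateway's C# options code: `resolve_subscription` and `reconcile_interval_seconds` are hypothetical names, and treating only an empty string (or `None`) as "empty" is an assumption.

```python
def resolve_subscription(expression, default_area, machine_name):
    """Pick the subscription expression, falling back to the default area.

    Hypothetical helper: an explicit expression wins; otherwise the
    \\\\<MachineName>\\Galaxy!<DefaultArea> form is built; with neither set,
    the configuration is rejected.
    """
    if expression:
        return expression
    if default_area:
        return rf"\\{machine_name}\Galaxy!{default_area}"
    raise ValueError("alarms enabled but no SubscriptionExpression or DefaultArea")


def reconcile_interval_seconds(configured=30):
    """Apply the 5-second floor to the configured reconcile interval."""
    return max(5, configured)
```

For example, `resolve_subscription("", "Area1", "HOST")` yields the string `\\HOST\Galaxy!Area1`, and `reconcile_interval_seconds(2)` is raised to `5`.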

### The always-on `GatewayAlarmMonitor` broker

`GatewayAlarmMonitor` (`src/ZB.MOM.WW.MxGateway.Server/Alarms/GatewayAlarmMonitor.cs`)
is registered by `AddGatewayAlarms` as a singleton, as the `IGatewayAlarmService`,
and as a hosted `BackgroundService`. When `Enabled`, it:

1. Opens **one** gateway-managed worker session dedicated to alarms (client name
`gateway-alarm-monitor`, backend `Galaxy`), after a brief startup grace so
worker launching and orphan cleanup settle.
2. Subscribes that session to the resolved subscription expression and feeds an
in-process active-alarm cache (`Dictionary<reference, ActiveAlarmSnapshot>`)
from the session's transition events.
3. Fans the feed out to **any number** of `StreamAlarms` subscribers — clients
never open their own session. The session is transparently re-opened with a
5-second backoff if the worker faults.

### `AlarmFeedMessage` stream protocol

`StreamAsync` (behind `StreamAlarms`) emits, in order:

1. one `AlarmFeedMessage { active_alarm }` per currently-cached alarm matching
the optional `alarm_filter_prefix`,
2. a single `AlarmFeedMessage { snapshot_complete = true }` sentinel,
3. then one `AlarmFeedMessage { transition }` per live change.

The subscriber is registered under the monitor lock **before** the snapshot is
taken, so no transition can slip between the snapshot and the live tail.
`QueryActiveAlarms` reuses the same cache but emits only the `active_alarm`
snapshots and completes — no sentinel, no transitions.
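
The ordering guarantee above (snapshot, then sentinel, then live tail, with nothing lost in between) follows from registering the subscriber and reading the cache under the same lock that broadcasts take. A minimal Python sketch of that idea follows. It is not the `GatewayAlarmMonitor` implementation: `AlarmFeed`, `publish`, `open_stream`, and the dictionary-shaped messages are hypothetical, the queue is unbounded for brevity, and because the text above does not say whether live transitions are also prefix-filtered, the sketch leaves them unfiltered.

```python
import queue
import threading


class AlarmFeed:
    """Hypothetical broker: one cache, many subscribers, one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}        # reference -> current state
        self._subscribers = []  # one queue per open stream

    def publish(self, reference, kind, state):
        # Update the cache and fan out the transition under the lock.
        with self._lock:
            self._cache[reference] = state
            for subscriber in self._subscribers:
                subscriber.put({"transition": (reference, kind)})

    def open_stream(self, prefix=""):
        subscriber = queue.Queue()
        with self._lock:
            # Register first, then snapshot, all inside one critical section,
            # so any later publish() lands after the sentinel in this queue.
            self._subscribers.append(subscriber)
            for reference, state in self._cache.items():
                if reference.startswith(prefix):
                    subscriber.put({"active_alarm": (reference, state)})
            subscriber.put({"snapshot_complete": True})
        return subscriber


def drain(subscriber):
    """Collect everything currently queued for one subscriber."""
    messages = []
    while not subscriber.empty():
        messages.append(subscriber.get_nowait())
    return messages


feed = AlarmFeed()
feed.publish("Galaxy!Area1.Tank1.HiHi", "Raise", "Active")
feed.publish("Other!Area2.Pump7.Fault", "Raise", "Active")
stream = feed.open_stream(prefix="Galaxy!")
feed.publish("Galaxy!Area1.Tank2.Lo", "Raise", "Active")
received = drain(stream)
```

A client reading `received` can treat everything before `snapshot_complete` as initial state and everything after it as incremental change.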

### Reconcile loop

A `PeriodicTimer` runs `ReconcileAsync` every
`max(5, ReconcileIntervalSeconds)` seconds. It pulls the worker's authoritative
active-alarm snapshot and replaces the cache, broadcasting a synthetic `Clear`
transition for any cached alarm the snapshot no longer contains and a synthetic
`Raise` for any alarm the snapshot adds. This catches transitions the live
poll-and-diff feed missed (e.g. across a transport blip). A failed reconcile
pass logs at Debug and keeps the current cache.
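
The reconcile pass is essentially a set difference between the cached references and the authoritative snapshot. A minimal Python sketch under that reading is shown below; `reconcile` is a hypothetical function, and the real `ReconcileAsync` may carry more detail in each synthetic transition than the `(reference, kind)` pairs used here.

```python
def reconcile(cache, authoritative):
    """Return the replacement cache and the synthetic transitions to broadcast.

    cache and authoritative both map alarm reference -> snapshot.
    """
    # Cached alarms the authoritative snapshot no longer contains.
    transitions = [(ref, "Clear") for ref in cache if ref not in authoritative]
    # Alarms the authoritative snapshot adds.
    transitions += [(ref, "Raise") for ref in authoritative if ref not in cache]
    # The authoritative snapshot replaces the cache wholesale.
    return dict(authoritative), transitions


old_cache = {"P!G.A": "Active", "P!G.B": "Active"}
snapshot = {"P!G.B": "ActiveAcked", "P!G.C": "Active"}
new_cache, synthetic = reconcile(old_cache, snapshot)
```

Here `P!G.A` produces a synthetic `Clear`, `P!G.C` a synthetic `Raise`, and `P!G.B` produces nothing because it is present on both sides.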

### Subscriber backpressure

Each subscriber gets a bounded channel of **2048** messages
(`SubscriberQueueCapacity`). When `Broadcast` cannot write to a subscriber (its
channel is full), that subscriber is **completed with an error and dropped** —
the error message tells the client to reconnect to re-snapshot. Backpressure
from one slow consumer never blocks the broker or other subscribers.
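
The drop-on-full policy can be sketched with a bounded queue and a non-blocking write. This Python illustration is an assumption-laden stand-in for the .NET bounded channel: `Subscriber`, `broadcast`, and the error text are hypothetical, and only the capacity value 2048 comes from the description above.

```python
import queue

SUBSCRIBER_QUEUE_CAPACITY = 2048  # capacity quoted in the text above


class Subscriber:
    """Hypothetical subscriber with a bounded message queue."""

    def __init__(self, capacity=SUBSCRIBER_QUEUE_CAPACITY):
        self.messages = queue.Queue(maxsize=capacity)
        self.error = None


def broadcast(subscribers, message):
    """Write to every subscriber without blocking; drop any that are full."""
    for subscriber in list(subscribers):  # copy: the list is mutated below
        try:
            subscriber.messages.put_nowait(message)
        except queue.Full:
            # Complete the slow subscriber with an error and stop feeding it.
            subscriber.error = "alarm feed overflowed; reconnect to re-snapshot"
            subscribers.remove(subscriber)


slow = Subscriber(capacity=2)   # tiny capacity so the overflow is easy to see
fast = Subscriber(capacity=10)
subscribers = [slow, fast]
for sequence in range(3):
    broadcast(subscribers, sequence)
```

After the third broadcast the slow subscriber has been removed with an error set, while the fast one has received all three messages.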

### Snapshot state collapse

`ActiveAlarmSnapshot.current_state` carries only three `AlarmConditionState`
values, so the four AVEVA `STATE`s collapse: `UNACK_ALM` → `Active`,
`ACK_ALM` → `ActiveAcked`, and both `UNACK_RTN` and `ACK_RTN` → `Inactive`
(`AlarmDispatcher`). A returned-to-normal alarm is reported as `Inactive` in a
snapshot even though it is still listed because it is unacknowledged. The live
`transition` feed instead reports `AlarmTransitionKind` (`Raise` / `Acknowledge`
/ `Clear`).
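
Written out as a lookup table, the collapse looks like the sketch below. The Python enum is a hypothetical mirror of the proto's `AlarmConditionState` (only the three value names come from the text), and `snapshot_state` is an illustrative helper rather than the `AlarmDispatcher` code.

```python
from enum import Enum


class AlarmConditionState(Enum):
    """Hypothetical mirror of the three snapshot states named above."""
    ACTIVE = "Active"
    ACTIVE_ACKED = "ActiveAcked"
    INACTIVE = "Inactive"


# Four AVEVA STATE values collapse onto three snapshot states.
SNAPSHOT_STATE = {
    "UNACK_ALM": AlarmConditionState.ACTIVE,
    "ACK_ALM": AlarmConditionState.ACTIVE_ACKED,
    "UNACK_RTN": AlarmConditionState.INACTIVE,
    "ACK_RTN": AlarmConditionState.INACTIVE,
}


def snapshot_state(aveva_state):
    """Map an AVEVA STATE string to its snapshot state; KeyError if unknown."""
    return SNAPSHOT_STATE[aveva_state]
```

Because both returned-to-normal states map to `Inactive`, a snapshot alone cannot tell a client whether such an alarm has been acknowledged.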

### `alarm_full_reference` parse contract

`AcknowledgeAlarm` accepts either form in `alarm_full_reference`
(`GatewayAlarmMonitor.BuildAcknowledgeCommand`):

- a canonical GUID (`Guid.TryParse`) → GUID ack path
(`AcknowledgeAlarmCommand`), which on the deployed wnwrap build hits the
`E_NOTIMPL` `AlarmAckByGUID` — see
[`AlarmAckByGUID` is not implemented](#4-alarmackbyguid-is-not-implemented);
- a `Provider!Group.Tag` reference (`TryParseAlarmReference`: first `!` splits
provider from `Group.Tag`, the first `.` after the `!` splits group from tag)
→ by-name ack path (`AcknowledgeAlarmByNameCommand`), the path that works;
- anything else → a parse error before any worker call.

The transition feed emits the `Provider!Group.Tag` form in
`alarm_full_reference`, so echoing that value back into `AcknowledgeAlarm` takes
the working by-name path.
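
The three-way routing above can be sketched as follows. This is a Python approximation, not `BuildAcknowledgeCommand` itself: `build_acknowledge_command` and the tuple-shaped results are hypothetical, Python's `uuid.UUID` does not accept exactly the same inputs as .NET's `Guid.TryParse`, and rejecting empty provider, group, or tag parts is an assumption.

```python
import uuid


def build_acknowledge_command(reference):
    """Route an alarm_full_reference to the GUID or by-name ack path."""
    # 1. GUID-shaped reference: GUID ack path.
    try:
        return ("AcknowledgeAlarm", str(uuid.UUID(reference)))
    except ValueError:
        pass

    # 2. Provider!Group.Tag: the first "!" splits off the provider, then the
    #    first "." after it splits group from tag (the tag may contain dots).
    provider, bang, rest = reference.partition("!")
    group, dot, tag = rest.partition(".")
    if bang and dot and provider and group and tag:
        return ("AcknowledgeAlarmByName", provider, group, tag)

    # 3. Anything else is rejected before any worker call would be made.
    raise ValueError(f"unrecognized alarm_full_reference: {reference!r}")
```

For example, `build_acknowledge_command("Galaxy!Area1.Tank.Level.HiHi")` returns `("AcknowledgeAlarmByName", "Galaxy", "Area1", "Tank.Level.HiHi")`, which corresponds to the path that works on the deployed build.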

### Reserved / unused

`AlarmTransitionKind.RETRIGGER` is defined in the proto but is **not currently
produced** — the transition mapper emits only `Raise` / `Acknowledge` / `Clear`.
It is reserved for a future "re-raise from a previously cleared condition"
distinction.