docs: post-PR-7.2 cleanup — audit + three-track scrub
Audit (three parallel agent passes) found 43 markdown files carrying stale references to the deleted Galaxy.Host/Proxy/Shared projects after the v2-mxgw merge. This commit lands the prioritized fixes. Track 1 — high-traffic in-place rewrites (3 files, ~454 lines deleted) - README.md (202 → 91 lines): drops .NET 4.8 / x86 / TopShelf install text; leads with the multi-driver .NET 10 server identity and points at scripts/install/Install-Services.ps1 and the parity rig. - docs/v2/driver-specs.md §1 Galaxy (~289 → ~66 lines): replaces the Tier-C out-of-process spec with a Tier-A in-process description matching the current GalaxyDriver code, with the four-section GalaxyDriverOptions JSON shape pulled verbatim from Config/GalaxyDriverOptions.cs. - docs/drivers/Galaxy.md (211 → 92 lines): full rewrite around the current Browse/Runtime/Health/Config sub-folders. Track 2 — historical banners (5 files) - lmx_mxgw.md, lmx_mxgw_impl.md, lmx_backend.md, docs/v2/Galaxy.ParityMatrix.md, docs/v2/implementation/phase-2-galaxy-out-of-process.md each get a "✅ Completed 2026-04-30 — historical record" banner block. lmx_mxgw.md also fixes two dead links (`docs/Galaxy.Driver.md` and `docs/v2/Galaxy.Driver.md`) → `docs/drivers/Galaxy.md`. Track 3 — v1 archive sweep (10 git mv + 1 new index + 2 in-place scrubs) - Moved 10 v1 docs under docs/v1/ preserving subpath structure: AlarmTracking, Configuration, DataTypeMapping, HistoricalDataAccess, Subscriptions (top-level); drivers/Galaxy-Repository, drivers/Galaxy-Test-Fixture; reqs/GalaxyRepositoryReqs, reqs/MxAccessClientReqs, reqs/ServiceHostReqs. - New docs/v1/README.md is the shared archive banner + per-file table. - docs/README.md repointed to the v1 paths and updated to reflect the v2 two-process deploy shape (Server + Admin + optional OtOpcUaWonderwareHistorian). - docs/v2/Galaxy.ParityRig.md got a historical banner + four inline scrubs marking the OtOpcUaGalaxyHost service / Driver.Galaxy.Host EXE / Driver.Galaxy.ParityTests project as deleted-in-PR-7.2. The repo's live-reading surface (README + CLAUDE.md + docs/v2/) now describes only the post-PR-7.2 architecture. v1 docs are preserved as a labelled archive under docs/v1/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1,128 +0,0 @@
|
||||
# Alarm Tracking
|
||||
|
||||
Alarm surfacing is an optional driver capability exposed via `IAlarmSource` (`src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs`). Drivers whose backends have an alarm concept implement it — today: Galaxy (MXAccess alarms), FOCAS (CNC alarms), OPC UA Client (A&C events from the upstream server). Modbus / S7 / AB CIP / AB Legacy / TwinCAT do not implement the interface and the feature is simply absent from their subtrees.
|
||||
|
||||
## IAlarmSource surface
|
||||
|
||||
```csharp
|
||||
Task<IAlarmSubscriptionHandle> SubscribeAlarmsAsync(
|
||||
IReadOnlyList<string> sourceNodeIds, CancellationToken cancellationToken);
|
||||
Task UnsubscribeAlarmsAsync(IAlarmSubscriptionHandle handle, CancellationToken cancellationToken);
|
||||
Task AcknowledgeAsync(IReadOnlyList<AlarmAcknowledgeRequest> acknowledgements,
|
||||
CancellationToken cancellationToken);
|
||||
event EventHandler<AlarmEventArgs>? OnAlarmEvent;
|
||||
```
|
||||
|
||||
The driver fires `OnAlarmEvent` for every transition (`Active`, `Acknowledged`, `Inactive`) with an `AlarmEventArgs` carrying the source node id, condition id, alarm type, message, severity (`AlarmSeverity` enum), and source timestamp.
|
||||
|
||||
## AlarmSurfaceInvoker
|
||||
|
||||
`AlarmSurfaceInvoker` (`src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs`) wraps the three mutating surfaces through `CapabilityInvoker`:
|
||||
|
||||
- `SubscribeAlarmsAsync` / `UnsubscribeAlarmsAsync` run through the `DriverCapability.AlarmSubscribe` pipeline — retries apply under the tier configuration.
|
||||
- `AcknowledgeAsync` runs through `DriverCapability.AlarmAcknowledge` which does NOT retry per decision #143. A timed-out ack may have already registered at the plant floor; replay would silently double-acknowledge.
|
||||
|
||||
Multi-host fan-out: when the driver implements `IPerCallHostResolver`, each source node id is resolved individually and batches are grouped by host so a dead PLC inside a multi-device driver doesn't poison sibling breakers. Single-host drivers fall back to `IDriver.DriverInstanceId` as the pipeline-key host.
|
||||
|
||||
## Condition-node creation via CapturingBuilder
|
||||
|
||||
Alarm-condition nodes are materialized at address-space build time. During `GenericDriverNodeManager.BuildAddressSpaceAsync` the builder is wrapped in a `CapturingBuilder` that observes every `Variable()` call. When a driver calls `IVariableHandle.MarkAsAlarmCondition(AlarmConditionInfo)` on a returned handle, the server-side `DriverNodeManager.VariableHandle` creates a sibling `AlarmConditionState` node and returns an `IAlarmConditionSink`. The wrapper stores the sink in `_alarmSinks` keyed by the variable's full reference, then `GenericDriverNodeManager` registers a forwarder on `IAlarmSource.OnAlarmEvent` that routes each push to the matching sink by `SourceNodeId`. Unknown source ids are dropped silently — they may belong to another driver.
|
||||
|
||||
The `AlarmConditionState` layout matches OPC UA Part 9:
|
||||
|
||||
- `SourceNode` → the originating variable
|
||||
- `SourceName` / `ConditionName` → from `AlarmConditionInfo.SourceName`
|
||||
- Initial state: enabled, inactive, acknowledged, severity per `InitialSeverity`, retain false
|
||||
- `HasCondition` references wire the source variable ↔ the condition node bidirectionally
|
||||
|
||||
Drivers flag alarm-bearing variables at discovery time via `DriverAttributeInfo.IsAlarm = true`. The Galaxy driver, for example, sets this on attributes that have an `AlarmExtension` primitive in the Galaxy repository DB; FOCAS sets it on the CNC alarm register.
|
||||
|
||||
## State transitions
|
||||
|
||||
`ConditionSink.OnTransition` runs under the node manager's `Lock` and maps the `AlarmEventArgs.AlarmType` string to Part 9 state:
|
||||
|
||||
| AlarmType | Action |
|
||||
|---|---|
|
||||
| `Active` | `SetActiveState(true)`, `SetAcknowledgedState(false)`, `Retain = true` |
|
||||
| `Acknowledged` | `SetAcknowledgedState(true)` |
|
||||
| `Inactive` | `SetActiveState(false)`; `Retain = false` once both inactive and acknowledged |
|
||||
|
||||
Severity is remapped: `AlarmSeverity.Low/Medium/High/Critical` → OPC UA numeric 250 / 500 / 700 / 900. `Message.Value` is set from `AlarmEventArgs.Message` on every transition. `ClearChangeMasks(true)` and `ReportEvent(condition)` fire the OPC UA event notification for clients subscribed to any ancestor notifier.
|
||||
|
||||
## Acknowledge dispatch
|
||||
|
||||
Alarm acknowledgement initiated by an OPC UA client flows:
|
||||
|
||||
1. The SDK invokes the `AlarmConditionState.OnAcknowledge` method delegate.
|
||||
2. The handler checks the session's roles for `AlarmAck` — drivers never see a request the session wasn't entitled to make.
|
||||
3. `AlarmSurfaceInvoker.AcknowledgeAsync` is called with the source / condition / comment tuple. The invoker groups by host and runs each batch through the no-retry `AlarmAcknowledge` pipeline.
|
||||
|
||||
Drivers return normally for success or throw to signal the ack failed at the backend.
|
||||
|
||||
## EventNotifier propagation
|
||||
|
||||
Drivers that want hierarchical alarm subscriptions propagate `EventNotifier.SubscribeToEvents` up the containment chain during discovery — the Galaxy driver flips the flag on every ancestor of an alarm-bearing object up to the driver root, mirroring v1 behavior. Clients subscribed at the driver root, a mid-level folder, or the `Objects/` root see alarm events from every descendant with an `AlarmConditionState` sibling. The driver-root `FolderState` is created in `DriverNodeManager.CreateAddressSpace` with `EventNotifier = SubscribeToEvents | HistoryRead` so alarm event subscriptions and alarm history both have a single natural target.
|
||||
|
||||
## ConditionRefresh
|
||||
|
||||
The OPC UA `ConditionRefresh` service queues the current state of every retained condition back to the requesting monitored items. `DriverNodeManager` iterates the node manager's `AlarmConditionState` collection and queues each condition whose `Retain.Value == true` — matching the Part 9 requirement.
|
||||
|
||||
## Alarm historian sink
|
||||
|
||||
Distinct from the live `IAlarmSource` stream and the Part 9 `AlarmConditionState` materialization above, qualifying alarm transitions are **also** persisted to a durable event log for downstream AVEVA Historian ingestion. This is a separate subsystem from the `IHistoryProvider` capability used by `HistoryReadEvents` (see [HistoricalDataAccess.md](HistoricalDataAccess.md#alarm-event-history-vs-ihistoryprovider)): the sink is a *producer* path (server → Historian) that runs independently of any client HistoryRead call.
|
||||
|
||||
### `IAlarmHistorianSink`
|
||||
|
||||
`src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs` defines the intake contract:
|
||||
|
||||
```csharp
|
||||
Task EnqueueAsync(AlarmHistorianEvent evt, CancellationToken cancellationToken);
|
||||
HistorianSinkStatus GetStatus();
|
||||
```
|
||||
|
||||
`EnqueueAsync` is fire-and-forget from the producer's perspective — it must never block the emitting thread. The event payload (`AlarmHistorianEvent` — same file) is source-agnostic: `AlarmId`, `EquipmentPath`, `AlarmName`, `AlarmTypeName` (Part 9 subtype name), `Severity`, `EventKind` (free-form transition string — `Activated` / `Cleared` / `Acknowledged` / `Confirmed` / `Shelved` / …), `Message`, `User`, `Comment`, `TimestampUtc`.
|
||||
|
||||
The sink scope is defined to span every alarm source (plan decision #15: scripted, Galaxy-native, AB CIP ALMD, any future `IAlarmSource`), gated per-alarm by a `HistorizeToAveva` toggle on the producer. Today only `Phase7EngineComposer.RouteToHistorianAsync` (`src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7EngineComposer.cs`) is wired — it subscribes to `ScriptedAlarmEngine.OnEvent` and marshals each emission into `AlarmHistorianEvent`. Galaxy-native alarms continue to reach AVEVA Historian via the driver's direct `aahClientManaged` path and do not flow through the sink; the AB CIP ALMD path remains unwired pending a producer-side integration.
|
||||
|
||||
### `SqliteStoreAndForwardSink`
|
||||
|
||||
Default production implementation (`src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs`). A local SQLite queue absorbs every `EnqueueAsync` synchronously; a background `Timer` drains batches asynchronously to an `IAlarmHistorianWriter` so operator actions are never blocked on historian reachability.
|
||||
|
||||
Queue schema (single table `Queue`): `RowId PK autoincrement`, `AlarmId`, `EnqueuedUtc`, `PayloadJson` (serialized `AlarmHistorianEvent`), `AttemptCount`, `LastAttemptUtc`, `LastError`, `DeadLettered` (bool), plus `IX_Queue_Drain (DeadLettered, RowId)`. Default capacity `1_000_000` non-dead-lettered rows; oldest rows evict with a WARN log past the cap.
|
||||
|
||||
Drain cadence: `StartDrainLoop(tickInterval)` arms a periodic timer. `DrainOnceAsync` reads up to `batchSize` rows (default 100) in `RowId` order and forwards them through `IAlarmHistorianWriter.WriteBatchAsync`, which returns one `HistorianWriteOutcome` per row:
|
||||
|
||||
| Outcome | Action |
|
||||
|---|---|
|
||||
| `Ack` | Row deleted. |
|
||||
| `PermanentFail` | Row flipped to `DeadLettered = 1` with reason. Peers in the batch retry independently. |
|
||||
| `RetryPlease` | `AttemptCount` bumped; row stays queued. Drain worker enters `BackingOff`. |
|
||||
|
||||
Writer-side exceptions treat the whole batch as `RetryPlease`.
|
||||
|
||||
Backoff ladder on `RetryPlease` (hard-coded): 1s → 2s → 5s → 15s → 60s cap. Reset to 0 on any batch with no retries. `CurrentBackoff` exposes the current step for instrumentation; the drain timer itself fires on `tickInterval`, so the ladder governs write cadence rather than timer period.
|
||||
|
||||
Dead-letter retention defaults to 30 days (plan decision #21). `PurgeAgedDeadLetters` runs each drain pass and deletes rows whose `LastAttemptUtc` is past the cutoff. `RetryDeadLettered()` is an operator action that clears `DeadLettered` + resets `AttemptCount` on every dead-lettered row so they rejoin the main queue.
|
||||
|
||||
### Composition and writer resolution
|
||||
|
||||
`Phase7Composer.ResolveHistorianSink` (`src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7Composer.cs`) scans the registered drivers for one that implements `IAlarmHistorianWriter`. Today that is `GalaxyProxyDriver` via `GalaxyHistorianWriter` (`src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy/Ipc/GalaxyHistorianWriter.cs`), which forwards batches over the Galaxy.Host pipe to the `aahClientManaged` alarm schema. When a writer is found, a `SqliteStoreAndForwardSink` is instantiated against `%ProgramData%/OtOpcUa/alarm-historian-queue.db` with a 2 s drain tick and the writer attached. When no driver provides a writer the fallback is the DI-registered `NullAlarmHistorianSink` (`src/ZB.MOM.WW.OtOpcUa.Server/Program.cs`), which silently discards and reports `HistorianDrainState.Disabled`.
|
||||
|
||||
### Status and observability
|
||||
|
||||
`GetStatus()` returns `HistorianSinkStatus(QueueDepth, DeadLetterDepth, LastDrainUtc, LastSuccessUtc, LastError, DrainState)` — two `COUNT(*)` scalars plus last-drain telemetry. `DrainState` is one of `Disabled` / `Idle` / `Draining` / `BackingOff`.
|
||||
|
||||
The Admin UI `/alarms/historian` page surfaces this through `HistorianDiagnosticsService` (`src/ZB.MOM.WW.OtOpcUa.Admin/Services/HistorianDiagnosticsService.cs`), which also exposes `TryRetryDeadLettered` — it calls through to `SqliteStoreAndForwardSink.RetryDeadLettered` when the live sink is the SQLite implementation and returns 0 otherwise.
|
||||
|
||||
## Key source files
|
||||
|
||||
- `src/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs` — capability contract + `AlarmEventArgs`
|
||||
- `src/ZB.MOM.WW.OtOpcUa.Core/Resilience/AlarmSurfaceInvoker.cs` — per-host fan-out + no-retry ack
|
||||
- `src/ZB.MOM.WW.OtOpcUa.Core/OpcUa/GenericDriverNodeManager.cs` — `CapturingBuilder` + alarm forwarder
|
||||
- `src/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs` — `VariableHandle.MarkAsAlarmCondition` + `ConditionSink`
|
||||
- `src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Backend/Alarms/GalaxyAlarmTracker.cs` — Galaxy-specific alarm-event production
|
||||
- `src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/IAlarmHistorianSink.cs` — historian sink intake contract + `AlarmHistorianEvent` + `HistorianSinkStatus` + `IAlarmHistorianWriter`
|
||||
- `src/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs` — durable queue + drain worker + backoff ladder + dead-letter retention
|
||||
- `src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7EngineComposer.cs` — `RouteToHistorianAsync` wires scripted-alarm emissions into the sink
|
||||
- `src/ZB.MOM.WW.OtOpcUa.Server/Phase7/Phase7Composer.cs` — `ResolveHistorianSink` selects `SqliteStoreAndForwardSink` vs `NullAlarmHistorianSink`
|
||||
- `src/ZB.MOM.WW.OtOpcUa.Admin/Services/HistorianDiagnosticsService.cs` — Admin UI `/alarms/historian` status + retry-dead-lettered operator action
|
||||
Reference in New Issue
Block a user