Compare commits
10 Commits: `focas-tier` ... `adr-002-dr`

| Author | SHA1 | Date |
|---|---|---|
| | 2a74daf228 | |
| | 3eb5f1d9da | |
| | f2c1cc84e9 | |
| | 8384e58655 | |
| | 96940aeb24 | |
| | 340f580be0 | |
| | 8d88ffa14d | |
| | 446a5c022c | |
| | 5033609944 | |
| | 9034294b77 | |

@@ -69,14 +69,32 @@ covers the common address shapes; per-model quirks are not stressed.
- Parameter range enforcement (CNC rejects out-of-range writes)
- MTB (machine tool builder) custom screens that expose non-standard data

### 5. Tier-C process isolation behavior
### 5. Tier-C process isolation — architecture shipped, Fwlib32 integration hardware-gated

Per driver-stability.md, FOCAS should run process-isolated because
`Fwlib32.dll` has documented crash modes. The test suite runs in-process +
only exercises the happy path + mapped error codes — a native access
violation from the DLL would take the test host down. The process-isolation
path (similar to Galaxy's out-of-process Host) has been scoped but not
implemented.

The Tier-C architecture is now in place as of PRs #169–#173 (FOCAS
PR A–E, task #220):

- `Driver.FOCAS.Shared` carries MessagePack IPC contracts
- `Driver.FOCAS.Host` (.NET 4.8 x86 Windows service via NSSM) accepts
  a connection on a strictly-ACL'd named pipe + dispatches frames to
  an `IFocasBackend`
- `Driver.FOCAS.Ipc.IpcFocasClient` implements the `IFocasClient` DI
  seam by forwarding over IPC — swap the DI registration and the
  driver runs Tier-C with zero other changes
- `Driver.FOCAS.Supervisor.FocasHostSupervisor` owns the spawn +
  heartbeat + respawn + 3-in-5min crash-loop breaker + sticky alert
- `Driver.FOCAS.Host.Stability.PostMortemMmf` ↔
  `Driver.FOCAS.Supervisor.PostMortemReader` — ring-buffer of the
  last ~1000 IPC operations survives a Host crash

The one remaining gap is the production `FwlibHostedBackend`: an
`IFocasBackend` implementation that wraps the licensed
`Fwlib32.dll` P/Invoke. That's hardware-gated on task #222 — we
need a CNC on the bench (or the licensed FANUC developer kit DLL
with a test harness) to validate it. Until then, the Host ships
`FakeFocasBackend` + `UnconfiguredFocasBackend`. Setting
`OTOPCUA_FOCAS_BACKEND=fake` lets operators smoke-test the whole
Tier-C pipeline end-to-end without any CNC.
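
As a rough illustration of the backend switch described above, the sketch below picks a backend from the value of `OTOPCUA_FOCAS_BACKEND`. The `Name` member on `IFocasBackend`, the `BackendSelector` helper, and the fallback to `UnconfiguredFocasBackend` for every value other than `fake` are assumptions made for this example; the source names the types and the `fake` value but does not show how the Host resolves them.

```csharp
using System;

// Hypothetical stand-in for the real IFocasBackend contract, which is not shown in the source.
public interface IFocasBackend
{
    string Name { get; }
}

public sealed class FakeFocasBackend : IFocasBackend
{
    public string Name => "fake";
}

public sealed class UnconfiguredFocasBackend : IFocasBackend
{
    public string Name => "unconfigured";
}

// Hypothetical helper: the real Host may wire this up differently.
public static class BackendSelector
{
    // Assumed behaviour: "fake" (case-insensitive, surrounding whitespace ignored)
    // selects the fake backend; anything else, including an unset variable,
    // falls back to the unconfigured backend.
    public static IFocasBackend Select(string envValue) =>
        string.Equals(envValue?.Trim(), "fake", StringComparison.OrdinalIgnoreCase)
            ? (IFocasBackend)new FakeFocasBackend()
            : new UnconfiguredFocasBackend();

    public static IFocasBackend FromEnvironment() =>
        Select(Environment.GetEnvironmentVariable("OTOPCUA_FOCAS_BACKEND"));
}
```

Keeping the decision in a pure `Select(string)` function means the selection rule can be unit-tested without touching process environment state.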

## When to trust FOCAS tests, when to reach for a rig

@@ -34,7 +34,8 @@ shaped (neither is a Modbus-side concept).
- `DL205SmokeTests` — FC16 write → FC03 read round-trip on holding register
- `DL205CoilMappingTests` — Y-output / C-relay / X-input address mapping
  (octal → Modbus offset)
- `DL205ExceptionCodeTests` — Modbus exception → OPC UA StatusCode mapping
- `DL205ExceptionCodeTests` — Modbus exception 0x02 → OPC UA `BadOutOfRange` against the dl205 profile (natural out-of-range path)
- `ExceptionInjectionTests` — every other exception code in the mapping table (0x01 / 0x03 / 0x04 / 0x05 / 0x06 / 0x0A / 0x0B) against the `exception_injection` profile on both read + write paths
- `DL205FloatCdabQuirkTests` — CDAB word-swap float encoding
- `DL205StringQuirkTests` — packed-string V-memory layout
- `DL205VMemoryQuirkTests` — V-memory octal addressing
@@ -103,8 +104,13 @@ Not a Modbus concept. Driver doesn't implement `IAlarmSource` or

1. Add `MODBUS_SIM_ENDPOINT` override documentation to
   `docs/v2/test-data-sources.md` so operators can point the suite at a lab rig.
2. Extend `pymodbus` profiles to inject exception responses — a JSON flag per
   register saying "next read returns exception 0x04."
2. ~~Extend `pymodbus` profiles to inject exception responses~~ — **shipped**
   via the `exception_injection` compose profile + standalone
   `exception_injector.py` server. Rules in
   `Docker/profiles/exception_injection.json` map `(fc, address)` to an
   exception code; `ExceptionInjectionTests` exercises every code in
   `MapModbusExceptionToStatus` (0x01 / 0x02 / 0x03 / 0x04 / 0x05 / 0x06 /
   0x0A / 0x0B) end-to-end on both read (FC03) and write (FC06) paths.
3. Add an FX5U profile once a lab rig is available; the scaffolding is in place.
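
To make the `(fc, address)` rule idea concrete, here is a minimal sketch of how an injector could turn a matching rule into a Modbus exception response. The `ExceptionInjectionSketch` class, its in-memory rule table, and the two sample rules are hypothetical; the schema of `exception_injection.json` and the internals of `exception_injector.py` are not shown in the source. The only protocol fact relied on is that a Modbus exception PDU echoes the request's function code with the high bit set, followed by the exception code.

```csharp
using System.Collections.Generic;

public static class ExceptionInjectionSketch
{
    // Hypothetical rule table: (function code, address) -> Modbus exception code.
    private static readonly Dictionary<(byte Fc, ushort Address), byte> Rules =
        new Dictionary<(byte Fc, ushort Address), byte>
        {
            [(0x03, 1000)] = 0x04, // FC03 read at address 1000 -> exception 0x04
            [(0x06, 2000)] = 0x02, // FC06 write at address 2000 -> exception 0x02
        };

    // Returns the two-byte exception PDU when a rule matches,
    // or null so the caller can serve the request normally.
    public static byte[] TryBuildExceptionPdu(byte fc, ushort address)
    {
        if (!Rules.TryGetValue((fc, address), out var code))
            return null;

        // Exception response: function code with bit 7 set, then the exception code.
        return new[] { (byte)(fc | 0x80), code };
    }
}
```

A client-side test can then assert that the driver maps the returned exception code to the expected OPC UA status, which is the path `ExceptionInjectionTests` is described as covering.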

## Key fixture / config files

docs/v2/implementation/adr-002-driver-vs-virtual-dispatch.md (136 lines, normal file)
@@ -0,0 +1,136 @@
# ADR-002 — Driver-vs-virtual dispatch: how `DriverNodeManager` routes reads, writes, and subscriptions across driver tags and virtual (scripted) tags

**Status:** Accepted 2026-04-20 — Option B (single NodeManager + NodeSource tag on the resolver output); Options A and C explicitly rejected.

**Related phase:** [Phase 7 — Scripting Runtime + Scripted Alarms](phase-7-scripting-and-alarming.md) Stream G.

**Related tasks:** #237 Phase 7 Stream G — Address-space integration.

**Related ADRs:** [ADR-001 — Equipment node walker](adr-001-equipment-node-walker.md) (this ADR extends the walker + resolver it established).

## Context

Phase 7 introduces **virtual tags** — OPC UA variables whose values are computed by user-authored C# scripts against other tags (driver or virtual). Per design decision #2 in the Phase 7 plan, virtual tags **live in the Equipment tree alongside driver tags** (not a separate `/Virtual/...` namespace). An operator browsing `Enterprise/Site/Area/Line/Equipment/` sees a flat list of children that includes both driver-sourced variables (e.g. `SpeedSetpoint` coming from a Modbus tag) and virtual variables (e.g. `LineRate` computed from `SpeedSetpoint × 0.95`).

From the operator's perspective there is no difference. From the server's perspective there is a big one: a read / write / subscribe on a driver node must dispatch to a driver's `IReadable` / `IWritable` / `ISubscribable` implementation; the same operation on a virtual node must dispatch to the `VirtualTagEngine`. The existing `DriverNodeManager` (shipped in Phase 1, extended by ADR-001) only knows about the driver case today.

The question is how the dispatch should branch. Three options were considered.

## Options

### Option A — A separate `VirtualTagNodeManager` sibling to `DriverNodeManager`

Register a second `INodeManager` with the OPC UA stack dedicated to virtual-tag nodes. Each tag landed under an Equipment folder would be owned by whichever NodeManager materialized it; mixed folders would have children belonging to two different managers.

**Pros:**
- Clean separation — virtual-tag code never touches driver code paths.
- Independent lifecycle: restart the virtual-tag engine without touching drivers.

**Cons:**
- ADR-001's `EquipmentNodeWalker` was designed as a single walker producing a single tree under one NodeManager. Forking into two walkers (one per source) risks the UNS / Equipment folders existing twice (once per manager) with different child sets, and the OPC UA stack treating them as distinct nodes.
- Mixed equipment folders: when a Line has 3 driver tags + 2 virtual tags, a client browsing the Line folder expects to see 5 children. Two NodeManagers each claiming ownership of the same folder creates a browse-merge problem the stack doesn't handle cleanly.
- ACL binding (Phase 6.2 trie): one scope per Equipment folder, resolved by `NodeScopeResolver`. Two NodeManagers means two resolution paths or shared resolution logic — cross-manager coupling that defeats the separation.
- Audit pathways (Phase 6.2 `IAuditLogger`) and resilience wrappers (Phase 6.1 `CapabilityInvoker`) are wired into the existing `DriverNodeManager`. Duplicating them into a second manager doubles the surface that the Roslyn analyzer from Phase 6.1 Stream A follow-up must keep honest.

**Rejected** because the sharing of folders (Equipment nodes owning both kinds of children) is the common case, not the exception. Two NodeManagers would fight for ownership on every Equipment node.

### Option B — Single `DriverNodeManager`, `NodeScopeResolver` returns a `NodeSource` tag, dispatch branches on source

`NodeScopeResolver` (established in ADR-001) already joins nodes against the config DB to produce a `ScopeId` for ACL enforcement. Extend it to **also return a `NodeSource` enum** (`Driver` or `Virtual`). `DriverNodeManager` dispatch methods check the source and route:

```csharp
internal sealed class DriverNodeManager : CustomNodeManager2
{
    private readonly IReadOnlyDictionary<string, IDriver> _drivers;
    private readonly IVirtualTagEngine _virtualTagEngine;
    private readonly NodeScopeResolver _resolver;

    // Illustrative signature: Task<DataValue> so the switch result can be returned.
    protected override async Task<DataValue> ReadValueAsync(NodeId nodeId, ...)
    {
        var scope = _resolver.Resolve(nodeId);
        // ... ACL check via Phase 6.2 trie (unchanged)
        return scope.Source switch
        {
            NodeSource.Driver => await _drivers[scope.DriverInstanceId].ReadAsync(...),
            NodeSource.Virtual => await _virtualTagEngine.ReadAsync(scope.VirtualTagId, ...),
            // Fail loudly if a NodeSource case is added without a dispatch arm.
            _ => throw new InvalidOperationException($"Unhandled node source: {scope.Source}"),
        };
    }
}
```

**Pros:**
- Single address-space tree. `EquipmentNodeWalker` emits one folder per Equipment node and hangs both driver and virtual children under it. Browse / subscribe fan-out / ACL resolution all happen in one NodeManager with one mental model.
- ACL binding works identically for both kinds. A user with `ReadEquipment` on `Line1/Pump_7` can read every child, driver-sourced or virtual.
- Phase 6.1 resilience wrapping + Phase 6.2 audit logging apply uniformly. The `CapabilityInvoker` analyzer stays correct without new exemptions.
- Adding future source kinds (e.g. a "derived tag" that's neither a driver read nor a script evaluation) is a single-enum-case addition — no new NodeManager.

**Cons:**
- `NodeScopeResolver` becomes slightly chunkier — it now carries dispatch metadata in addition to ACL scope. We own that complexity; the payoff is one tree, one lifecycle.
- A bug in the dispatch branch could leak a driver call into the virtual path or vice versa. Mitigated by an xUnit theory in Stream G.4 that mixes both kinds in one Equipment folder and asserts each routes correctly.

**Accepted.**

### Option C — Virtual tag engine registers as a synthetic `IDriver`

Implement a `VirtualTagDriverAdapter` that wraps `VirtualTagEngine` and registers it alongside real drivers through the existing `DriverTypeRegistry`. Then `DriverNodeManager` dispatches everything through driver plumbing — virtual tags are just "a driver with no wire."

**Pros:**
- Reuses every existing `IDriver` pathway without modification.
- Dispatch branch is trivial because there's no branch — everything routes through driver plumbing.

**Cons:**
- `DriverInstance` is the wrong shape for virtual-tag config: no `DriverType`, no `HostAddress`, no connectivity probe, no lifecycle-initialization parameters, no NSSM wrapper. Forcing it to fit means adding null columns / sentinel values everywhere.
- `IDriver.InitializeAsync` / `IRediscoverable` semantics don't match a scripting engine — the engine doesn't "discover" tags against a wire, it compiles scripts against a config snapshot.
- The resilience Polly wrappers are calibrated for network-bound calls (timeout / retry / circuit breaker). Applying them to a script evaluation is either a pointless passthrough or wrong tuning.
- The Admin UI would need special-casing in every driver-config screen to hide fields that don't apply. The shape mismatch leaks everywhere.

**Rejected** because the fit is worse than Option B's lightweight dispatch branch. The pretense of uniformity would cost more than the branch it avoids.

## Decision

**Option B is accepted.**

`NodeScopeResolver.Resolve(nodeId)` returns a `NodeScope` record with:

```csharp
public sealed record NodeScope(
    string ScopeId,             // ACL scope ID — unchanged from ADR-001
    NodeSource Source,          // NEW: Driver or Virtual
    string? DriverInstanceId,   // populated when Source=Driver
    string? VirtualTagId);      // populated when Source=Virtual

public enum NodeSource
{
    Driver,
    Virtual,
}
```

`DriverNodeManager` holds a single reference to `IVirtualTagEngine` alongside its driver dictionary. Read / Write / Subscribe dispatch pattern-matches on `scope.Source` and routes accordingly. Writes to a virtual node from an OPC UA client return `BadUserAccessDenied` because per Phase 7 decision #6, virtual tags are writable **only** from scripts via `ctx.SetVirtualTag`. That check lives in `DriverNodeManager` before the dispatch branch — a dedicated ACL rule rather than a capability of the engine.

Dispatch tests (Phase 7 Stream G.4) must cover at minimum:
- Mixed Equipment folder (driver + virtual children) browses with all children visible
- Read routes to the correct backend for each source kind
- Subscribe delivers changes from both kinds on the same subscription
- OPC UA client write to a virtual node returns `BadUserAccessDenied` without invoking the engine
- Script-driven write to a virtual node (via `ctx.SetVirtualTag`) updates the value + fires subscription notifications
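
The read routing and the client-write guard described above can be sketched without the OPC UA stack. Everything except `NodeScope` and `NodeSource` is hypothetical here: `DispatchSketch`, the `WriteStatus` enum standing in for OPC UA status codes, and the delegates standing in for the driver and the virtual tag engine are illustration only.

```csharp
#nullable enable
using System;

public enum NodeSource { Driver, Virtual }

public sealed record NodeScope(
    string ScopeId,
    NodeSource Source,
    string? DriverInstanceId,
    string? VirtualTagId);

// Hypothetical status values; a real server would use the OPC UA stack's status codes.
public enum WriteStatus { Good, BadUserAccessDenied }

public static class DispatchSketch
{
    // Routes a read to the backend selected by scope.Source.
    public static object Read(
        NodeScope scope,
        Func<string, object> readDriver,
        Func<string, object> readVirtual) =>
        scope.Source switch
        {
            NodeSource.Driver => readDriver(scope.DriverInstanceId!),
            NodeSource.Virtual => readVirtual(scope.VirtualTagId!),
            _ => throw new InvalidOperationException($"Unknown node source {scope.Source}"),
        };

    // Client writes to virtual nodes are refused before any backend is invoked.
    public static WriteStatus Write(
        NodeScope scope,
        object value,
        Action<string, object> writeDriver)
    {
        if (scope.Source == NodeSource.Virtual)
            return WriteStatus.BadUserAccessDenied;

        writeDriver(scope.DriverInstanceId!, value);
        return WriteStatus.Good;
    }
}
```

Because the guard returns before the dispatch branch, a test can pass a counting delegate and assert that it was never called for a virtual node, which matches the "without invoking the engine" requirement in the list above.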

## Consequences

- `EquipmentNodeWalker` (ADR-001) gains an extra input channel: the config DB's `VirtualTag` table alongside the existing `Tag` table. Walker emits both kinds of children under each Equipment folder with the `NodeSource` tag set per row.
- `NodeScopeResolver` gains a `NodeSource` return value. The change is additive (ADR-001's `ScopeId` field is unchanged), so Phase 6.2's ACL trie keeps working without modification.
- `DriverNodeManager` gains a dispatch branch but the shape of every `I*` call into drivers is unchanged. Phase 6.1's resilience wrapping applies identically to the driver branch; the virtual branch wraps separately (virtual tag evaluation errors map to `BadInternalError` per Phase 7 decision #11, not through the Polly pipeline).
- Adding a future source kind (e.g. an alias tag, a cross-cluster federation tag) is one enum case + one dispatch arm + the equivalent walker extension. The architecture is extensible without rewrite.

## Not Decided (revisitable)

- **Whether `IVirtualTagEngine` should live alongside `IDriver` in `Core.Abstractions` or stay in the Phase 7 project.** Plan currently keeps it in Phase 7's `Core.VirtualTags` project because it's not a driver capability. If Phase 7 Stream G discovers significant shared surface, promote later — not blocking.
- **Whether server-side method calls from OPC UA clients (e.g. a future "force-recompute-this-virtual-tag" admin method) should route through the same dispatch.** Out of scope — virtual tags have no method nodes today; scripted alarm method calls (`OneShotShelve` etc.) route through their own `ScriptedAlarmEngine` path per Phase 7 Stream C.6.

## References

- [Phase 7 — Scripting Runtime + Scripted Alarms](phase-7-scripting-and-alarming.md) Stream G
- [ADR-001 — Equipment node walker](adr-001-equipment-node-walker.md)
- [`docs/v2/plan.md`](../plan.md) decision #110 (Tag-to-Equipment binding)
- [`docs/v2/plan.md`](../plan.md) decision #120 (UNS hierarchy requirements)
- Phase 6.2 `NodeScopeResolver` ACL join

@@ -1,12 +1,13 @@
# FOCAS Tier-C isolation — plan for task #220

> **Status**: DRAFT — not yet started. Tracks the multi-PR work to
> move `Fwlib32.dll` behind an out-of-process host, mirroring the
> Galaxy Tier-C split in [`phase-2-galaxy-out-of-process.md`](phase-2-galaxy-out-of-process.md).
>
> **Status**: PRs A–E shipped. Architecture is in place; the only
> remaining FOCAS work is the hardware-dependent production
> integration of `Fwlib32.dll` into a real `IFocasBackend`
> (`FwlibHostedBackend`), which needs an actual CNC on the bench
> and is tracked as a follow-up on #220.
>
> **Pre-reqs shipped** (this PR): version matrix + pre-flight
> validation + unit tests. Those close the cheap half of the
> hardware-free stability gap. Tier-C closes the expensive half.
>
> **Pre-reqs shipped**: version matrix + pre-flight validation
> (PR #168 — the cheap half of the hardware-free stability gap).

## Why isolate

@@ -79,32 +80,41 @@ its own timer + pushes change notifications so the Proxy doesn't
round-trip per poll. Matches `Driver.Galaxy.Host` subscription
forwarding.

## PR sequence (proposed)
## PR sequence — shipped

1. **PR A — shared contracts**
   Create `Driver.FOCAS.Shared` with the MessagePack DTOs. No
   behaviour change. ~200 LOC + round-trip tests for each DTO.
2. **PR B — Host project skeleton**
   Create `Driver.FOCAS.Host` .NET 4.8 x86 project, NSSM wrapper,
   pipe server scaffold with the same ACL + caller-SID + shared
   secret plumbing as Galaxy.Host. No Fwlib32 wiring yet — returns
   `NotImplemented` for everything. ~400 LOC.
3. **PR C — Move Fwlib32 calls into Host**
   Move `FocasNativeSession`, `FocasTagReader`, `FocasTagWriter`,
   `FocasPmcBitRmw` + the STA thread into the Host. Proxy forwards
   over IPC. This is the biggest PR — probably 800-1500 LOC of
   move-with-translation. Existing unit tests keep passing because
   `IFocasTagFactory` is the DI seam the tests inject against.
4. **PR D — Supervisor + respawn**
   Proxy-side heartbeat + respawn + crash-loop circuit breaker +
   BackPressure fan-out on Host death. ~500 LOC + chaos tests.
5. **PR E — Post-mortem MMF + operational glue**
   MMF writer in Host, reader in Proxy. Install scripts for the
   new `OtOpcUaFocasHost` Windows service. Docs. ~300 LOC.

1. **PR A (#169) — shared contracts** ✅
   `Driver.FOCAS.Shared` netstandard2.0 with MessagePack DTOs for every
   IPC surface (Hello/Heartbeat/OpenSession/Read/Write/PmcBitWrite/
   Subscribe/Probe/RuntimeStatus/Recycle/ErrorResponse) + FrameReader/
   FrameWriter + 24 round-trip tests.
2. **PR B (#170) — Host project skeleton** ✅
   `Driver.FOCAS.Host` net48 x86 Windows Service entry point,
   `PipeAcl` + `PipeServer` + `IFrameHandler` + `StubFrameHandler`.
   ACL denies LocalSystem/Administrators; Hello verifies
   shared-secret + protocol major. 3 handshake tests.
3. **PR C (#171) — IPC path end-to-end** ✅
   Proxy `Ipc/FocasIpcClient` + `Ipc/IpcFocasClient` (implements
   IFocasClient via IPC). Host `Backend/IFocasBackend` +
   `FakeFocasBackend` + `UnconfiguredFocasBackend` +
   `Ipc/FwlibFrameHandler` replacing the stub. 13 new round-trip
   tests via in-memory loopback.
4. **PR D (#172) — Supervisor + respawn** ✅
   `Supervisor/Backoff` (5s→15s→60s) + `CircuitBreaker` (3-in-5min →
   1h→4h→manual) + `HeartbeatMonitor` + `IHostProcessLauncher` +
   `FocasHostSupervisor`. 14 tests.
5. **PR E — Ops glue** ✅ (this PR)
   `ProcessHostLauncher` (real Process.Start + FocasIpcClient
   connect), `Host/Stability/PostMortemMmf` (magic 'OFPC') +
   Proxy `Supervisor/PostMortemReader`, `scripts/install/
   Install-FocasHost.ps1` + `Uninstall-FocasHost.ps1` NSSM wrappers.
   7 tests (4 MMF round-trip + 3 reader format compatibility).

Total estimate: 2200-3200 LOC across 5 PRs. Consistent with Galaxy
Tier-C but narrower since FOCAS has no Historian + no alarm
history.

**Post-shipment totals: 189 FOCAS driver tests + 24 Shared tests + 13 Host tests = 226 FOCAS-family tests green.**

What remains is hardware-dependent: wiring `Fwlib32.dll` P/Invoke
into a real `FwlibHostedBackend` implementation of `IFocasBackend`
+ validating against a live CNC. The architecture is all the
plumbing that work needs.
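
The respawn policy named under PR D (a 5s→15s→60s backoff and a 3-in-5min crash-loop breaker) could look roughly like the sketch below. The class names, the clamping of later attempts to the final delay, and the sliding-window implementation are assumptions for illustration; the shipped `Supervisor/Backoff` and `CircuitBreaker` types are not shown in the source, and the 1h→4h→manual escalation is not modelled here.

```csharp
using System;
using System.Collections.Generic;

public static class BackoffSketch
{
    private static readonly TimeSpan[] Steps =
    {
        TimeSpan.FromSeconds(5),
        TimeSpan.FromSeconds(15),
        TimeSpan.FromSeconds(60),
    };

    // attempt is zero-based; attempts past the last step stay at the final delay (an assumption).
    public static TimeSpan DelayFor(int attempt) =>
        Steps[Math.Min(Math.Max(attempt, 0), Steps.Length - 1)];
}

public sealed class CrashLoopBreakerSketch
{
    private static readonly TimeSpan Window = TimeSpan.FromMinutes(5);
    private const int Threshold = 3;
    private readonly Queue<DateTime> _crashes = new Queue<DateTime>();

    // Records a crash and returns true when Threshold crashes fall inside Window.
    public bool RecordCrash(DateTime utcNow)
    {
        _crashes.Enqueue(utcNow);

        // Drop crashes that have aged out of the sliding window.
        while (_crashes.Count > 0 && utcNow - _crashes.Peek() > Window)
            _crashes.Dequeue();

        return _crashes.Count >= Threshold;
    }
}
```

Passing the clock in as a parameter keeps the breaker deterministic under test, so crash sequences can be replayed with synthetic timestamps.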

## Testing without hardware

docs/v2/implementation/phase-7-scripting-and-alarming.md (190 lines, normal file)
@@ -0,0 +1,190 @@
# Phase 7 — Scripting Runtime, Virtual Tags, and Scripted Alarms

> **Status**: DRAFT — planning output from the 2026-04-20 interactive planning session. Pending review before work begins. Task #230 tracks the draft; #231–#238 are the stream placeholders.
>
> **Branch**: `v2/phase-7-scripting-and-alarming`
> **Estimated duration**: 10–12 weeks (scope-comparable to Phase 6; largest single phase outside Phase 2 Galaxy split)
> **Predecessor**: Phase 6.4 (Admin UI completion) — reuses the tab-plugin pattern + draft/publish flow
> **Successor**: v2 release-readiness capstone

## Phase Objective

Add two **additive** runtime capabilities on top of the existing driver + Equipment address-space foundation:

1. **Virtual (calculated) tags** — OPC UA variables whose values are computed by user-authored C# scripts against other tags (driver or virtual), evaluated on change and/or timer. They live in the existing Equipment/UNS tree alongside driver tags and behave identically to clients (browse, subscribe, historize).
2. **Scripted alarms** — OPC UA Part 9 alarms whose condition is a user-authored C# predicate. Full state machine (EnabledState / ActiveState / AckedState / ConfirmedState / ShelvingState) with persistent operator-supplied state across restarts. Complement the existing Galaxy-native and AB CIP ALMD alarm sources — they do not replace them.

Tie-in capability — **historian alarm sink**:

3. **Aveva Historian as alarm system of record** — every qualifying alarm transition (activation, ack, confirm, clear, shelve, disable, comment) from **any `IAlarmSource`** (scripted + Galaxy + ALMD) routes through a new local SQLite store-and-forward queue to Galaxy.Host, which uses its already-loaded `aahClientManaged` DLLs to write to the Historian's alarm schema. Per-alarm `HistorizeToAveva` toggle gates which sources flow (default off for Galaxy-native since Galaxy itself already historizes them). Plant operators query one uniform historical alarm timeline.

**Why it's additive, not a rewrite**: every `IAlarmSource` implementation shipped in Phase 6.x stays unchanged; scripted alarms register as an additional source in the existing fan-out. The Equipment node walker built in ADR-001 gains a "virtual" source kind alongside "driver" without removing anything. Operator-facing semantics for existing driver tags and alarms are unchanged.

## Design Decisions (locked in the 2026-04-20 planning session)

| # | Decision | Rationale |
|---|---------|-----------|
| 1 | Script language = **C# via Roslyn scripting** | Developer audience, strong typing, AST walkable for dependency inference, existing .NET 10 runtime in main server. |
| 2 | Virtual tags live in the **Equipment tree** alongside driver tags (not a separate `/Virtual/...` namespace) | Operator mental model stays unified; calculated `LineRate` shows up under the Line1 folder next to the driver-sourced `SpeedSetpoint` it's derived from. |
| 3 | Evaluation trigger = **change-driven + timer-driven**; operator chooses per-tag | Change-driven is cheap at steady state; timer is the escape hatch for polling derivations that don't have a discrete "input changed" signal. |
| 4 | Script shape = **Shape A — one script per virtual tag/alarm**; `return` produces the value (or `bool` for alarm condition) | Minimal surface; no predicate/action split. Alarm side-effects (severity, message) configured out-of-band, not in the script. |
| 5 | Alarm fidelity = **full OPC UA Part 9** | Uniform with Galaxy + ALMD on the wire; client-side tooling (HMIs, historians, event pipelines) gets one shape. |
| 6 | Sandbox = **read-only context**; scripts can only read any tag + write to virtual tags | Strict Roslyn `ScriptOptions` allow-list. No HttpClient / File / Process / reflection. |
| 7 | Dependency declaration = **AST inference**; operator doesn't maintain a separate dependency list | `CSharpSyntaxWalker` extracts `ctx.GetTag("path")` string-literal calls at compile time; dynamic paths rejected at publish. |
| 8 | Config storage = **config DB with generation-sealed cache** (same as driver instances) | Virtual tags + alarms publish atomically in the same generation as the driver instance config they may depend on. |
| 9 | Script return value shape (`ctx.GetTag`) = **`DataValue { Value, StatusCode, Timestamp }`** | Scripts branch on quality naturally without separate `ctx.GetQuality(...)` calls. |
| 10 | Historize virtual tags = **per-tag checkbox** | Writes flow through the same history-write path as driver tags. Consumed by existing `IHistoryProvider`. |
| 11 | Per-tag error isolation — a throwing script sets that tag's quality to `BadInternalError`; engine keeps running for every other tag | Mirrors Phase 6.1 Stream B's per-surface error handling. |
| 12 | Dedicated Serilog sink = `scripts-*.log` rolling file; structured-property `ScriptName` for filtering | Keeps noisy script logs out of the main `opcua-*.log`. `ctx.Logger.Info/Warning/Error/Debug` bound in the script context. |
| 13 | Alarm message = **template with substitution** (`"Reactor temp {Reactor/Temp} exceeded {Limit}"`) | Middle ground between static and separate message-script; engine resolves `{path}` tokens at event emission. |
| 14 | Alarm state persistence — `ActiveState` recomputed from tag values on startup; `EnabledState / AckedState / ConfirmedState / ShelvingState` + audit trail persist to config DB | Operators don't re-ack after restart; ack history survives for compliance (GxP / 21 CFR Part 11). |
| 15 | Historian sink scope = **all `IAlarmSource` implementations**, not just scripted; per-alarm `HistorizeToAveva` toggle | Plant gets one consolidated alarm timeline; Galaxy-native alarms default off to avoid duplication. |
| 16 | Historian failure mode = **SQLite store-and-forward queue on the node**; config DB is source of truth, Historian is best-effort projection | Operators never blocked by Historian downtime; failed writes queue + retry when Historian recovers. |
| 17 | Historian ingestion path = **IPC to Galaxy.Host**, which calls the already-loaded `aahClientManaged` DLLs | Reuses existing bitness / licensing / Tier-C isolation. No new 32-bit DLL load in the main server. |
| 18 | Admin UI code editor = **Monaco** via the Admin project's asset pipeline | Industry default for C# editing in a browser; ~3 MB bundle acceptable given Admin is operator-facing only, not public. Revisitable if bundle size becomes a deployment constraint. |
| 19 | Cascade evaluation order = **serial** for v1; parallel promoted to a Phase 7 follow-up | Deterministic, easier to reason about, simplifies cycle + ordering bugs in the rollout. Parallel becomes a tuning knob when real 1000+ virtual-tag deployments measure contention. |
| 20 | Shelving UX = **OPC UA method calls only** (`OneShotShelve` / `TimedShelve` / `Unshelve` on the `AlarmConditionType` node); **no Admin UI shelve controls** | Plant HMIs + OPC UA clients already speak these methods by spec; reinventing the UI adds surface without operator value. Admin still renders current shelve state + audit trail read-only on the alarm detail page. |
| 21 | Dead-lettered historian events retained for **30 days** in the SQLite queue; Admin `/alarms/historian` exposes a "Retry dead-lettered" button | Long enough for a Historian outage or licensing glitch to be resolved + operator to investigate; short enough that the SQLite file doesn't grow unbounded. Configurable via `AlarmHistorian:DeadLetterRetentionDays` for deployments with stricter compliance windows. |
| 22 | Test harness synthetic inputs = **declared inputs only** (from the AST walker's extracted dependency set) | Enforces the dependency declaration — if a path can't be supplied to the harness, the AST walker didn't see it and the script can't reference it at runtime. Catches dependency-inference drift at test time, not publish time. |
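
Decision 13's template substitution can be pictured with a small sketch. The `MessageTemplateSketch` class, the `resolve` callback, and the choice to leave unresolved tokens untouched are assumptions for illustration; the source only states that `{path}` tokens are resolved at event emission.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MessageTemplateSketch
{
    // Matches {token}, where token is any run of characters other than braces.
    private static readonly Regex Token = new Regex(@"\{([^{}]+)\}", RegexOptions.Compiled);

    // resolve maps a tag path to display text; when it returns null,
    // the token is left exactly as written (an assumption of this sketch).
    public static string Render(string template, Func<string, string> resolve) =>
        Token.Replace(template, m => resolve(m.Groups[1].Value) ?? m.Value);
}
```

A caller would supply `resolve` from the current tag values at the moment the alarm event is emitted, so the same template yields a fresh message on every transition.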

## Scope — What Changes

| Concern | Change |
|---------|--------|
| **New project `OtOpcUa.Core.Scripting`** (.NET 10) | Roslyn-based script engine. Compiles user C# scripts with a sandboxed `ScriptOptions` allow-list (numeric / string / datetime / `ScriptContext` API only — no reflection / File / Process / HttpClient). `DependencyExtractor` uses `CSharpSyntaxWalker` to enumerate `ctx.GetTag("...")` literal-string calls; rejects non-literal paths at publish time. Per-script compile cache keyed by source hash. Per-evaluation timeout. Exception in script → tag goes `BadInternalError`; engine unaffected for other tags. `ctx.Logger` is a Serilog `ILogger` bound to the `scripts-*.log` rolling sink with structured property `ScriptName`. |
| **New project `OtOpcUa.Core.VirtualTags`** (.NET 10) | `VirtualTagEngine` consumes the `DependencyExtractor` output, builds a topological dependency graph spanning driver tags + other virtual tags (cycle detection at publish time), schedules re-evaluation on change + on timer, propagates results through an `IVirtualTagSource` that implements `IReadable` + `ISubscribable` so `DriverNodeManager` routes reads / subscriptions uniformly. Per-tag `Historize` flag routes to the same history-write path driver tags use. |
| **New project `OtOpcUa.Core.ScriptedAlarms`** (.NET 10) | `ScriptedAlarmEngine` materializes each configured alarm as an OPC UA `AlarmConditionType` (or `LimitAlarmType` / `OffNormalAlarmType`). On startup, re-evaluates every predicate against current tag values to rebuild `ActiveState` — no persistence needed for the active flag. Persistent state: `EnabledState`, `AckedState`, `ConfirmedState`, `ShelvingState`, branch stack, ack audit (user/time/comment). Template message substitution resolves `{TagPath}` tokens at event emission. Ack / Confirm / Shelve method nodes bound to the engine; transitions audit-logged via the existing `IAuditLogger` (Phase 6.2). Registers as an additional `IAlarmSource` — no change to the existing fan-out. |
| **New project `OtOpcUa.Core.AlarmHistorian`** (.NET 10) | `IAlarmHistorianSink` abstraction + `SqliteStoreAndForwardSink` default implementation. Every qualifying `IAlarmSource` emission (per-alarm `HistorizeToAveva` toggle) persists to a local SQLite queue (`%ProgramData%\OtOpcUa\alarm-historian-queue.db`). Background drain worker reads unsent rows + forwards over IPC to Galaxy.Host. Failed writes keep the row pending with exponential backoff. Queue capacity bounded (default 1M events, oldest-dropped with a structured warning log). |
| **`Driver.Galaxy.Shared`** — new IPC contracts | `HistorianAlarmEventRequest` (activation / ack / confirm / clear / shelve / disable / comment payloads matching the Aveva Historian alarm schema) + `HistorianAlarmEventResponse` (ack / retry-please / permanent-fail). `HistorianConnectivityStatusNotification` so the main server can surface "Historian disconnected" on the Admin `/hosts` page. |
| **`Driver.Galaxy.Host`** — new frame handler for alarm writes | Reuses the already-loaded `aahClientManaged.dll` + `aahClientCommon.dll`. Maps the IPC request DTOs to the historian SDK's alarm-event API (exact method TBD during Stream D.2 — needs a live-historian smoke to confirm the right SDK entry point). Errors map to structured response codes so the main server's backoff logic can distinguish "transient" from "permanent". |
| **Config DB schema** — new tables | `VirtualTag (Id, EquipmentPath, Name, DataType, IntervalMs?, ChangeTriggerEnabled, Historize, ScriptId)`; `Script (Id, SourceCode, CompiledHash, Language='CSharp')`; `ScriptedAlarm (Id, EquipmentPath, Name, AlarmType, Severity, MessageTemplate, HistorizeToAveva, PredicateScriptId)`; `ScriptedAlarmState (AlarmId, EnabledState, AckedState, ConfirmedState, ShelvingState, ShelvingExpiresUtc?, LastAckUser, LastAckComment, LastAckUtc, BranchStack_JSON)`. Every write goes through `sp_PublishGeneration` + `IAuditLogger`. |
| **Address-space build** — Phase 6 `EquipmentNodeWalker` extension | Emits virtual-tag nodes alongside driver-sourced nodes under the same Equipment folder. `NodeScopeResolver` gains a `Virtual` source kind alongside `Driver`. `DriverNodeManager` dispatch routes reads / writes / subscriptions to the `VirtualTagEngine` when the source is virtual. |
| **Admin UI** — new tabs | `/virtual-tags` and `/scripted-alarms` tabs under the existing draft/publish flow. Monaco-based C# code editor (syntax highlighting, IntelliSense against a hand-written type stub for `ScriptContext`). Dependency preview panel shows the inferred input list from the AST walker. Test-harness lets operator supply synthetic `DataValue` inputs + see script output + logger emissions without publishing. Per-alarm controls: `AlarmType`, `Severity`, `MessageTemplate`, `HistorizeToAveva`. New `/alarms/historian` diagnostics view: queue depth, drain rate, last-successful-write, per-alarm "last routed to historian" timestamp. |
|
||||
| **`DriverTypeRegistry`** — no change | Scripting is not a driver — it doesn't register as a `DriverType`. The engine hangs off the same `SealedBootstrap` as drivers but through a different composition root. |
|
||||
|
||||
## Scope — What Does NOT Change

| Item | Reason |
|------|--------|
| Existing `IAlarmSource` implementations (Galaxy, AB CIP ALMD) | Scripted alarms register as an *additional* source; existing sources pass through unchanged. Default `HistorizeToAveva=false` for Galaxy alarms avoids duplicating records the Galaxy historian wiring already captures. |
| Driver capability surface (`IReadable` / `IWritable` / `ISubscribable` / etc.) | Virtual tags implement the same interfaces — drivers and virtual tags are interchangeable from the node manager's perspective. No new capability. |
| Config DB publication flow (`sp_PublishGeneration` + sealed cache) | Virtual tag + alarm tables plug in as additional rows. Atomic publish semantics unchanged. |
| Authorization trie (Phase 6.2) | Virtual-tag nodes inherit the Equipment scope's grants — same treatment as the Phase 6.4 Identification sub-folder. No new scope level. |
| Tier-C isolation topology | Scripting engine runs in the main .NET 10 server process. Roslyn scripts are already sandboxed via `ScriptOptions`; no need for process isolation because they have no unmanaged reach. Galaxy.Host's existing Tier-C boundary already owns the historian SDK writes. |
| Galaxy alarm ingestion path into the historian | Galaxy writes alarms directly via `aahClientManaged` today; Phase 7 Stream D gives it a *second* path (via the new sink) when a Galaxy alarm has `HistorizeToAveva=true`, but the direct path stays for the default case. |
| OPC UA wire protocol / AddressSpace schema | Clients see new nodes under existing folders + new alarm conditions. No new namespaces, no new ObjectTypes beyond what Part 9 already defines. |

## Entry Gate Checklist
|
||||
|
||||
- [ ] All Phase 6.x exit gates cleared (#133, #142, #151, #158)
|
||||
- [ ] Equipment node walker wired into `DriverNodeManager` (task #212 — done)
|
||||
- [ ] `IAuditLogger` surface live (Phase 6.2 Stream A)
|
||||
- [ ] `sp_PublishGeneration` + sealed-cache flow verified on the existing driver-config tables
|
||||
- [ ] Dev Aveva Historian reachable from the dev box (for Stream D.2 smoke)
|
||||
- [ ] `v2` branch clean + baseline tests green
|
||||
- [ ] Blazor editor component library picked (Monaco confirmed vs alternatives — see decision to log)
|
||||
- [ ] Review this plan — decisions #1–#17 signed off, no open questions
|
||||
|
||||
## Task Breakdown
|
||||
|
||||
### Stream A — `Core.Scripting` (Roslyn engine + sandbox + AST inference + logger) — **2 weeks**
|
||||
|
||||
1. **A.1** Project scaffold + NuGet `Microsoft.CodeAnalysis.CSharp.Scripting`. `ScriptOptions` allow-list (`typeof(object).Assembly`, `typeof(Enumerable).Assembly`, the Core.Scripting assembly itself — nothing else). Hand-written `ScriptContext` base class with `GetTag(string)` / `SetVirtualTag(string, object)` / `Logger` / `Now` / `Deadband(double, double, double)` helpers.
|
||||
2. **A.2** `DependencyExtractor : CSharpSyntaxWalker`. Visits every `InvocationExpressionSyntax` targeting `ctx.GetTag` / `ctx.SetVirtualTag`; accepts only a `LiteralExpressionSyntax` argument. Non-literal arguments (concat, variable, method call) → publish-time rejection with an actionable error pointing the operator at the exact span. Outputs `IReadOnlySet<string> Inputs` + `IReadOnlySet<string> Outputs`.
|
||||
3. **A.3** Compile cache. `(source_hash) → compiled Script<T>`. Recompile only when source changes. Warm on `SealedBootstrap`.
|
||||
4. **A.4** Per-evaluation timeout wrapper (default 250ms; configurable per tag). Timeout = tag quality `BadInternalError` + structured warning log. Keeps a single runaway script from starving the engine.
|
||||
5. **A.5** Serilog sink wiring. New `scripts-*.log` rolling file enricher; `ctx.Logger` returns an `ILogger` with `ForContext("ScriptName", ...)`. Main `opcua-*.log` gets a companion entry at WARN level if a script logs ERROR, so the operator sees it in the primary log.
|
||||
6. **A.6** Tests: AST extraction unit tests (30+ cases covering literal / concat / variable / null / method-returned paths); sandbox escape tests (attempt `typeof`, `Assembly.Load`, `File.OpenRead` — all must fail at compile); exception isolation (throwing script doesn't kill the engine); timeout behavior; logger structured-property binding.
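
The literal-only rule from A.2 can be illustrated with a small sketch. The plan describes a Roslyn `CSharpSyntaxWalker`; the Python below is only an analogue that applies the same rule to Python-syntax scripts with the standard `ast` module. The function name `extract_dependencies` and the error text are hypothetical, not part of the planned API.

```python
import ast


def extract_dependencies(source: str) -> tuple[set[str], set[str], list[str]]:
    """Return (inputs, outputs, errors) inferred from ctx.GetTag / ctx.SetVirtualTag calls."""
    inputs: set[str] = set()
    outputs: set[str] = set()
    errors: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Only calls shaped like ctx.GetTag(...) / ctx.SetVirtualTag(...) matter here.
        if not (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id == "ctx"
                and func.attr in ("GetTag", "SetVirtualTag")):
            continue
        first = node.args[0] if node.args else None
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            (inputs if func.attr == "GetTag" else outputs).add(first.value)
        else:
            # Concatenation, variables and method calls are all rejected, with the
            # source position so the author can find the offending call.
            errors.append(
                f"line {node.lineno}, column {node.col_offset}: "
                f"ctx.{func.attr} requires a string-literal tag path")
    return inputs, outputs, errors
```

Collecting every error instead of stopping at the first one lets a publish attempt report all offending call sites in a single pass; whether the real extractor does the same is not stated in the plan.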
|
||||
|
||||
### Stream B — Virtual tag engine (dependency graph + change/timer schedulers + historize) — **1.5 weeks**
|
||||
|
||||
1. **B.1** `VirtualTagEngine`. Ingests the set of compiled scripts + their inputs/outputs; builds a directed dependency graph (driver tag ID → virtual tag ID → virtual tag ID). Cycle detection at publish-time via Tarjan; publish rejects with a clear error message listing the cycle.
|
||||
2. **B.2** `ChangeTriggerDispatcher`. Subscribes to every referenced driver tag via the existing `ISubscribable` fan-out. On a `DataValueSnapshot` delta (value / status / timestamp — any of the three), enqueues affected virtual tags for re-evaluation in topological order.
|
||||
3. **B.3** `TimerTriggerDispatcher`. Per-tag `IntervalMs` scheduled via a shared timer-wheel. Independent of change triggers — a tag can have both, either, or neither.
|
||||
4. **B.4** `EvaluationPipeline`. Serial evaluation per cascade (parallel promoted to a follow-up — avoids cross-tag ordering bugs on first rollout). Exception handling per A.4; propagates results via `IVirtualTagSource`.
|
||||
5. **B.5** `IVirtualTagSource` implementation. Implements `IReadable` + `ISubscribable`. Reads return the most recent evaluated value; subscriptions receive `OnDataChange` events on each re-evaluation.
|
||||
6. **B.6** History routing. Per-tag `Historize` flag emits the value + timestamp to the existing history-write path used by drivers.
|
||||
7. **B.7** Tests: dependency graph (happy + cycle); change cascade through two levels of virtual tags; timer-only tag ignores input changes; change + timer both configured; error propagation; historize on/off.
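
The publish-time cycle check from B.1 and the evaluation order used by B.2 can be sketched together. This is a minimal Python sketch of Tarjan's strongly-connected-components algorithm, assuming the graph arrives as a map from each virtual tag to the tags it reads; `plan_evaluation` and its return shape are hypothetical, and the real engine is a .NET component.

```python
def plan_evaluation(deps: dict[str, set[str]]) -> tuple[list[str], list[list[str]]]:
    """deps maps each virtual tag to the tags it reads.

    Returns (order, cycles): `order` lists tags with inputs before dependents,
    `cycles` lists every group of tags that depend on each other.
    """
    index: dict[str, int] = {}
    low: dict[str, int] = {}
    stack: list[str] = []
    on_stack: set[str] = set()
    order: list[str] = []
    cycles: list[list[str]] = []

    def visit(node: str) -> None:
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for dep in sorted(deps.get(node, ())):
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:
            # node is the root of a strongly connected component: pop it off.
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            if len(component) > 1 or node in deps.get(node, ()):
                cycles.append(sorted(component))  # includes self-references
            else:
                order.append(node)

    for tag in sorted(deps):
        if tag not in index:
            visit(tag)
    return order, cycles
```

Because edges point from a tag to its inputs, Tarjan completes each input before the tags that read it, so the emission order doubles as the evaluation order. The recursive form is kept for readability; a very deep dependency chain would need an iterative variant.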
|
||||
|
||||
### Stream C — Scripted alarm engine + Part 9 state machine + template messages — **2.5 weeks**
|
||||
|
||||
1. **C.1** Alarm config model + `ScriptedAlarmEngine` skeleton. Alarms materialize as `AlarmConditionType` (or subtype — `LimitAlarm`, `OffNormal`) nodes under their configured Equipment path. Severity loaded from config.
|
||||
2. **C.2** `Part9StateMachine`. Tracks `EnabledState`, `ActiveState`, `AckedState`, `ConfirmedState`, `ShelvingState` per condition ID. Shelving has `OneShotShelving` + `TimedShelving` variants + an `UnshelveTime` timer.
|
||||
3. **C.3** Predicate evaluation. On any input change (same trigger mechanism as Stream B), run the `bool` predicate. On `false → true` transition, activate (increment branch stack if prior Ack-but-not-Confirmed state exists). On `true → false`, clear (but keep condition visible if retain flag set).
|
||||
4. **C.4** Startup recovery. For every configured alarm, run the predicate against current tag values to rebuild `ActiveState` *only*. Load `EnabledState` / `AckedState` / `ConfirmedState` / `ShelvingState` + audit from the `ScriptedAlarmState` table. No re-acknowledgment required for conditions that were acked before restart.
|
||||
5. **C.5** Template substitution. Engine resolves `{TagPath}` tokens in `MessageTemplate` at event emission time using current tag values. Unresolvable tokens (bad path, missing tag) emit a structured error log + substitute `{?}` so the event still fires.
|
||||
6. **C.6** OPC UA method binding. `Acknowledge`, `Confirm`, `AddComment`, `OneShotShelve`, `TimedShelve`, `Unshelve` methods on each condition node route to the engine + persist via audit-logged writes to `ScriptedAlarmState`.
|
||||
7. **C.7** `IAlarmSource` implementation. Emits Part 9-shaped events through the existing fan-out the `AlarmTracker` composes.
|
||||
8. **C.8** Tests: every transition (all 32 state combinations the state machine can produce); startup recovery (seed table with varied ack/confirm/shelve state, restart, verify correct recovery); template substitution (literal path, nested path, bad path); shelving timer expiry; OPC UA method calls via Client.CLI.
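
The template substitution in C.5 is simple enough to sketch. The Python below is an illustration only, assuming current tag values are available as a dictionary keyed by tag path; `render_message` is a hypothetical name, and returning the unresolved paths stands in for the structured error log the plan calls for.

```python
import re

_TOKEN = re.compile(r"\{([^{}]+)\}")


def render_message(template: str, tag_values: dict[str, object]) -> tuple[str, list[str]]:
    """Resolve {TagPath} tokens; unresolvable tokens become {?} so the event still fires."""
    unresolved: list[str] = []

    def replace(match: re.Match) -> str:
        path = match.group(1)
        if path in tag_values:
            return str(tag_values[path])
        unresolved.append(path)  # caller would log these as a structured error
        return "{?}"

    return _TOKEN.sub(replace, template), unresolved
```

For example, `render_message("Speed {Line1/Speed} exceeds limit", {"Line1/Speed": 42.5})` yields the message `Speed 42.5 exceeds limit` with an empty unresolved list.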
|
||||
|
||||
### Stream D — Historian alarm sink (SQLite store-and-forward + Galaxy.Host IPC) — **2 weeks**
|
||||
|
||||
1. **D.1** `Core.AlarmHistorian` project. `IAlarmHistorianSink` interface; `SqliteStoreAndForwardSink` default implementation using Microsoft.Data.Sqlite. Schema: `Queue (RowId, AlarmId, EventType, PayloadJson, EnqueuedUtc, LastAttemptUtc?, AttemptCount, DeadLettered)`. Queue capacity bounded; oldest-dropped on overflow with structured warning.
|
||||
2. **D.2** **Live-historian smoke** against the dev box's Aveva Historian. Identify the exact `aahClientManaged` alarm-write API entry point (likely `IAlarmsDatabase.WriteAlarmEvent` or equivalent — verify with a throwaway Galaxy.Host test hook). Document in a short `docs/v2/historian-alarm-api.md` artifact.
|
||||
3. **D.3** `Driver.Galaxy.Shared` contract additions. `HistorianAlarmEventRequest` / `HistorianAlarmEventResponse` / `HistorianConnectivityStatusNotification`. Round-trip tests in `Driver.Galaxy.Shared.Tests`.
|
||||
4. **D.4** `Driver.Galaxy.Host` handler. Translates incoming `HistorianAlarmEventRequest` to the SDK call identified in D.2. Returns structured response (Ack / RetryPlease / PermanentFail). Connectivity notifications sent proactively when the SDK's session drops.
|
||||
5. **D.5** Drain worker in the main server. Polls the SQLite queue; batches up to 100 events per IPC round-trip; exponential backoff on `RetryPlease` (1s → 2s → 5s → 15s → 60s cap); `PermanentFail` dead-letters the row + structured error log.
|
||||
6. **D.6** Per-alarm toggle wired through: a `HistorizeToAveva` column on `ScriptedAlarm`, plus a new `AlarmHistorizationPolicy` projection that the Galaxy / ALMD alarm sources consult (default `false` for Galaxy, `true` for scripted, operator-adjustable per-alarm).
|
||||
7. **D.7** `/alarms/historian` diagnostics view in Admin. Queue depth, drain rate, last-successful-write, last-error, per-alarm last-routed timestamp.
|
||||
8. **D.8** Tests: SQLite queue round-trip; drain worker with fake IPC (success / retry / perm-fail); overflow eviction; Galaxy.Host handler against a stub historian API; end-to-end with the live historian on the dev box (non-CI — operator-invoked).
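
The store-and-forward behavior in D.1 and D.5 can be sketched against the queue schema given above. This is a Python illustration using the standard `sqlite3` module, not the planned `Microsoft.Data.Sqlite` implementation. The function names, the outcome strings, and the choice to delete a row once it is acknowledged are assumptions; the plan only says the worker reads unsent rows.

```python
import sqlite3

BACKOFF_SECONDS = (1, 2, 5, 15, 60)  # ladder from D.5; the last value is the cap


def backoff_for(attempt_count: int) -> int:
    """Delay in seconds before the next retry, after `attempt_count` failed attempts."""
    rung = min(max(attempt_count, 1), len(BACKOFF_SECONDS))
    return BACKOFF_SECONDS[rung - 1]


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS Queue (
            RowId          INTEGER PRIMARY KEY AUTOINCREMENT,
            AlarmId        TEXT    NOT NULL,
            EventType      TEXT    NOT NULL,
            PayloadJson    TEXT    NOT NULL,
            EnqueuedUtc    TEXT    NOT NULL,
            LastAttemptUtc TEXT,
            AttemptCount   INTEGER NOT NULL DEFAULT 0,
            DeadLettered   INTEGER NOT NULL DEFAULT 0
        )""")
    return db


def enqueue(db, alarm_id, event_type, payload_json, now_utc, capacity) -> int:
    """Insert one event, then evict the oldest rows beyond `capacity`. Returns rows dropped."""
    with db:  # one transaction: insert + eviction commit together
        db.execute(
            "INSERT INTO Queue (AlarmId, EventType, PayloadJson, EnqueuedUtc) "
            "VALUES (?, ?, ?, ?)", (alarm_id, event_type, payload_json, now_utc))
        dropped = db.execute(
            "DELETE FROM Queue WHERE RowId IN ("
            "SELECT RowId FROM Queue ORDER BY RowId DESC LIMIT -1 OFFSET ?)",
            (capacity,)).rowcount
    return dropped  # caller would emit the structured overflow warning if > 0


def next_batch(db, batch_size: int = 100):
    """Oldest pending rows first, skipping dead-lettered ones."""
    return db.execute(
        "SELECT RowId, AlarmId, EventType, PayloadJson FROM Queue "
        "WHERE DeadLettered = 0 ORDER BY RowId LIMIT ?", (batch_size,)).fetchall()


def record_result(db, row_id: int, outcome: str, now_utc: str) -> None:
    """outcome: 'ack', 'retry' or 'permanent_fail' (hypothetical names for the three responses)."""
    with db:
        if outcome == "ack":
            db.execute("DELETE FROM Queue WHERE RowId = ?", (row_id,))
        else:
            db.execute(
                "UPDATE Queue SET AttemptCount = AttemptCount + 1, LastAttemptUtc = ?, "
                "DeadLettered = ? WHERE RowId = ?",
                (now_utc, 1 if outcome == "permanent_fail" else 0, row_id))
```

Ordering by `RowId` keeps the drain in enqueue order, which is what the "queue drains all 10 in order" compliance check relies on.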
|
||||
|
||||
### Stream E — Config DB schema + generation-sealed cache extensions — **1 week**
|
||||
|
||||
1. **E.1** EF migration for new tables. Foreign keys from `VirtualTag.ScriptId` / `ScriptedAlarm.PredicateScriptId` to `Script.Id`.
|
||||
2. **E.2** `sp_PublishGeneration` extension. Sealed-cache snapshot includes virtual tags + scripted alarms + their scripts. Atomic publish guarantees the address-space build sees a consistent view.
|
||||
3. **E.3** CRUD services. `VirtualTagService`, `ScriptedAlarmService`, `ScriptService`. Each audit-logged; Ack / Confirm / Shelve persist through `ScriptedAlarmStateService` with full audit trail (who / when / comment / previous state).
|
||||
4. **E.4** Tests: migration up / down; publish atomicity (concurrent writes to different alarm rows don't leak into an in-flight publish); audit trail on every mutation.
|
||||
|
||||
### Stream F — Admin UI scripting tab — **2 weeks**
|
||||
|
||||
1. **F.1** Monaco editor Razor component. CSS-isolated; loads Monaco via NPM + the Admin project's existing asset pipeline. C# syntax highlighting (Monaco ships it). IntelliSense via a hand-written `ScriptContext.cs` type stub delivered with the editor (not the compiled Core.Scripting DLL — keeps the browser bundle small).
|
||||
2. **F.2** `/virtual-tags` tab. List view (Equipment path / Name / DataType / inputs-summary / Historize / actions). Edit pane splits: Monaco editor left, dependency preview panel right (live-updates from a debounced `/api/scripting/analyze` endpoint that runs the `DependencyExtractor`). Publish button gated by Phase 6.2 `WriteConfigure` permission.
|
||||
3. **F.3** `/scripted-alarms` tab. Same editor shape + extra controls: AlarmType dropdown, Severity slider, MessageTemplate textbox with live-preview showing `{path}` token resolution against latest tag values, `HistorizeToAveva` checkbox. **Alarm detail page displays current `ShelvingState` + `LastAckUser / LastAckUtc / LastAckComment` read-only** — no shelve/unshelve / ack / confirm buttons per decision #20. Operators drive state transitions via OPC UA method calls from plant HMIs or the Client.CLI.
|
||||
4. **F.4** Test harness. Modal that lets the operator supply synthetic `DataValue` inputs for the dependency set + see script output + logger emissions (rendered in a virtual terminal). Enables testing without publishing.
|
||||
5. **F.5** Script log viewer. SignalR stream of the `scripts-*.log` sink filtered by the script under edit (using the structured `ScriptName` property). Tail-last-200 + "load more".
|
||||
6. **F.6** `/alarms/historian` diagnostics view per Stream D.7.
|
||||
7. **F.7** Playwright smoke. Author a calc tag, publish, verify it appears in the equipment tree via a probe OPC UA read. Author an alarm, verify it appears in `AlarmsAndConditions`.
|
||||
|
||||
### Stream G — Address-space integration — **1 week**
|
||||
|
||||
1. **G.1** `EquipmentNodeWalker` extension. Current walker iterates driver tags per equipment; extend to also iterate virtual tags + alarms. `NodeScopeResolver` returns `NodeSource.Virtual` for virtual nodes and `NodeSource.Driver` for existing.
|
||||
2. **G.2** `DriverNodeManager` dispatch. Read / Write / Subscribe operations check the resolved source and route to `VirtualTagEngine` or the driver as appropriate. Writes to virtual tags allowed only from scripts (per decision #6) — OPC UA client writes to a virtual node return `BadUserAccessDenied`.
|
||||
3. **G.3** `AlarmTracker` composition. The `ScriptedAlarmEngine` registers as an additional `IAlarmSource` — no new composition code, the existing fan-out already accepts multiple sources.
|
||||
4. **G.4** Tests: mixed equipment folder (driver tag + virtual tag + driver-native alarm + scripted alarm) browsable via Client.CLI; read / subscribe round-trip for the virtual tag; scripted alarm transitions visible in the alarm event stream.
|
||||
|
||||
### Stream H — Exit gate — **1 week**
|
||||
|
||||
1. **H.1** Compliance script real-checks: schema migrations applied; new tables populated from a draft→publish cycle; sealed-generation snapshot includes virtual tags + alarms; SQLite alarm queue initialized; `scripts-*.log` sink emitting; `AlarmConditionType` nodes materialize in the address space; per-alarm `HistorizeToAveva` toggle enforced end-to-end.
|
||||
2. **H.2** Full-solution `dotnet test` baseline. Target: Phase 6 baseline + ~300 new tests across Streams A–G.
|
||||
3. **H.3** `docs/v2/plan.md` Migration Strategy §6 update — add Phase 7.
|
||||
4. **H.4** Phase-status memory update.
|
||||
5. **H.5** Merge `v2/phase-7-scripting-and-alarming` → `v2`.
|
||||
|
||||
## Compliance Checks (run at exit gate)
|
||||
|
||||
- [ ] **Sandbox escape**: attempts to reference `System.IO.File`, `System.Net.Http.HttpClient`, `System.Diagnostics.Process`, or `typeof(X).Assembly.Load` fail at script compile with an actionable error.
|
||||
- [ ] **Dependency inference**: `ctx.GetTag(myStringVar)` (non-literal path) is rejected at publish with a span-pointed error; `ctx.GetTag("Line1/Speed")` is accepted + appears in the inferred input set.
|
||||
- [ ] **Change cascade**: tag A → virtual tag B → virtual tag C. When A changes, B recomputes, then C recomputes. Single change event triggers the full cascade in topological order within one evaluation pass.
|
||||
- [ ] **Cycle rejection**: publish a config where virtual tag B depends on A and A depends on B. Publish fails pre-commit with a clear cycle message.
|
||||
- [ ] **Startup recovery**: seed `ScriptedAlarmState` with one acked+confirmed alarm + one shelved alarm + one clean alarm, restart, verify operator does NOT see ack prompts for the first two, shelving remains in effect, clean alarm is clear.
|
||||
- [ ] **Ack audit**: acknowledge an alarm; `IAuditLogger` captures user / timestamp / comment / prior state; row persists through restart.
|
||||
- [ ] **Historian queue durability**: take Galaxy.Host offline, fire 10 alarm transitions, bring Galaxy.Host back; queue drains all 10 in order.
|
||||
- [ ] **Per-alarm historian toggle**: Galaxy-native alarm with `HistorizeToAveva=false` does NOT enqueue; scripted alarm with `HistorizeToAveva=true` DOES enqueue.
|
||||
- [ ] **Script timeout**: infinite-loop script times out at 250ms; tag quality `BadInternalError`; other tags unaffected.
|
||||
- [ ] **Log isolation**: `ctx.Logger.Error("test")` lands in `scripts-*.log` with structured property `ScriptName=<name>`; main `opcua-*.log` gets a WARN companion entry.
|
||||
- [ ] **ACL binding**: virtual tag under an Equipment scope inherits the Equipment's grants. User without the Equipment grant reads the virtual tag and gets `BadUserAccessDenied`.
|
||||
|
||||
## Decisions Resolved in Plan Review
|
||||
|
||||
Every open question from the initial draft was resolved in the 2026-04-20 plan review — see decisions #18–#22 in the decisions table above. No pending questions block Stream A.
|
||||
|
||||
## References
|
||||
|
||||
- [`docs/v2/plan.md`](../plan.md) §6 Migration Strategy — add Phase 7 as the final additive phase before v2 release readiness.
|
||||
- [`docs/v2/implementation/overview.md`](overview.md) — phase gate conventions.
|
||||
- [`docs/v2/implementation/phase-6-2-authorization-runtime.md`](phase-6-2-authorization-runtime.md) — `IAuditLogger` surface reused for Ack/Confirm/Shelve + script edits.
|
||||
- [`docs/v2/implementation/phase-6-4-admin-ui-completion.md`](phase-6-4-admin-ui-completion.md) — draft/publish flow, diff viewer, tab-plugin pattern reused.
|
||||
- [`docs/v2/implementation/phase-2-galaxy-out-of-process.md`](phase-2-galaxy-out-of-process.md) — Galaxy.Host IPC shape + shared-contract conventions reused for Stream D.
|
||||
- [`docs/v2/driver-specs.md`](../driver-specs.md) §Alarm semantics — Part 9 fidelity requirements.
|
||||
- [`docs/v2/driver-stability.md`](../driver-stability.md) — per-surface error handling, crash-loop breaker patterns Stream A.4 mirrors.
|
||||
- [`docs/v2/config-db-schema.md`](../config-db-schema.md) — add a Phase 7 §§ for `VirtualTag`, `Script`, `ScriptedAlarm`, `ScriptedAlarmState`.
|
||||
108
scripts/install/Install-FocasHost.ps1
Normal file
@@ -0,0 +1,108 @@
|
||||
<#
|
||||
.SYNOPSIS
|
||||
Registers the OtOpcUaFocasHost Windows service. Optional companion to
|
||||
Install-Services.ps1 — only run this on nodes where FOCAS driver instances will run
|
||||
with Tier-C process isolation enabled.
|
||||
|
||||
.DESCRIPTION
|
||||
FOCAS task #220 / Tier-C isolation plan. Wraps OtOpcUa.Driver.FOCAS.Host.exe (net48 x86)
|
||||
as a Windows service using NSSM, running under the same service account as the main
|
||||
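OtOpcUa service so the named-pipe ACL works. Passes the per-process shared secret to the
Host via environment variable (NSSM AppEnvironmentExtra). NSSM persists that value in the
service's registry parameters, so it is kept off the command line but not off disk.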
|
||||
|
||||
.PARAMETER InstallRoot
|
||||
Where the FOCAS Host binaries live (typically
|
||||
C:\Program Files\OtOpcUa\Driver.FOCAS.Host).
|
||||
|
||||
.PARAMETER ServiceAccount
|
||||
Service account SID or DOMAIN\name. Must match the main OtOpcUa server account so the
|
||||
PipeAcl match succeeds.
|
||||
|
||||
.PARAMETER FocasSharedSecret
|
||||
Per-process secret passed via env var. Generated freshly per install if not supplied.
|
||||
|
||||
.PARAMETER FocasBackend
|
||||
Backend selector for the Host process. One of:
|
||||
fwlib32 (real Fanuc Fwlib32.dll integration; requires licensed DLL on PATH)
|
||||
fake (in-memory; smoke-test mode)
|
||||
unconfigured (safe default returning structured errors; use until hardware is wired)
|
||||
|
||||
.PARAMETER FocasPipeName
|
||||
Pipe name the Host listens on. Default: OtOpcUaFocas.
|
||||
|
||||
.EXAMPLE
|
||||
.\Install-FocasHost.ps1 -InstallRoot 'C:\Program Files\OtOpcUa\Driver.FOCAS.Host' `
|
||||
-ServiceAccount 'OTOPCUA\svc-otopcua' -FocasBackend fwlib32
|
||||
#>
|
||||
[CmdletBinding()]
|
||||
param(
|
||||
[Parameter(Mandatory)] [string]$InstallRoot,
|
||||
[Parameter(Mandatory)] [string]$ServiceAccount,
|
||||
[string]$FocasSharedSecret,
|
||||
[ValidateSet('fwlib32','fake','unconfigured')] [string]$FocasBackend = 'unconfigured',
|
||||
[string]$FocasPipeName = 'OtOpcUaFocas',
|
||||
[string]$ServiceName = 'OtOpcUaFocasHost',
|
||||
[string]$NssmPath = 'C:\Program Files\nssm\nssm.exe'
|
||||
)
|
||||
|
||||
$ErrorActionPreference = 'Stop'
|
||||
|
||||
function Resolve-Sid {
|
||||
param([string]$Account)
|
||||
if ($Account -match '^S-\d-\d+') { return $Account }
|
||||
try {
|
||||
$nt = New-Object System.Security.Principal.NTAccount($Account)
|
||||
return $nt.Translate([System.Security.Principal.SecurityIdentifier]).Value
|
||||
} catch {
|
||||
throw "Could not resolve '$Account' to a SID. Pass an explicit SID or check the account name."
|
||||
}
|
||||
}
|
||||
|
||||
if (-not (Test-Path $NssmPath)) {
|
||||
throw "nssm.exe not found at '$NssmPath'. Install NSSM or pass -NssmPath."
|
||||
}
|
||||
|
||||
$hostExe = Join-Path $InstallRoot 'OtOpcUa.Driver.FOCAS.Host.exe'
|
||||
if (-not (Test-Path $hostExe)) {
|
||||
throw "FOCAS Host binary not found at '$hostExe'. Publish the Driver.FOCAS.Host project first."
|
||||
}
|
||||
|
||||
if (-not $FocasSharedSecret) {
|
||||
$FocasSharedSecret = [System.Guid]::NewGuid().ToString('N')
|
||||
Write-Host "Generated FocasSharedSecret — store it alongside the OtOpcUa service config."
|
||||
}
|
||||
|
||||
$allowedSid = Resolve-Sid $ServiceAccount
|
||||
|
||||
# Idempotent install — remove + re-create if present.
|
||||
$existing = Get-Service -Name $ServiceName -ErrorAction SilentlyContinue
|
||||
if ($existing) {
|
||||
Write-Host "Removing existing '$ServiceName' service..."
|
||||
& $NssmPath stop $ServiceName confirm | Out-Null
|
||||
& $NssmPath remove $ServiceName confirm | Out-Null
|
||||
}
|
||||
|
||||
& $NssmPath install $ServiceName $hostExe | Out-Null
|
||||
& $NssmPath set $ServiceName DisplayName 'OT-OPC-UA FOCAS Host (Tier-C isolated Fwlib32)' | Out-Null
|
||||
& $NssmPath set $ServiceName Description 'Out-of-process Fwlib32.dll host for OtOpcUa FOCAS driver. Crash-isolated from the main OPC UA server.' | Out-Null
|
||||
& $NssmPath set $ServiceName ObjectName $ServiceAccount | Out-Null
|
||||
& $NssmPath set $ServiceName Start SERVICE_AUTO_START | Out-Null
|
||||
& $NssmPath set $ServiceName AppStdout (Join-Path $env:ProgramData 'OtOpcUa\focas-host-stdout.log') | Out-Null
|
||||
& $NssmPath set $ServiceName AppStderr (Join-Path $env:ProgramData 'OtOpcUa\focas-host-stderr.log') | Out-Null
|
||||
& $NssmPath set $ServiceName AppRotateFiles 1 | Out-Null
|
||||
& $NssmPath set $ServiceName AppRotateBytes 10485760 | Out-Null
|
||||
|
||||
& $NssmPath set $ServiceName AppEnvironmentExtra `
|
||||
"OTOPCUA_FOCAS_PIPE=$FocasPipeName" `
|
||||
"OTOPCUA_ALLOWED_SID=$allowedSid" `
|
||||
"OTOPCUA_FOCAS_SECRET=$FocasSharedSecret" `
|
||||
"OTOPCUA_FOCAS_BACKEND=$FocasBackend" | Out-Null
|
||||
|
||||
& $NssmPath set $ServiceName DependOnService OtOpcUa | Out-Null
|
||||
|
||||
Write-Host "Installed '$ServiceName' under '$ServiceAccount' (SID=$allowedSid)."
|
||||
Write-Host "Pipe: \\.\pipe\$FocasPipeName Backend: $FocasBackend"
|
||||
Write-Host "Start the service with: Start-Service $ServiceName"
|
||||
Write-Host ""
|
||||
Write-Host "NOTE: the Fwlib32 backend requires the licensed Fwlib32.dll on PATH"
|
||||
Write-Host "alongside the Host exe. See docs/v2/focas-deployment.md."
|
||||
27
scripts/install/Uninstall-FocasHost.ps1
Normal file
@@ -0,0 +1,27 @@
|
||||
<#
|
||||
.SYNOPSIS
|
||||
Removes the OtOpcUaFocasHost Windows service.
|
||||
|
||||
.DESCRIPTION
|
||||
Companion to Install-FocasHost.ps1. Stops + unregisters the service via NSSM.
|
||||
Idempotent — succeeds silently if the service doesn't exist.
|
||||
|
||||
.EXAMPLE
|
||||
.\Uninstall-FocasHost.ps1
|
||||
#>
|
||||
[CmdletBinding()]
|
||||
param(
|
||||
[string]$ServiceName = 'OtOpcUaFocasHost',
|
||||
[string]$NssmPath = 'C:\Program Files\nssm\nssm.exe'
|
||||
)
|
||||
|
||||
$ErrorActionPreference = 'Stop'
|
||||
|
||||
$svc = Get-Service -Name $ServiceName -ErrorAction SilentlyContinue
|
||||
if (-not $svc) { Write-Host "Service '$ServiceName' not present — nothing to do."; return }
|
||||
|
||||
if (-not (Test-Path $NssmPath)) { throw "nssm.exe not found at '$NssmPath'." }
|
||||
|
||||
& $NssmPath stop $ServiceName confirm | Out-Null
|
||||
& $NssmPath remove $ServiceName confirm | Out-Null
|
||||
Write-Host "Removed '$ServiceName'."
|
||||
@@ -0,0 +1,133 @@
|
||||
using System;
|
||||
using System.IO;
|
||||
using System.IO.MemoryMappedFiles;
|
||||
using System.Text;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Host.Stability;
|
||||
|
||||
/// <summary>
|
||||
/// Ring-buffer of the last N IPC operations, written into a memory-mapped file. On a
|
||||
/// hard crash the Proxy-side supervisor reads the MMF after the corpse is gone to see
|
||||
/// what was in flight at the moment the Host died. Single-writer (the Host), multi-reader
|
||||
/// (the supervisor). The file layout matches the Galaxy Tier-C <c>PostMortemMmf</c>
/// apart from the magic value, so a single reader tool can handle both.
|
||||
/// </summary>
|
||||
/// <remarks>
|
||||
/// File layout:
|
||||
/// <code>
|
||||
/// [16-byte header: magic(4) | version(4) | capacity(4) | writeIndex(4)]
|
||||
/// [capacity × 256-byte entries: each is [8-byte utcUnixMs | 8-byte opKind | 240-byte UTF-8 message]]
|
||||
/// </code>
|
||||
/// Magic is 'OFPC' (0x4F46_5043) to distinguish a FOCAS file from the Galaxy MMF.
|
||||
/// </remarks>
|
||||
public sealed class PostMortemMmf : IDisposable
|
||||
{
|
||||
private const int Magic = 0x4F465043; // 'OFPC'
|
||||
private const int Version = 1;
|
||||
private const int HeaderBytes = 16;
|
||||
public const int EntryBytes = 256;
|
||||
private const int MessageOffset = 16;
|
||||
private const int MessageCapacity = EntryBytes - MessageOffset;
|
||||
|
||||
public int Capacity { get; }
|
||||
public string Path { get; }
|
||||
|
||||
private readonly MemoryMappedFile _mmf;
|
||||
private readonly MemoryMappedViewAccessor _accessor;
|
||||
private readonly object _writeGate = new();
|
||||
|
||||
public PostMortemMmf(string path, int capacity = 1000)
|
||||
{
|
||||
if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
|
||||
Capacity = capacity;
|
||||
Path = path;
|
||||
|
||||
var fileBytes = HeaderBytes + capacity * EntryBytes;
|
||||
Directory.CreateDirectory(System.IO.Path.GetDirectoryName(path)!);
|
||||
|
||||
var fs = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.Read);
|
||||
fs.SetLength(fileBytes);
|
||||
_mmf = MemoryMappedFile.CreateFromFile(fs, null, fileBytes,
|
||||
MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, leaveOpen: false);
|
||||
_accessor = _mmf.CreateViewAccessor(0, fileBytes, MemoryMappedFileAccess.ReadWrite);
|
||||
|
||||
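// Initialize the header on a fresh file, or when an existing file was sized for a
// different capacity (a stale writeIndex could otherwise point past the new end).
if (_accessor.ReadInt32(0) != Magic || _accessor.ReadInt32(8) != capacity)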
|
||||
{
|
||||
_accessor.Write(0, Magic);
|
||||
_accessor.Write(4, Version);
|
||||
_accessor.Write(8, capacity);
|
||||
_accessor.Write(12, 0);
|
||||
}
|
||||
}
|
||||
|
||||
public void Write(long opKind, string message)
|
||||
{
|
||||
lock (_writeGate)
|
||||
{
|
||||
var idx = _accessor.ReadInt32(12);
|
||||
var offset = HeaderBytes + idx * EntryBytes;
|
||||
|
||||
_accessor.Write(offset + 0, DateTimeOffset.UtcNow.ToUnixTimeMilliseconds());
|
||||
_accessor.Write(offset + 8, opKind);
|
||||
|
||||
var msgBytes = Encoding.UTF8.GetBytes(message ?? string.Empty);
|
||||
var copy = Math.Min(msgBytes.Length, MessageCapacity - 1);
|
||||
_accessor.WriteArray(offset + MessageOffset, msgBytes, 0, copy);
|
||||
_accessor.Write(offset + MessageOffset + copy, (byte)0);
|
||||
|
||||
var next = (idx + 1) % Capacity;
|
||||
_accessor.Write(12, next);
|
||||
}
|
||||
}
|
||||
|
||||
public PostMortemEntry[] ReadAll()
|
||||
{
|
||||
var magic = _accessor.ReadInt32(0);
|
||||
if (magic != Magic) return new PostMortemEntry[0];
|
||||
|
||||
var capacity = _accessor.ReadInt32(8);
|
||||
var writeIndex = _accessor.ReadInt32(12);
|
||||
|
||||
var entries = new PostMortemEntry[capacity];
|
||||
var count = 0;
|
||||
for (var i = 0; i < capacity; i++)
|
||||
{
|
||||
var slot = (writeIndex + i) % capacity;
|
||||
var offset = HeaderBytes + slot * EntryBytes;
|
||||
|
||||
var ts = _accessor.ReadInt64(offset + 0);
|
||||
if (ts == 0) continue;
|
||||
|
||||
var op = _accessor.ReadInt64(offset + 8);
|
||||
var msgBuf = new byte[MessageCapacity];
|
||||
_accessor.ReadArray(offset + MessageOffset, msgBuf, 0, MessageCapacity);
|
||||
var nulTerm = Array.IndexOf<byte>(msgBuf, 0);
|
||||
var msg = Encoding.UTF8.GetString(msgBuf, 0, nulTerm < 0 ? MessageCapacity : nulTerm);
|
||||
|
||||
entries[count++] = new PostMortemEntry(ts, op, msg);
|
||||
}
|
||||
|
||||
Array.Resize(ref entries, count);
|
||||
return entries;
|
||||
}
|
||||
|
||||
public void Dispose()
|
||||
{
|
||||
_accessor.Dispose();
|
||||
_mmf.Dispose();
|
||||
}
|
||||
}
|
||||
|
||||
public readonly struct PostMortemEntry
|
||||
{
|
||||
public long UtcUnixMs { get; }
|
||||
public long OpKind { get; }
|
||||
public string Message { get; }
|
||||
|
||||
public PostMortemEntry(long utcUnixMs, long opKind, string message)
|
||||
{
|
||||
UtcUnixMs = utcUnixMs;
|
||||
OpKind = opKind;
|
||||
Message = message;
|
||||
}
|
||||
}
|
||||
30
src/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/Supervisor/Backoff.cs
Normal file
@@ -0,0 +1,30 @@
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Supervisor;
|
||||
|
||||
/// <summary>
|
||||
/// Respawn-with-backoff schedule for the FOCAS Host process. Matches Galaxy Tier-C:
|
||||
/// 5s → 15s → 60s cap. A sustained stable run (default 2 min) resets the index so a
|
||||
/// one-off crash after hours of steady-state doesn't start from the top of the ladder.
|
||||
/// </summary>
|
||||
public sealed class Backoff
|
||||
{
|
||||
public static TimeSpan[] DefaultSequence { get; } =
|
||||
[TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(60)];
|
||||
|
||||
public TimeSpan StableRunThreshold { get; init; } = TimeSpan.FromMinutes(2);
|
||||
|
||||
private readonly TimeSpan[] _sequence;
|
||||
private int _index;
|
||||
|
||||
public Backoff(TimeSpan[]? sequence = null) => _sequence = sequence ?? DefaultSequence;
|
||||
|
||||
public TimeSpan Next()
|
||||
{
|
||||
var delay = _sequence[Math.Min(_index, _sequence.Length - 1)];
|
||||
_index++;
|
||||
return delay;
|
||||
}
|
||||
|
||||
public void RecordStableRun() => _index = 0;
|
||||
|
||||
public int AttemptIndex => _index;
|
||||
}
|
||||
@@ -0,0 +1,69 @@
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Supervisor;
|
||||
|
||||
/// <summary>
|
||||
/// Crash-loop circuit breaker for the FOCAS Host. Matches Galaxy Tier-C defaults:
|
||||
/// up to 3 crashes within 5 minutes are tolerated; a 4th inside the window opens the
/// breaker. Cooldown escalates 1h → 4h → manual
|
||||
/// reset. A sticky alert stays live until the operator explicitly clears it so
|
||||
/// recurring crashes can't silently burn through the cooldown ladder overnight.
|
||||
/// </summary>
|
||||
public sealed class CircuitBreaker
|
||||
{
|
||||
public int CrashesAllowedPerWindow { get; init; } = 3;
|
||||
public TimeSpan Window { get; init; } = TimeSpan.FromMinutes(5);
|
||||
|
||||
public TimeSpan[] CooldownEscalation { get; init; } =
|
||||
[TimeSpan.FromHours(1), TimeSpan.FromHours(4), TimeSpan.MaxValue];
|
||||
|
||||
private readonly List<DateTime> _crashesUtc = [];
|
||||
private DateTime? _openSinceUtc;
|
||||
private int _escalationLevel;
|
||||
public bool StickyAlertActive { get; private set; }
|
||||
|
||||
/// <summary>
|
||||
/// Records a crash + returns <c>true</c> if the supervisor may respawn. On
|
||||
/// <c>false</c>, <paramref name="cooldownRemaining"/> is how long to wait before
|
||||
/// trying again (<c>TimeSpan.MaxValue</c> means manual reset required).
|
||||
/// </summary>
|
||||
public bool TryRecordCrash(DateTime utcNow, out TimeSpan cooldownRemaining)
|
||||
{
|
||||
if (_openSinceUtc is { } openedAt)
|
||||
{
|
||||
var cooldown = CooldownEscalation[Math.Min(_escalationLevel, CooldownEscalation.Length - 1)];
|
||||
if (cooldown == TimeSpan.MaxValue)
|
||||
{
|
||||
cooldownRemaining = TimeSpan.MaxValue;
|
||||
return false;
|
||||
}
|
||||
if (utcNow - openedAt < cooldown)
|
||||
{
|
||||
cooldownRemaining = cooldown - (utcNow - openedAt);
|
||||
return false;
|
||||
}
|
||||
|
||||
_openSinceUtc = null;
|
||||
_escalationLevel++;
|
||||
}
|
||||
|
||||
_crashesUtc.RemoveAll(t => utcNow - t > Window);
|
||||
_crashesUtc.Add(utcNow);
|
||||
|
||||
if (_crashesUtc.Count > CrashesAllowedPerWindow)
|
||||
{
|
||||
_openSinceUtc = utcNow;
|
||||
StickyAlertActive = true;
|
||||
cooldownRemaining = CooldownEscalation[Math.Min(_escalationLevel, CooldownEscalation.Length - 1)];
|
||||
return false;
|
||||
}
|
||||
|
||||
cooldownRemaining = TimeSpan.Zero;
|
||||
return true;
|
||||
}
|
||||
|
||||
public void ManualReset()
|
||||
{
|
||||
_crashesUtc.Clear();
|
||||
_openSinceUtc = null;
|
||||
_escalationLevel = 0;
|
||||
StickyAlertActive = false;
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,159 @@
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Supervisor;
|
||||
|
||||
/// <summary>
|
||||
/// Ties <see cref="IHostProcessLauncher"/> + <see cref="Backoff"/> +
|
||||
/// <see cref="CircuitBreaker"/> + <see cref="HeartbeatMonitor"/> into one object the
|
||||
/// driver asks for <c>IFocasClient</c>s. On a detected crash (process exit or
|
||||
/// heartbeat loss) the supervisor fans out <c>BadCommunicationError</c> to all
|
||||
/// subscribers via the <see cref="OnUnavailable"/> callback, then respawns with
|
||||
/// backoff unless the breaker is open.
|
||||
/// </summary>
|
||||
/// <remarks>
|
||||
/// The supervisor itself is I/O-free — it doesn't know how to spawn processes, probe
|
||||
/// pipes, or send heartbeats. Production wires the concrete
|
||||
/// <see cref="IHostProcessLauncher"/> over <c>FocasIpcClient</c> + <c>Process</c>;
|
||||
/// tests drive the same state machine with a deterministic launcher stub.
|
||||
/// </remarks>
|
||||
public sealed class FocasHostSupervisor : IDisposable
|
||||
{
|
||||
private readonly IHostProcessLauncher _launcher;
|
||||
private readonly Backoff _backoff;
|
||||
private readonly CircuitBreaker _breaker;
|
||||
private readonly Func<DateTime> _clock;
|
||||
private IFocasClient? _current;
|
||||
private DateTime _currentStartedUtc;
|
||||
private bool _disposed;
|
||||
|
||||
public FocasHostSupervisor(
|
||||
IHostProcessLauncher launcher,
|
||||
Backoff? backoff = null,
|
||||
CircuitBreaker? breaker = null,
|
||||
Func<DateTime>? clock = null)
|
||||
{
|
||||
_launcher = launcher ?? throw new ArgumentNullException(nameof(launcher));
|
||||
_backoff = backoff ?? new Backoff();
|
||||
_breaker = breaker ?? new CircuitBreaker();
|
||||
_clock = clock ?? (() => DateTime.UtcNow);
|
||||
}
|
||||
|
||||
/// <summary>Raised with a short reason string whenever the Host goes unavailable (crash / heartbeat loss / breaker-open).</summary>
|
||||
public event Action<string>? OnUnavailable;
|
||||
|
||||
/// <summary>Crash count observed in the current process lifetime. Exposed for /hosts Admin telemetry.</summary>
|
||||
public int ObservedCrashes { get; private set; }
|
||||
|
||||
/// <summary><c>true</c> if the crash-loop breaker has latched a sticky alert that needs operator reset.</summary>
|
||||
public bool StickyAlertActive => _breaker.StickyAlertActive;
|
||||
|
||||
public int BackoffAttempt => _backoff.AttemptIndex;
|
||||
|
||||
/// <summary>
|
||||
/// Returns the current live client. If none, tries to launch — applying the
|
||||
/// backoff schedule between attempts and stopping once the breaker opens.
|
||||
/// </summary>
|
||||
public async Task<IFocasClient> GetOrLaunchAsync(CancellationToken ct)
|
||||
{
|
||||
ThrowIfDisposed();
|
||||
if (_current is not null && _launcher.IsProcessAlive) return _current;
|
||||
|
||||
return await LaunchWithBackoffAsync(ct).ConfigureAwait(false);
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
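/// Called by the heartbeat task each time the miss threshold is crossed.
/// Treated as a crash: fans out Bad status and records the crash with the breaker;
/// the respawn itself is deferred to the next <see cref="GetOrLaunchAsync"/> call.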
|
||||
/// </summary>
|
||||
public async Task NotifyHostDeadAsync(string reason, CancellationToken ct)
|
||||
{
|
||||
ThrowIfDisposed();
|
||||
OnUnavailable?.Invoke(reason);
|
||||
ObservedCrashes++;
|
||||
try { await _launcher.TerminateAsync(ct).ConfigureAwait(false); }
|
||||
catch { /* best effort */ }
|
||||
_current?.Dispose();
|
||||
_current = null;
|
||||
|
||||
if (!_breaker.TryRecordCrash(_clock(), out var cooldown))
|
||||
{
|
||||
OnUnavailable?.Invoke(cooldown == TimeSpan.MaxValue
|
||||
? "circuit-breaker-open-manual-reset-required"
|
||||
: $"circuit-breaker-open-cooldown-{cooldown:g}");
|
||||
return;
|
||||
}
|
||||
// Successful crash recording — do not respawn synchronously; GetOrLaunchAsync will
|
||||
// pick up the attempt on the next call. Keeps the fan-out fast.
|
||||
}
|
||||
|
||||
/// <summary>Operator action — clear the sticky alert + reset the breaker.</summary>
|
||||
public void AcknowledgeAndReset()
|
||||
{
|
||||
_breaker.ManualReset();
|
||||
_backoff.RecordStableRun();
|
||||
}
|
||||
|
||||
private async Task<IFocasClient> LaunchWithBackoffAsync(CancellationToken ct)
|
||||
{
|
||||
while (true)
|
||||
{
|
||||
if (_breaker.StickyAlertActive)
|
||||
{
|
||||
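// Any refusal from the breaker means it is still open. Launching during a timed
// cooldown would defeat it, so throw for both the timed and manual-reset cases.
if (!_breaker.TryRecordCrash(_clock(), out var cooldown))
throw new InvalidOperationException(cooldown == TimeSpan.MaxValue
? "FOCAS Host circuit breaker is open and awaiting manual reset. " +
"See Admin /hosts; call AcknowledgeAndReset after investigating the Host log."
: $"FOCAS Host circuit breaker is open; retry after the remaining cooldown ({cooldown:g}).");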
|
||||
}
|
||||
|
||||
try
|
||||
{
|
||||
_current = await _launcher.LaunchAsync(ct).ConfigureAwait(false);
|
||||
_currentStartedUtc = _clock();
|
||||
|
||||
// _currentStartedUtc was captured on the line above, so with a real clock this
// check is effectively a no-op; the ladder is normally reset via NotifyStableRun.
|
||||
if (_clock() - _currentStartedUtc >= _backoff.StableRunThreshold)
|
||||
_backoff.RecordStableRun();
|
||||
|
||||
return _current;
|
||||
}
|
||||
catch (Exception ex) when (ex is not OperationCanceledException)
|
||||
{
|
||||
OnUnavailable?.Invoke($"launch-failed: {ex.Message}");
|
||||
ObservedCrashes++;
|
||||
if (!_breaker.TryRecordCrash(_clock(), out var cooldown))
|
||||
{
|
||||
var hint = cooldown == TimeSpan.MaxValue
|
||||
? "manual reset required"
|
||||
: $"cooldown {cooldown:g}";
|
||||
throw new InvalidOperationException(
|
||||
$"FOCAS Host circuit breaker opened after {ObservedCrashes} crashes — {hint}.", ex);
|
||||
}
|
||||
|
||||
var delay = _backoff.Next();
|
||||
await Task.Delay(delay, ct).ConfigureAwait(false);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// <summary>Called from the heartbeat loop after a successful ack run — relaxes the backoff ladder.</summary>
|
||||
public void NotifyStableRun()
|
||||
{
|
||||
if (_current is null) return;
|
||||
if (_clock() - _currentStartedUtc >= _backoff.StableRunThreshold)
|
||||
_backoff.RecordStableRun();
|
||||
}
|
||||
|
||||
public void Dispose()
|
||||
{
|
||||
if (_disposed) return;
|
||||
_disposed = true;
|
||||
try { _launcher.TerminateAsync(CancellationToken.None).GetAwaiter().GetResult(); }
|
||||
catch { /* best effort */ }
|
||||
_current?.Dispose();
|
||||
_current = null;
|
||||
}
|
||||
|
||||
private void ThrowIfDisposed()
|
||||
{
|
||||
if (_disposed) throw new ObjectDisposedException(nameof(FocasHostSupervisor));
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,29 @@
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Supervisor;
|
||||
|
||||
/// <summary>
|
||||
/// Tracks missed heartbeats from the FOCAS Host. 2s cadence + 3 consecutive misses =
|
||||
/// host declared dead (~6s detection). Same defaults as Galaxy Tier-C so operators
|
||||
/// see the same cadence across hosts on the /hosts Admin page.
|
||||
/// </summary>
|
||||
public sealed class HeartbeatMonitor
|
||||
{
|
||||
public int MissesUntilDead { get; init; } = 3;
|
||||
|
||||
public TimeSpan Cadence { get; init; } = TimeSpan.FromSeconds(2);
|
||||
|
||||
public int ConsecutiveMisses { get; private set; }
|
||||
public DateTime? LastAckUtc { get; private set; }
|
||||
|
||||
public void RecordAck(DateTime utcNow)
|
||||
{
|
||||
ConsecutiveMisses = 0;
|
||||
LastAckUtc = utcNow;
|
||||
}
|
||||
|
||||
/// <summary>Records a missed heartbeat; returns <c>true</c> when the death threshold is crossed.</summary>
|
||||
public bool RecordMiss()
|
||||
{
|
||||
ConsecutiveMisses++;
|
||||
return ConsecutiveMisses >= MissesUntilDead;
|
||||
}
|
||||
}
|
||||
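The monitor above is I/O-free, so something else has to drive it. The following is a minimal sketch of one way a heartbeat loop could do that, not the loop shipped in this PR. `HeartbeatLoop`, `pingAsync`, and `onDeadAsync` are hypothetical names introduced for illustration, and the `HeartbeatMonitor` class is a trimmed copy of the one in the diff so that the block compiles on its own.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Trimmed copy of the HeartbeatMonitor from the diff, repeated so the sketch is self-contained.
public sealed class HeartbeatMonitor
{
    public int MissesUntilDead { get; init; } = 3;
    public TimeSpan Cadence { get; init; } = TimeSpan.FromSeconds(2);
    public int ConsecutiveMisses { get; private set; }
    public DateTime? LastAckUtc { get; private set; }

    public void RecordAck(DateTime utcNow) { ConsecutiveMisses = 0; LastAckUtc = utcNow; }
    public bool RecordMiss() => ++ConsecutiveMisses >= MissesUntilDead;
}

// Hypothetical driver (an assumption, not part of the PR): pings the Host at the
// monitor's cadence and reports a dead Host once the miss threshold is crossed.
public static class HeartbeatLoop
{
    public static async Task RunAsync(
        HeartbeatMonitor monitor,
        Func<CancellationToken, Task<bool>> pingAsync,        // assumed: true when the Host acks
        Func<string, CancellationToken, Task> onDeadAsync,    // assumed: e.g. supervisor.NotifyHostDeadAsync
        CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            bool acked;
            try { acked = await pingAsync(ct).ConfigureAwait(false); }
            catch (Exception ex) when (ex is not OperationCanceledException) { acked = false; }

            if (acked)
            {
                monitor.RecordAck(DateTime.UtcNow);
            }
            else if (monitor.RecordMiss())
            {
                // Threshold crossed: hand over to the supervisor and stop this loop.
                await onDeadAsync("heartbeat-loss", ct).ConfigureAwait(false);
                return;
            }

            // Throws OperationCanceledException on cancellation, which ends the loop.
            await Task.Delay(monitor.Cadence, ct).ConfigureAwait(false);
        }
    }
}
```

In this sketch the loop returns after reporting a dead Host, leaving it to the caller to start a fresh loop once a new Host process is up. A real implementation might instead keep one loop alive across respawns.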
@@ -0,0 +1,32 @@
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Supervisor;
|
||||
|
||||
/// <summary>
|
||||
/// Abstraction over the act of spawning a FOCAS Host process and obtaining an
|
||||
/// <see cref="IFocasClient"/> connected to it. Production wires this to a real
|
||||
/// <c>Process.Start</c> + <c>FocasIpcClient.ConnectAsync</c>; tests use a fake that
|
||||
/// exposes deterministic failure modes so the supervisor logic can be stressed
|
||||
/// without spawning actual exes.
|
||||
/// </summary>
|
||||
public interface IHostProcessLauncher
|
||||
{
|
||||
/// <summary>
|
||||
/// Spawn a new Host process (if one isn't already running) and return a live
|
||||
/// client session. Throws on unrecoverable errors; transient errors (e.g. Host
|
||||
/// not ready yet) should throw <see cref="TimeoutException"/> so the supervisor
|
||||
/// applies the backoff ladder.
|
||||
/// </summary>
|
||||
Task<IFocasClient> LaunchAsync(CancellationToken ct);
|
||||
|
||||
/// <summary>
|
||||
/// Terminate the Host process if one is running. Called on Dispose and after a
|
||||
/// heartbeat loss is detected.
|
||||
/// </summary>
|
||||
Task TerminateAsync(CancellationToken ct);
|
||||
|
||||
/// <summary>
|
||||
/// <c>true</c> when the most recently spawned Host process is still alive.
|
||||
/// Supervisor polls this at heartbeat cadence; going <c>false</c> without a
|
||||
/// clean shutdown counts as a crash.
|
||||
/// </summary>
|
||||
bool IsProcessAlive { get; }
|
||||
}
|
||||
@@ -0,0 +1,57 @@
|
||||
using System.IO.MemoryMappedFiles;
|
||||
using System.Text;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Supervisor;
|
||||
|
||||
/// <summary>
|
||||
/// Proxy-side reader for the Host's post-mortem MMF. After a Host crash the supervisor
|
||||
/// opens the file (which persists beyond the process lifetime) and enumerates the most
/// recent IPC operations recorded before the crash (up to the ring capacity), oldest first. Format matches
|
||||
/// <c>Driver.FOCAS.Host.Stability.PostMortemMmf</c> — magic 'OFPC' / 256-byte entries.
|
||||
/// </summary>
|
||||
public sealed class PostMortemReader
|
||||
{
|
||||
private const int Magic = 0x4F465043; // 'OFPC'
|
||||
private const int HeaderBytes = 16;
|
||||
private const int EntryBytes = 256;
|
||||
private const int MessageOffset = 16;
|
||||
private const int MessageCapacity = EntryBytes - MessageOffset;
|
||||
|
||||
public string Path { get; }
|
||||
|
||||
public PostMortemReader(string path) => Path = path;
|
||||
|
||||
public PostMortemEntry[] ReadAll()
|
||||
{
|
||||
if (!File.Exists(Path)) return [];
|
||||
|
||||
using var mmf = MemoryMappedFile.CreateFromFile(Path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
|
||||
using var accessor = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);
|
||||
|
||||
if (accessor.ReadInt32(0) != Magic) return [];
|
||||
|
||||
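var capacity = accessor.ReadInt32(8);
var writeIndex = accessor.ReadInt32(12);
// The file may have been torn by the crash; reject header values that would make
// the ring walk divide by zero or index outside the ring.
if (capacity <= 0 || writeIndex < 0 || writeIndex >= capacity) return [];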
|
||||
var entries = new PostMortemEntry[capacity];
|
||||
var count = 0;
|
||||
|
||||
for (var i = 0; i < capacity; i++)
|
||||
{
|
||||
var slot = (writeIndex + i) % capacity;
|
||||
var offset = HeaderBytes + slot * EntryBytes;
|
||||
var ts = accessor.ReadInt64(offset + 0);
|
||||
if (ts == 0) continue;
|
||||
var op = accessor.ReadInt64(offset + 8);
|
||||
var msgBuf = new byte[MessageCapacity];
|
||||
accessor.ReadArray(offset + MessageOffset, msgBuf, 0, MessageCapacity);
|
||||
var nulTerm = Array.IndexOf<byte>(msgBuf, 0);
|
||||
var msg = Encoding.UTF8.GetString(msgBuf, 0, nulTerm < 0 ? MessageCapacity : nulTerm);
|
||||
entries[count++] = new PostMortemEntry(ts, op, msg);
|
||||
}
|
||||
|
||||
Array.Resize(ref entries, count);
|
||||
return entries;
|
||||
}
|
||||
}
|
||||
|
||||
public readonly record struct PostMortemEntry(long UtcUnixMs, long OpKind, string Message);
|
||||
@@ -0,0 +1,113 @@
|
||||
using System.Diagnostics;
|
||||
using ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Ipc;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Supervisor;
|
||||
|
||||
/// <summary>
|
||||
/// Production <see cref="IHostProcessLauncher"/>. Spawns <c>OtOpcUa.Driver.FOCAS.Host.exe</c>
|
||||
/// with the pipe name / allowed-SID / per-spawn shared secret in the environment, waits for
|
||||
/// the pipe to come up, then connects a <see cref="FocasIpcClient"/> and wraps it in an
|
||||
/// <see cref="IpcFocasClient"/>. On <see cref="TerminateAsync"/> best-effort kills the
|
||||
/// process and closes the IPC stream.
|
||||
/// </summary>
|
||||
public sealed class ProcessHostLauncher : IHostProcessLauncher
|
||||
{
|
||||
private readonly ProcessHostLauncherOptions _options;
|
||||
private Process? _process;
|
||||
private FocasIpcClient? _ipc;
|
||||
|
||||
public ProcessHostLauncher(ProcessHostLauncherOptions options)
|
||||
{
|
||||
_options = options ?? throw new ArgumentNullException(nameof(options));
|
||||
}
|
||||
|
||||
public bool IsProcessAlive => _process is { HasExited: false };
|
||||
|
||||
public async Task<IFocasClient> LaunchAsync(CancellationToken ct)
|
||||
{
|
||||
await TerminateAsync(ct).ConfigureAwait(false);
|
||||
|
||||
var secret = _options.SharedSecret ?? Guid.NewGuid().ToString("N");
|
||||
|
||||
var psi = new ProcessStartInfo
|
||||
{
|
||||
FileName = _options.HostExePath,
|
||||
Arguments = _options.Arguments ?? string.Empty,
|
||||
UseShellExecute = false,
|
||||
CreateNoWindow = true,
|
||||
};
|
||||
psi.Environment["OTOPCUA_FOCAS_PIPE"] = _options.PipeName;
|
||||
psi.Environment["OTOPCUA_ALLOWED_SID"] = _options.AllowedSid;
|
||||
psi.Environment["OTOPCUA_FOCAS_SECRET"] = secret;
|
||||
psi.Environment["OTOPCUA_FOCAS_BACKEND"] = _options.Backend;
|
||||
|
||||
_process = Process.Start(psi)
|
||||
?? throw new InvalidOperationException($"Failed to start {_options.HostExePath}");
|
||||
|
||||
// Poll for pipe readiness up to the configured connect timeout.
|
||||
var deadline = DateTime.UtcNow + _options.ConnectTimeout;
|
||||
while (true)
|
||||
{
|
||||
ct.ThrowIfCancellationRequested();
|
||||
if (_process.HasExited)
|
||||
throw new InvalidOperationException(
|
||||
$"FOCAS Host exited before pipe was ready (ExitCode={_process.ExitCode}).");
|
||||
|
||||
try
|
||||
{
|
||||
_ipc = await FocasIpcClient.ConnectAsync(
|
||||
_options.PipeName, secret, TimeSpan.FromSeconds(1), ct).ConfigureAwait(false);
|
||||
break;
|
||||
}
|
||||
catch (TimeoutException)
|
||||
{
|
||||
if (DateTime.UtcNow >= deadline)
|
||||
throw new TimeoutException(
|
||||
$"FOCAS Host pipe {_options.PipeName} did not come up within {_options.ConnectTimeout:g}.");
|
||||
await Task.Delay(TimeSpan.FromMilliseconds(250), ct).ConfigureAwait(false);
|
||||
}
|
||||
}
|
||||
|
||||
return new IpcFocasClient(_ipc, _options.Series);
|
||||
}
|
||||
|
||||
public async Task TerminateAsync(CancellationToken ct)
|
||||
{
|
||||
if (_ipc is not null)
|
||||
{
|
||||
try { await _ipc.DisposeAsync().ConfigureAwait(false); }
|
||||
catch { /* best effort */ }
|
||||
_ipc = null;
|
||||
}
|
||||
|
||||
if (_process is not null)
|
||||
{
|
||||
try
|
||||
{
|
||||
if (!_process.HasExited)
|
||||
{
|
||||
_process.Kill(entireProcessTree: true);
|
||||
await _process.WaitForExitAsync(ct).ConfigureAwait(false);
|
||||
}
|
||||
}
|
||||
catch { /* best effort */ }
|
||||
finally
|
||||
{
|
||||
_process.Dispose();
|
||||
_process = null;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
public sealed record ProcessHostLauncherOptions(
|
||||
string HostExePath,
|
||||
string PipeName,
|
||||
string AllowedSid)
|
||||
{
|
||||
public string? SharedSecret { get; init; }
|
||||
public string? Arguments { get; init; }
|
||||
public string Backend { get; init; } = "fwlib32";
|
||||
public TimeSpan ConnectTimeout { get; init; } = TimeSpan.FromSeconds(15);
|
||||
public FocasCncSeries Series { get; init; } = FocasCncSeries.Unknown;
|
||||
}
|
||||
@@ -0,0 +1,86 @@
|
||||
using System;
|
||||
using System.IO;
|
||||
using Shouldly;
|
||||
using Xunit;
|
||||
using ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Host.Stability;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Host.Tests
|
||||
{
|
||||
[Trait("Category", "Unit")]
|
||||
public sealed class PostMortemMmfTests : IDisposable
|
||||
{
|
||||
private readonly string _tempPath;
|
||||
|
||||
public PostMortemMmfTests()
|
||||
{
|
||||
_tempPath = Path.Combine(Path.GetTempPath(), $"focas-mmf-{Guid.NewGuid():N}.bin");
|
||||
}
|
||||
|
||||
public void Dispose()
|
||||
{
|
||||
if (File.Exists(_tempPath)) File.Delete(_tempPath);
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void Write_and_read_preserve_order_and_content()
|
||||
{
|
||||
using (var mmf = new PostMortemMmf(_tempPath, capacity: 10))
|
||||
{
|
||||
mmf.Write(opKind: 1, "read R100");
|
||||
mmf.Write(opKind: 2, "write MACRO:500 = 3.14");
|
||||
mmf.Write(opKind: 3, "probe ok");
|
||||
}
|
||||
|
||||
// Reopen (simulating a reader after the writer crashed).
|
||||
using var reader = new PostMortemMmf(_tempPath, capacity: 10);
|
||||
var entries = reader.ReadAll();
|
||||
entries.Length.ShouldBe(3);
|
||||
entries[0].OpKind.ShouldBe(1L);
|
||||
entries[0].Message.ShouldBe("read R100");
|
||||
entries[1].OpKind.ShouldBe(2L);
|
||||
entries[2].Message.ShouldBe("probe ok");
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void Ring_buffer_wraps_at_capacity()
|
||||
{
|
||||
using var mmf = new PostMortemMmf(_tempPath, capacity: 3);
|
||||
for (var i = 0; i < 10; i++) mmf.Write(i, $"op-{i}");
|
||||
|
||||
var entries = mmf.ReadAll();
|
||||
entries.Length.ShouldBe(3);
|
||||
// Oldest surviving entry is op-7 (entries 7,8,9 survive in FIFO order).
|
||||
entries[0].Message.ShouldBe("op-7");
|
||||
entries[1].Message.ShouldBe("op-8");
|
||||
entries[2].Message.ShouldBe("op-9");
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void Truncated_message_is_null_terminated_and_does_not_overflow()
|
||||
{
|
||||
using var mmf = new PostMortemMmf(_tempPath, capacity: 4);
|
||||
var big = new string('x', 500); // longer than the 240-byte message capacity
|
||||
mmf.Write(42, big);
|
||||
|
||||
var entries = mmf.ReadAll();
|
||||
entries.Length.ShouldBe(1);
|
||||
entries[0].Message.Length.ShouldBeLessThanOrEqualTo(240);
|
||||
entries[0].OpKind.ShouldBe(42L);
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void Reopening_with_existing_data_preserves_entries()
|
||||
{
|
||||
using (var first = new PostMortemMmf(_tempPath, capacity: 5))
|
||||
{
|
||||
first.Write(1, "first-run-1");
|
||||
first.Write(2, "first-run-2");
|
||||
}
|
||||
|
||||
using var second = new PostMortemMmf(_tempPath, capacity: 5);
|
||||
var entries = second.ReadAll();
|
||||
entries.Length.ShouldBe(2);
|
||||
entries[0].Message.ShouldBe("first-run-1");
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,84 @@
|
||||
using System.IO.MemoryMappedFiles;
|
||||
using System.Text;
|
||||
using Shouldly;
|
||||
using Xunit;
|
||||
using ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Supervisor;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests;
|
||||
|
||||
/// <summary>
|
||||
/// The Proxy-side <see cref="PostMortemReader"/> must read the Host's MMF format
|
||||
/// (magic 'OFPC', 256-byte entries). This test writes a hand-crafted file that mimics
|
||||
/// the Host's layout exactly + asserts the reader decodes it correctly. Keeps the two
|
||||
/// codebases in lockstep on the wire format without needing to reference the net48
|
||||
/// Host assembly from the net10 test project.
|
||||
/// </summary>
|
||||
[Trait("Category", "Unit")]
|
||||
public sealed class PostMortemReaderCompatibilityTests : IDisposable
|
||||
{
|
||||
private readonly string _tempPath = Path.Combine(Path.GetTempPath(), $"focas-mmf-compat-{Guid.NewGuid():N}.bin");
|
||||
|
||||
public void Dispose()
|
||||
{
|
||||
if (File.Exists(_tempPath)) File.Delete(_tempPath);
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void Reader_parses_host_format_and_returns_entries_in_oldest_first_order()
|
||||
{
|
||||
const int magic = 0x4F465043;
|
||||
const int capacity = 5;
|
||||
const int headerBytes = 16;
|
||||
const int entryBytes = 256;
|
||||
const int messageOffset = 16;
|
||||
var fileBytes = headerBytes + capacity * entryBytes;
|
||||
|
||||
using (var fs = new FileStream(_tempPath, FileMode.CreateNew, FileAccess.ReadWrite, FileShare.Read))
|
||||
{
|
||||
fs.SetLength(fileBytes);
|
||||
using var mmf = MemoryMappedFile.CreateFromFile(fs, null, fileBytes,
|
||||
MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, leaveOpen: false);
|
||||
using var acc = mmf.CreateViewAccessor(0, fileBytes, MemoryMappedFileAccess.ReadWrite);
|
||||
acc.Write(0, magic);
|
||||
acc.Write(4, 1);
|
||||
acc.Write(8, capacity);
|
||||
acc.Write(12, 2); // writeIndex — next write would land at slot 2
|
||||
|
||||
void WriteEntry(int slot, long ts, long op, string msg)
|
||||
{
|
||||
var offset = headerBytes + slot * entryBytes;
|
||||
acc.Write(offset + 0, ts);
|
||||
acc.Write(offset + 8, op);
|
||||
var bytes = Encoding.UTF8.GetBytes(msg);
|
||||
acc.WriteArray(offset + messageOffset, bytes, 0, bytes.Length);
|
||||
acc.Write(offset + messageOffset + bytes.Length, (byte)0);
|
||||
}
|
||||
|
||||
WriteEntry(0, 100, 1, "op-a");
|
||||
WriteEntry(1, 200, 2, "op-b");
|
||||
// Slots 2,3 unwritten (ts=0) — reader must skip.
|
||||
WriteEntry(4, 50, 9, "old-wrapped");
|
||||
}
|
||||
|
||||
var entries = new PostMortemReader(_tempPath).ReadAll();
|
||||
entries.Length.ShouldBe(3);
|
||||
// writeIndex=2 means the ring walk starts at slot 2, so iteration order is 2→3→4→0→1.
|
||||
// Slots 2 and 3 are empty; 4 yields "old-wrapped"; then 0="op-a", 1="op-b".
|
||||
entries[0].Message.ShouldBe("old-wrapped");
|
||||
entries[1].Message.ShouldBe("op-a");
|
||||
entries[2].Message.ShouldBe("op-b");
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void Reader_returns_empty_when_file_missing()
|
||||
{
|
||||
new PostMortemReader(_tempPath + "-does-not-exist").ReadAll().ShouldBeEmpty();
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void Reader_returns_empty_when_magic_mismatches()
|
||||
{
|
||||
File.WriteAllBytes(_tempPath, new byte[1024]);
|
||||
new PostMortemReader(_tempPath).ReadAll().ShouldBeEmpty();
|
||||
}
|
||||
}
|
||||
249
tests/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/SupervisorTests.cs
Normal file
@@ -0,0 +1,249 @@
|
||||
using Shouldly;
|
||||
using Xunit;
|
||||
using ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Supervisor;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests;
|
||||
|
||||
[Trait("Category", "Unit")]
|
||||
public sealed class BackoffTests
|
||||
{
|
||||
[Fact]
|
||||
public void Default_sequence_is_5s_15s_60s_then_clamped()
|
||||
{
|
||||
var b = new Backoff();
|
||||
b.Next().ShouldBe(TimeSpan.FromSeconds(5));
|
||||
b.Next().ShouldBe(TimeSpan.FromSeconds(15));
|
||||
b.Next().ShouldBe(TimeSpan.FromSeconds(60));
|
||||
b.Next().ShouldBe(TimeSpan.FromSeconds(60));
|
||||
b.Next().ShouldBe(TimeSpan.FromSeconds(60));
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void RecordStableRun_resets_the_ladder_to_the_start()
|
||||
{
|
||||
var b = new Backoff();
|
||||
b.Next(); b.Next();
|
||||
b.AttemptIndex.ShouldBe(2);
|
||||
b.RecordStableRun();
|
||||
b.AttemptIndex.ShouldBe(0);
|
||||
b.Next().ShouldBe(TimeSpan.FromSeconds(5));
|
||||
}
|
||||
}
|
||||
|
||||
[Trait("Category", "Unit")]
|
||||
public sealed class CircuitBreakerTests
|
||||
{
|
||||
[Fact]
|
||||
public void Allows_crashes_below_threshold()
|
||||
{
|
||||
var b = new CircuitBreaker();
|
||||
var now = DateTime.UtcNow;
|
||||
b.TryRecordCrash(now, out _).ShouldBeTrue();
|
||||
b.TryRecordCrash(now.AddSeconds(1), out _).ShouldBeTrue();
|
||||
b.TryRecordCrash(now.AddSeconds(2), out _).ShouldBeTrue();
|
||||
b.StickyAlertActive.ShouldBeFalse();
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void Opens_when_exceeding_threshold_in_window()
|
||||
{
|
||||
var b = new CircuitBreaker();
|
||||
var now = DateTime.UtcNow;
|
||||
b.TryRecordCrash(now, out _);
|
||||
b.TryRecordCrash(now.AddSeconds(1), out _);
|
||||
b.TryRecordCrash(now.AddSeconds(2), out _);
|
||||
b.TryRecordCrash(now.AddSeconds(3), out var cooldown).ShouldBeFalse();
|
||||
cooldown.ShouldBe(TimeSpan.FromHours(1));
|
||||
b.StickyAlertActive.ShouldBeTrue();
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void Escalates_cooldown_after_second_open()
|
||||
{
|
||||
var b = new CircuitBreaker();
|
||||
var t0 = new DateTime(2026, 1, 1, 0, 0, 0, DateTimeKind.Utc);
|
||||
// First burst — 4 crashes opens breaker with 1h cooldown.
|
||||
for (var i = 0; i < 4; i++) b.TryRecordCrash(t0.AddSeconds(i), out _);
|
||||
b.StickyAlertActive.ShouldBeTrue();
|
||||
|
||||
// Wait past cooldown. The first crash after cooldown-elapsed resets _openSinceUtc and
|
||||
// bumps escalation level; the next 3 crashes then re-open with the escalated 4h cooldown.
|
||||
b.TryRecordCrash(t0.AddHours(1).AddMinutes(1), out _);
|
||||
var t1 = t0.AddHours(1).AddMinutes(1).AddSeconds(1);
|
||||
b.TryRecordCrash(t1, out _);
|
||||
b.TryRecordCrash(t1.AddSeconds(1), out _);
|
||||
b.TryRecordCrash(t1.AddSeconds(2), out var cooldown).ShouldBeFalse();
|
||||
cooldown.ShouldBe(TimeSpan.FromHours(4));
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void ManualReset_clears_everything()
|
||||
{
|
||||
var b = new CircuitBreaker();
|
||||
var now = DateTime.UtcNow;
|
||||
for (var i = 0; i < 5; i++) b.TryRecordCrash(now.AddSeconds(i), out _);
|
||||
b.StickyAlertActive.ShouldBeTrue();
|
||||
b.ManualReset();
|
||||
b.StickyAlertActive.ShouldBeFalse();
|
||||
b.TryRecordCrash(now.AddSeconds(10), out _).ShouldBeTrue();
|
||||
}
|
||||
}
|
||||
|
||||
[Trait("Category", "Unit")]
|
||||
public sealed class HeartbeatMonitorTests
|
||||
{
|
||||
[Fact]
|
||||
public void Three_consecutive_misses_declares_dead()
|
||||
{
|
||||
var m = new HeartbeatMonitor();
|
||||
m.RecordMiss().ShouldBeFalse();
|
||||
m.RecordMiss().ShouldBeFalse();
|
||||
m.RecordMiss().ShouldBeTrue();
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public void Ack_resets_the_miss_counter()
|
||||
{
|
||||
var m = new HeartbeatMonitor();
|
||||
m.RecordMiss(); m.RecordMiss();
|
||||
m.ConsecutiveMisses.ShouldBe(2);
|
||||
m.RecordAck(DateTime.UtcNow);
|
||||
m.ConsecutiveMisses.ShouldBe(0);
|
||||
}
|
||||
}
|
||||
|
||||
[Trait("Category", "Unit")]
|
||||
public sealed class FocasHostSupervisorTests
|
||||
{
|
||||
private sealed class FakeLauncher : IHostProcessLauncher
|
||||
{
|
||||
public int LaunchAttempts { get; private set; }
|
||||
public int Terminations { get; private set; }
|
||||
public Queue<Func<IFocasClient>> Plan { get; } = new();
|
||||
public bool IsProcessAlive { get; set; }
|
||||
|
||||
public Task<IFocasClient> LaunchAsync(CancellationToken ct)
|
||||
{
|
||||
LaunchAttempts++;
|
||||
if (Plan.Count == 0) throw new InvalidOperationException("FakeLauncher plan exhausted");
|
||||
var next = Plan.Dequeue()();
|
||||
IsProcessAlive = true;
|
||||
return Task.FromResult(next);
|
||||
}
|
||||
|
||||
public Task TerminateAsync(CancellationToken ct)
|
||||
{
|
||||
Terminations++;
|
||||
IsProcessAlive = false;
|
||||
return Task.CompletedTask;
|
||||
}
|
||||
}
|
||||
|
||||
private sealed class StubFocasClient : IFocasClient
|
||||
{
|
||||
public bool IsConnected => true;
|
||||
public Task ConnectAsync(FocasHostAddress address, TimeSpan timeout, CancellationToken ct) => Task.CompletedTask;
|
||||
public Task<(object? value, uint status)> ReadAsync(FocasAddress a, FocasDataType t, CancellationToken ct) =>
|
||||
Task.FromResult<(object?, uint)>((0, 0));
|
||||
public Task<uint> WriteAsync(FocasAddress a, FocasDataType t, object? v, CancellationToken ct) => Task.FromResult(0u);
|
||||
public Task<bool> ProbeAsync(CancellationToken ct) => Task.FromResult(true);
|
||||
public void Dispose() { }
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public async Task GetOrLaunch_returns_client_on_first_success()
|
||||
{
|
||||
var launcher = new FakeLauncher();
|
||||
launcher.Plan.Enqueue(() => new StubFocasClient());
|
||||
var supervisor = new FocasHostSupervisor(launcher);
|
||||
var client = await supervisor.GetOrLaunchAsync(TestContext.Current.CancellationToken);
|
||||
client.ShouldNotBeNull();
|
||||
launcher.LaunchAttempts.ShouldBe(1);
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public async Task GetOrLaunch_retries_after_transient_failure_with_backoff()
|
||||
{
|
||||
var launcher = new FakeLauncher();
|
||||
launcher.Plan.Enqueue(() => throw new TimeoutException("pipe not ready"));
|
||||
launcher.Plan.Enqueue(() => new StubFocasClient());
|
||||
|
||||
var backoff = new Backoff([TimeSpan.FromMilliseconds(10), TimeSpan.FromMilliseconds(20)]);
|
||||
var supervisor = new FocasHostSupervisor(launcher, backoff);
|
||||
|
||||
var unavailableMessages = new List<string>();
|
||||
supervisor.OnUnavailable += m => unavailableMessages.Add(m);
|
||||
|
||||
var client = await supervisor.GetOrLaunchAsync(TestContext.Current.CancellationToken);
|
||||
client.ShouldNotBeNull();
|
||||
launcher.LaunchAttempts.ShouldBe(2);
|
||||
unavailableMessages.Count.ShouldBe(1);
|
||||
unavailableMessages[0].ShouldContain("launch-failed");
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public async Task Repeated_launch_failures_open_breaker_and_surface_InvalidOperation()
|
||||
{
|
||||
var launcher = new FakeLauncher();
|
||||
for (var i = 0; i < 10; i++)
|
||||
launcher.Plan.Enqueue(() => throw new InvalidOperationException("simulated host refused"));
|
||||
|
||||
var supervisor = new FocasHostSupervisor(
|
||||
launcher,
|
||||
backoff: new Backoff([TimeSpan.FromMilliseconds(1)]),
|
||||
breaker: new CircuitBreaker { CrashesAllowedPerWindow = 2, Window = TimeSpan.FromMinutes(5) });
|
||||
|
||||
var ex = await Should.ThrowAsync<InvalidOperationException>(async () =>
|
||||
await supervisor.GetOrLaunchAsync(TestContext.Current.CancellationToken));
|
||||
ex.Message.ShouldContain("circuit breaker");
|
||||
supervisor.StickyAlertActive.ShouldBeTrue();
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public async Task NotifyHostDeadAsync_terminates_current_and_fans_out_unavailable()
|
||||
{
|
||||
var launcher = new FakeLauncher();
|
||||
launcher.Plan.Enqueue(() => new StubFocasClient());
|
||||
var supervisor = new FocasHostSupervisor(launcher);
|
||||
|
||||
var messages = new List<string>();
|
||||
supervisor.OnUnavailable += m => messages.Add(m);
|
||||
await supervisor.GetOrLaunchAsync(TestContext.Current.CancellationToken);
|
||||
|
||||
await supervisor.NotifyHostDeadAsync("heartbeat-loss", TestContext.Current.CancellationToken);
|
||||
|
||||
launcher.Terminations.ShouldBe(1);
|
||||
messages.ShouldContain("heartbeat-loss");
|
||||
supervisor.ObservedCrashes.ShouldBe(1);
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public async Task AcknowledgeAndReset_clears_sticky_alert()
|
||||
{
|
||||
var launcher = new FakeLauncher();
|
||||
for (var i = 0; i < 10; i++)
|
||||
launcher.Plan.Enqueue(() => throw new InvalidOperationException("refused"));
|
||||
var supervisor = new FocasHostSupervisor(
|
||||
launcher,
|
||||
backoff: new Backoff([TimeSpan.FromMilliseconds(1)]),
|
||||
breaker: new CircuitBreaker { CrashesAllowedPerWindow = 1 });
|
||||
|
||||
try { await supervisor.GetOrLaunchAsync(TestContext.Current.CancellationToken); } catch { }
|
||||
supervisor.StickyAlertActive.ShouldBeTrue();
|
||||
|
||||
supervisor.AcknowledgeAndReset();
|
||||
supervisor.StickyAlertActive.ShouldBeFalse();
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public async Task Dispose_terminates_host_process()
|
||||
{
|
||||
var launcher = new FakeLauncher();
|
||||
launcher.Plan.Enqueue(() => new StubFocasClient());
|
||||
var supervisor = new FocasHostSupervisor(launcher);
|
||||
await supervisor.GetOrLaunchAsync(TestContext.Current.CancellationToken);
|
||||
|
||||
supervisor.Dispose();
|
||||
launcher.Terminations.ShouldBe(1);
|
||||
}
|
||||
}
|
||||
@@ -15,6 +15,12 @@ RUN pip install --no-cache-dir "pymodbus[simulator]==3.13.0"
|
||||
WORKDIR /fixtures
COPY profiles/ /fixtures/

# Standalone exception-injection server (pure Python stdlib, no pymodbus
# dependency). Speaks raw Modbus/TCP and emits arbitrary exception codes
# per rules in exception_injection.json. Drives the `exception_injection`
# compose profile. See Docker/README.md §exception injection.
COPY exception_injector.py /fixtures/

EXPOSE 5020

# Default to the standard profile; docker-compose.yml overrides per service.
|
||||
|
||||
@@ -9,9 +9,10 @@ nothing else.
|
||||
|
||||
| File | Purpose |
|
||||
|---|---|
|
||||
| [`Dockerfile`](Dockerfile) | `python:3.12-slim-bookworm` + `pymodbus[simulator]==3.13.0` + the four profile JSONs |
|
||||
| [`docker-compose.yml`](docker-compose.yml) | One service per profile (`standard` / `dl205` / `mitsubishi` / `s7_1500`); all bind `:5020` so only one runs at a time |
|
||||
| [`Dockerfile`](Dockerfile) | `python:3.12-slim-bookworm` + `pymodbus[simulator]==3.13.0` + every profile JSON + `exception_injector.py` |
|
||||
| [`docker-compose.yml`](docker-compose.yml) | One service per profile (`standard` / `dl205` / `mitsubishi` / `s7_1500` / `exception_injection`); all bind `:5020` so only one runs at a time |
|
||||
| [`profiles/*.json`](profiles/) | Same seed-register definitions the native launcher uses — canonical source |
|
||||
| [`exception_injector.py`](exception_injector.py) | Pure-stdlib Modbus/TCP server that emits arbitrary exception codes per rule — used by the `exception_injection` profile |
|
||||
|
||||
## Run
|
||||
|
||||
@@ -29,6 +30,10 @@ docker compose -f tests\ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests\Docker\
|
||||
|
||||
# Siemens S7-1500 MB_SERVER quirks
|
||||
docker compose -f tests\ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests\Docker\docker-compose.yml --profile s7_1500 up
|
||||
|
||||
# Exception-injection — end-to-end coverage of every Modbus exception code
|
||||
# (01/02/03/04/05/06/0A/0B), not just the 02 + 03 pymodbus emits naturally
|
||||
docker compose -f tests\ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests\Docker\docker-compose.yml --profile exception_injection up
|
||||
```
|
||||
|
||||
Detached + stop:
|
||||
@@ -61,6 +66,36 @@ dotnet test tests\ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests
|
||||
records a `SkipReason` when unreachable, so tests stay green on a fresh
|
||||
clone without Docker running.
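
The skip-when-unreachable behaviour described above can be pictured with a short sketch. The real fixture is C# and its probe is not shown in this change, so the helper name `probe_skip_reason` and the message wording are assumptions made purely for illustration.

```python
import socket
from typing import Optional


def probe_skip_reason(host: str, port: int, timeout: float = 1.0) -> Optional[str]:
    """Return None when a TCP connect succeeds, otherwise a skip reason.

    Hypothetical helper: it only mirrors the idea of recording a
    SkipReason instead of failing when the simulator is not running.
    """
    try:
        # A plain TCP connect is enough to tell "container up" from "nothing listening".
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return f"Modbus simulator unreachable at {host}:{port} ({exc})"
```

A test would call the probe once during fixture setup and skip, rather than fail, whenever the returned reason is not `None`.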

## Exception injection

pymodbus's simulator naturally emits only Modbus exception codes `0x02`
(Illegal Data Address, on reads outside its configured ranges) and
`0x03` (Illegal Data Value, on over-length requests). The driver's
`MapModbusExceptionToStatus` table translates eight codes: `0x01`,
`0x02`, `0x03`, `0x04`, `0x05`, `0x06`, `0x0A`, `0x0B`. Unit tests
lock the translation function; the integration side previously only
proved the wire-to-status path for `0x02`.
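
For quick reference, the eight-code translation can be written down as a lookup that mirrors the expectations asserted in `ExceptionInjectionTests`. This is a Python paraphrase for reading convenience, not the driver's C# `MapModbusExceptionToStatus`; returning `None` for unlisted codes is an assumption, since the driver's fallback is not shown here.

```python
from typing import Optional

# Modbus exception code -> OPC UA status name, as asserted by the
# [InlineData] rows in ExceptionInjectionTests (names only, no numeric values).
EXPECTED_STATUS = {
    0x01: "BadNotSupported",        # Illegal Function
    0x02: "BadOutOfRange",          # Illegal Data Address
    0x03: "BadOutOfRange",          # Illegal Data Value
    0x04: "BadDeviceFailure",       # Server Failure
    0x05: "BadDeviceFailure",       # Acknowledge / long operation
    0x06: "BadDeviceFailure",       # Server Busy
    0x0A: "BadCommunicationError",  # Gateway Path Unavailable
    0x0B: "BadCommunicationError",  # Gateway Target No Response
}


def expected_status(exception_code: int) -> Optional[str]:
    """Look up the expected status name; None means the code is not in the table."""
    return EXPECTED_STATUS.get(exception_code)
```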

The `exception_injection` profile runs
[`exception_injector.py`](exception_injector.py), a tiny standalone
Modbus/TCP server written against the Python stdlib (zero
dependencies outside what's in the base image). It speaks the wire
protocol directly (FC 01/02/03/04/05/06/15/16) and looks up each
incoming `(fc, address)` against the rules in
[`profiles/exception_injection.json`](profiles/exception_injection.json);
a matching rule makes the server reply with
`[fc | 0x80, exception_code]` instead of the normal response.
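
The reply shape can be made concrete with a minimal sketch of how an exception response is framed and recognised on the wire. It assumes the standard 7-byte MBAP header (transaction id, protocol id 0, length, unit id); the function names `build_exception_adu` and `parse_exception_adu` are illustrative and do not appear in the repository.

```python
import struct
from typing import Optional, Tuple


def build_exception_adu(tx_id: int, unit_id: int, fc: int, exception_code: int) -> bytes:
    """Frame the two-byte exception PDU [fc | 0x80, code] behind an MBAP header."""
    pdu = bytes([fc | 0x80, exception_code & 0xFF])
    # The MBAP length field counts the unit id plus the PDU bytes.
    mbap = struct.pack(">HHHB", tx_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def parse_exception_adu(adu: bytes) -> Optional[Tuple[int, int]]:
    """Return (original fc, exception code) for an exception reply, else None."""
    if len(adu) < 9:
        return None
    fc = adu[7]
    if not fc & 0x80:
        return None  # high bit clear: a normal response, not an exception
    return fc & 0x7F, adu[8]
```

A client that sent FC03 and receives function byte `0x83` therefore knows the request failed and reads the exception code from the following byte.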

Current rules (see the JSON file for the canonical list):

- `FC03 @1000..1007`: one per exception code (`0x01`/`0x02`/`0x03`/`0x04`/`0x05`/`0x06`/`0x0A`/`0x0B`)
- `FC06 @2000..2001`: `0x04` Server Failure, `0x06` Server Busy (write-path coverage)
- `FC16 @3000`: `0x04` Server Failure (multi-register write path)

Adding more coverage is append-only: drop a new `{fc, address,
exception, description}` entry into the JSON, restart the service,
and add an `[InlineData]` row in `ExceptionInjectionTests`.
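
Because `exception_injector.py` indexes rules by `(fc, address)`, a second rule with the same key silently replaces the first. A pre-flight check along these lines could catch that before the service is restarted. The helper `check_rules` is hypothetical (not part of the repository), and treating codes outside the eight mapped ones as a problem is an assumption about what is wanted.

```python
import json
from typing import List

SUPPORTED_FCS = {1, 2, 3, 4, 5, 6, 15, 16}
MAPPED_CODES = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x0A, 0x0B}


def check_rules(config_text: str) -> List[str]:
    """Return a list of human-readable problems found in the "rules" array."""
    problems: List[str] = []
    seen = set()
    for i, rule in enumerate(json.loads(config_text).get("rules", [])):
        fc, address, code = int(rule["fc"]), int(rule["address"]), int(rule["exception"])
        if (fc, address) in seen:
            problems.append(f"rule {i}: duplicate (fc={fc}, address={address})")
        seen.add((fc, address))
        if fc not in SUPPORTED_FCS:
            problems.append(f"rule {i}: fc {fc} is not handled by the injector")
        if not 0 <= address <= 0xFFFF:
            problems.append(f"rule {i}: address {address} does not fit in uint16")
        if code not in MAPPED_CODES:
            problems.append(f"rule {i}: exception 0x{code:02X} is not in the mapping table")
    return problems
```

An empty list means the new entry does not collide with an existing one and targets a function code the injector actually serves.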
|
||||
|
||||
## References
|
||||
|
||||
- [`docs/drivers/Modbus-Test-Fixture.md`](../../../docs/drivers/Modbus-Test-Fixture.md) — coverage map + gap inventory
|
||||
|
||||
@@ -77,3 +77,24 @@ services:
|
||||
"--modbus_device", "dev",
|
||||
"--json_file", "/fixtures/s7_1500.json"
|
||||
]
|
||||
|
||||
  # Exception-injection profile. Runs the standalone pure-stdlib Modbus/TCP
  # server shipped as exception_injector.py instead of the pymodbus
  # simulator; pymodbus naturally emits only exception codes 02 + 03, and
  # this profile extends integration coverage to the other codes the
  # driver's MapModbusExceptionToStatus table handles (01, 04, 05, 06,
  # 0A, 0B). Rules are driven by exception_injection.json.
  exception_injection:
    profiles: ["exception_injection"]
    image: otopcua-pymodbus:3.13.0
    build:
      context: .
      dockerfile: Dockerfile
    container_name: otopcua-modbus-exception-injector
    restart: "no"
    ports:
      - "5020:5020"
    command: [
      "python", "/fixtures/exception_injector.py",
      "--config", "/fixtures/exception_injection.json"
    ]
|
||||
|
||||
@@ -0,0 +1,261 @@
|
||||
#!/usr/bin/env python3
"""
Minimal Modbus/TCP server that supports per-address + per-function-code
exception injection: the missing piece of the pymodbus simulator, which
only naturally emits exception code 02 (Illegal Data Address) via its
"invalid" list and 03 (Illegal Data Value) via spec-enforced length caps.

Integration tests against this fixture drive the driver's
`MapModbusExceptionToStatus` end-to-end over the wire for codes 01, 04,
05, 06, 0A, 0B, the ones the pymodbus simulator can't be configured to
return.

Wire protocol is straight Modbus/TCP (spec chapter 7.1):

    MBAP header (7 bytes): [tx_id:u16 BE][proto=0:u16][length:u16][unit_id:u8]
    then length-1 bytes of PDU. Length covers unit_id + PDU.

Supported function codes (enough for the driver's RMW + read paths):
    01 Read Coils, 02 Read Discrete Inputs,
    03 Read Holding Registers, 04 Read Input Registers,
    05 Write Single Coil, 06 Write Single Register,
    15 Write Multiple Coils, 16 Write Multiple Registers.

Config JSON schema (see exception_injection.json):

    {
      "listen": { "host": "0.0.0.0", "port": 5020 },
      "seeds":  { "hr": { "<addr>": <uint16>, ... },
                  "ir": { "<addr>": <uint16>, ... },
                  "co": { "<addr>": <0|1>, ... },
                  "di": { "<addr>": <0|1>, ... } },
      "rules":  [ { "fc": <int>, "address": <int>, "exception": <int>,
                    "description": "..." }, ... ]
    }

Rules match on (fc, starting address). A matching rule wins and the server
responds with the PDU `[fc | 0x80, exception_code]`.

Zero runtime dependencies outside the Python stdlib so the Docker image
stays tiny.
"""
from __future__ import annotations

import argparse
import asyncio
import json
import logging
import struct
import sys
from dataclasses import dataclass


log = logging.getLogger("exception_injector")


@dataclass(frozen=True)
class Rule:
    fc: int
    address: int
    exception: int
    description: str = ""


class Store:
    """In-memory data store backing non-injected reads + writes."""

    def __init__(self, seeds: dict[str, dict[str, int]]) -> None:
        self.hr: dict[int, int] = {int(k): int(v) for k, v in seeds.get("hr", {}).items()}
        self.ir: dict[int, int] = {int(k): int(v) for k, v in seeds.get("ir", {}).items()}
        self.co: dict[int, int] = {int(k): int(v) for k, v in seeds.get("co", {}).items()}
        self.di: dict[int, int] = {int(k): int(v) for k, v in seeds.get("di", {}).items()}

    def read_bits(self, table: dict[int, int], addr: int, count: int) -> bytes:
        """Pack `count` bits LSB-first into the Modbus bit response body."""
        bits = [table.get(addr + i, 0) & 1 for i in range(count)]
        out = bytearray((count + 7) // 8)
        for i, b in enumerate(bits):
            if b:
                out[i // 8] |= 1 << (i % 8)
        return bytes(out)

    def read_regs(self, table: dict[int, int], addr: int, count: int) -> bytes:
        """Pack `count` uint16 BE into the Modbus register response body."""
        return b"".join(struct.pack(">H", table.get(addr + i, 0) & 0xFFFF) for i in range(count))


class Server:
    EXC_ILLEGAL_FUNCTION = 0x01
    EXC_ILLEGAL_DATA_ADDRESS = 0x02
    EXC_ILLEGAL_DATA_VALUE = 0x03

    def __init__(self, store: Store, rules: list[Rule]) -> None:
        self._store = store
        # Index rules by (fc, address) for O(1) lookup.
        self._rules: dict[tuple[int, int], Rule] = {(r.fc, r.address): r for r in rules}

    def lookup_rule(self, fc: int, address: int) -> Rule | None:
        return self._rules.get((fc, address))

    def exception_pdu(self, fc: int, code: int) -> bytes:
        return bytes([fc | 0x80, code & 0xFF])

    def handle_pdu(self, pdu: bytes) -> bytes:
        if not pdu:
            return self.exception_pdu(0, self.EXC_ILLEGAL_FUNCTION)

        fc = pdu[0]

        # Reads: FC 01/02/03/04 — [fc u8][addr u16][quantity u16]
        if fc in (0x01, 0x02, 0x03, 0x04):
            if len(pdu) != 5:
                return self.exception_pdu(fc, self.EXC_ILLEGAL_DATA_VALUE)
            addr, count = struct.unpack(">HH", pdu[1:5])

            rule = self.lookup_rule(fc, addr)
            if rule is not None:
                log.info("inject fc=%d addr=%d -> exception 0x%02X (%s)",
                         fc, addr, rule.exception, rule.description)
                return self.exception_pdu(fc, rule.exception)

            # Spec caps: FC01/02 allow 1..2000 bits; FC03/04 allow 1..125 regs.
            if fc in (0x01, 0x02):
                if not 1 <= count <= 2000:
                    return self.exception_pdu(fc, self.EXC_ILLEGAL_DATA_VALUE)
                body = self._store.read_bits(
                    self._store.co if fc == 0x01 else self._store.di, addr, count)
                return bytes([fc, len(body)]) + body

            if not 1 <= count <= 125:
                return self.exception_pdu(fc, self.EXC_ILLEGAL_DATA_VALUE)
            body = self._store.read_regs(
                self._store.hr if fc == 0x03 else self._store.ir, addr, count)
            return bytes([fc, len(body)]) + body

        # FC05: [fc u8][addr u16][value u16] where value is 0xFF00=ON or 0x0000=OFF.
        if fc == 0x05:
            if len(pdu) != 5:
                return self.exception_pdu(fc, self.EXC_ILLEGAL_DATA_VALUE)
            addr, value = struct.unpack(">HH", pdu[1:5])
            rule = self.lookup_rule(fc, addr)
            if rule is not None:
                return self.exception_pdu(fc, rule.exception)
            if value == 0xFF00:
                self._store.co[addr] = 1
            elif value == 0x0000:
                self._store.co[addr] = 0
            else:
                return self.exception_pdu(fc, self.EXC_ILLEGAL_DATA_VALUE)
            return pdu  # FC05 echoes the request on success.

        # FC06: [fc u8][addr u16][value u16].
        if fc == 0x06:
            if len(pdu) != 5:
                return self.exception_pdu(fc, self.EXC_ILLEGAL_DATA_VALUE)
            addr, value = struct.unpack(">HH", pdu[1:5])
            rule = self.lookup_rule(fc, addr)
            if rule is not None:
                return self.exception_pdu(fc, rule.exception)
            self._store.hr[addr] = value
            return pdu  # FC06 echoes on success.

        # FC15: [fc u8][addr u16][count u16][byte_count u8][values...]
        if fc == 0x0F:
            if len(pdu) < 6:
                return self.exception_pdu(fc, self.EXC_ILLEGAL_DATA_VALUE)
            addr, count = struct.unpack(">HH", pdu[1:5])
            rule = self.lookup_rule(fc, addr)
            if rule is not None:
                return self.exception_pdu(fc, rule.exception)
            # Happy-path ignore-the-data, ack with standard response.
            return struct.pack(">BHH", fc, addr, count)

        # FC16: [fc u8][addr u16][count u16][byte_count u8][u16 values...]
        if fc == 0x10:
            if len(pdu) < 6:
                return self.exception_pdu(fc, self.EXC_ILLEGAL_DATA_VALUE)
            addr, count = struct.unpack(">HH", pdu[1:5])
            rule = self.lookup_rule(fc, addr)
            if rule is not None:
                return self.exception_pdu(fc, rule.exception)
            byte_count = pdu[5]
            data = pdu[6:6 + byte_count]
            # Reject malformed frames (spec: 1..123 registers, 2 bytes each)
            # instead of letting struct.unpack raise and drop the connection.
            if not 1 <= count <= 123 or byte_count != count * 2 or len(data) != byte_count:
                return self.exception_pdu(fc, self.EXC_ILLEGAL_DATA_VALUE)
            for i in range(count):
                self._store.hr[addr + i] = struct.unpack(">H", data[i * 2:i * 2 + 2])[0]
            return struct.pack(">BHH", fc, addr, count)

        return self.exception_pdu(fc, self.EXC_ILLEGAL_FUNCTION)

    async def handle_connection(self, reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
        peer = writer.get_extra_info("peername")
        log.info("client connected from %s", peer)
        try:
            while True:
                hdr = await reader.readexactly(7)
                tx_id, proto, length, unit_id = struct.unpack(">HHHB", hdr)
                if length < 1:
                    return
                pdu = await reader.readexactly(length - 1)

                resp = self.handle_pdu(pdu)
                out = struct.pack(">HHHB", tx_id, proto, len(resp) + 1, unit_id) + resp
                writer.write(out)
                await writer.drain()
        except asyncio.IncompleteReadError:
            log.info("client %s disconnected", peer)
        except Exception:  # pylint: disable=broad-except
            log.exception("unexpected error serving %s", peer)
        finally:
            try:
                writer.close()
                await writer.wait_closed()
            except Exception:  # pylint: disable=broad-except
                pass


def load_config(path: str) -> tuple[Store, list[Rule], str, int]:
    with open(path, "r", encoding="utf-8") as fh:
        raw = json.load(fh)
    listen = raw.get("listen", {})
    host = listen.get("host", "0.0.0.0")
    port = int(listen.get("port", 5020))
    store = Store(raw.get("seeds", {}))
    rules = [
        Rule(
            fc=int(r["fc"]),
            address=int(r["address"]),
            exception=int(r["exception"]),
            description=str(r.get("description", "")),
        )
        for r in raw.get("rules", [])
    ]
    return store, rules, host, port


async def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--config", required=True, help="Path to exception-injection JSON config.")
    args = parser.parse_args(argv)

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(name)s - %(message)s")

    store, rules, host, port = load_config(args.config)
    server = Server(store, rules)
    listener = await asyncio.start_server(server.handle_connection, host, port)

    log.info("exception-injector listening on %s:%d with %d rule(s)", host, port, len(rules))
    for r in rules:
        log.info("  rule: fc=%d addr=%d -> exception 0x%02X (%s)",
                 r.fc, r.address, r.exception, r.description)

    async with listener:
        await listener.serve_forever()
    return 0


if __name__ == "__main__":
    try:
        sys.exit(asyncio.run(main(sys.argv[1:])))
    except KeyboardInterrupt:
        sys.exit(0)
|
||||
@@ -0,0 +1,34 @@
|
||||
{
  "_comment": "Modbus exception-injection profile; feeds exception_injector.py (not pymodbus). Rules match by (fc, address). HR[0-31] are address-as-value for the happy-path reads; HR[1000..1007] (FC03 reads), HR[2000..2001] (FC06 writes) and HR[3000] (FC16 writes) carry per-exception-code rules. Every code in the driver's MapModbusExceptionToStatus table that pymodbus can't naturally emit has a dedicated slot. See Docker/README.md §exception injection.",

  "listen": { "host": "0.0.0.0", "port": 5020 },

  "seeds": {
    "hr": {
      "0": 0, "1": 1, "2": 2, "3": 3,
      "4": 4, "5": 5, "6": 6, "7": 7,
      "8": 8, "9": 9, "10": 10, "11": 11,
      "12": 12, "13": 13, "14": 14, "15": 15,
      "16": 16, "17": 17, "18": 18, "19": 19,
      "20": 20, "21": 21, "22": 22, "23": 23,
      "24": 24, "25": 25, "26": 26, "27": 27,
      "28": 28, "29": 29, "30": 30, "31": 31
    }
  },

  "rules": [
    { "fc": 3, "address": 1000, "exception": 1, "description": "FC03 @1000 -> Illegal Function (0x01)" },
    { "fc": 3, "address": 1001, "exception": 2, "description": "FC03 @1001 -> Illegal Data Address (0x02)" },
    { "fc": 3, "address": 1002, "exception": 3, "description": "FC03 @1002 -> Illegal Data Value (0x03)" },
    { "fc": 3, "address": 1003, "exception": 4, "description": "FC03 @1003 -> Server Failure (0x04)" },
    { "fc": 3, "address": 1004, "exception": 5, "description": "FC03 @1004 -> Acknowledge (0x05)" },
    { "fc": 3, "address": 1005, "exception": 6, "description": "FC03 @1005 -> Server Busy (0x06)" },
    { "fc": 3, "address": 1006, "exception": 10, "description": "FC03 @1006 -> Gateway Path Unavailable (0x0A)" },
    { "fc": 3, "address": 1007, "exception": 11, "description": "FC03 @1007 -> Gateway Target No Response (0x0B)" },

    { "fc": 6, "address": 2000, "exception": 4, "description": "FC06 @2000 -> Server Failure (0x04, e.g. CPU in PROGRAM mode)" },
    { "fc": 6, "address": 2001, "exception": 6, "description": "FC06 @2001 -> Server Busy (0x06)" },

    { "fc": 16, "address": 3000, "exception": 4, "description": "FC16 @3000 -> Server Failure (0x04)" }
  ]
}
|
||||
@@ -0,0 +1,122 @@
|
||||
using Shouldly;
|
||||
using Xunit;
|
||||
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests;
|
||||
|
||||
/// <summary>
|
||||
/// End-to-end verification that the driver's <c>MapModbusExceptionToStatus</c>
|
||||
/// translation is wire-correct for every exception code in the mapping table —
|
||||
/// not just 0x02 and 0x03, the only codes the pymodbus simulator naturally emits.
|
||||
/// Drives the standalone <c>exception_injector.py</c> server (<c>exception_injection</c>
|
||||
/// compose profile) at each of the rule addresses in
|
||||
/// <c>Docker/profiles/exception_injection.json</c> and asserts the driver surfaces
|
||||
/// the expected OPC UA StatusCode.
|
||||
/// </summary>
|
||||
/// <remarks>
|
||||
/// Why integration coverage on top of the unit tests: the unit tests prove the
|
||||
/// translation function is correct; these prove the driver wires it through on
|
||||
/// the read + write paths unchanged, after the MBAP header + PDU round-trip
|
||||
/// (where a subtle framing bug could swallow or misclassify the exception).
|
||||
/// </remarks>
|
||||
[Collection(ModbusSimulatorCollection.Name)]
|
||||
[Trait("Category", "Integration")]
|
||||
[Trait("Device", "ExceptionInjection")]
|
||||
public sealed class ExceptionInjectionTests(ModbusSimulatorFixture sim)
|
||||
{
|
||||
private const uint StatusGood = 0u;
|
||||
private const uint StatusBadOutOfRange = 0x803C0000u;
|
||||
private const uint StatusBadNotSupported = 0x803D0000u;
|
||||
private const uint StatusBadDeviceFailure = 0x80550000u;
|
||||
private const uint StatusBadCommunicationError = 0x80050000u;
|
||||
|
||||
private void SkipUnlessInjectorLive()
|
||||
{
|
||||
if (sim.SkipReason is not null) Assert.Skip(sim.SkipReason);
|
||||
var profile = Environment.GetEnvironmentVariable("MODBUS_SIM_PROFILE");
|
||||
if (!string.Equals(profile, "exception_injection", StringComparison.OrdinalIgnoreCase))
|
||||
Assert.Skip("MODBUS_SIM_PROFILE != exception_injection — skipping. " +
|
||||
"Start the fixture with --profile exception_injection.");
|
||||
}
|
||||
|
||||
private async Task<IReadOnlyList<DataValueSnapshot>> ReadSingleAsync(int address, string tagName)
|
||||
{
|
||||
var opts = new ModbusDriverOptions
|
||||
{
|
||||
Host = sim.Host,
|
||||
Port = sim.Port,
|
||||
UnitId = 1,
|
||||
Timeout = TimeSpan.FromSeconds(2),
|
||||
Tags =
|
||||
[
|
||||
new ModbusTagDefinition(tagName,
|
||||
ModbusRegion.HoldingRegisters, Address: (ushort)address,
|
||||
DataType: ModbusDataType.UInt16, Writable: false),
|
||||
],
|
||||
Probe = new ModbusProbeOptions { Enabled = false },
|
||||
};
|
||||
await using var driver = new ModbusDriver(opts, driverInstanceId: "modbus-exc");
|
||||
await driver.InitializeAsync("{}", TestContext.Current.CancellationToken);
|
||||
return await driver.ReadAsync([tagName], TestContext.Current.CancellationToken);
|
||||
}
|
||||
|
||||
[Theory]
|
||||
[InlineData(1000, StatusBadNotSupported, "exc 0x01 (Illegal Function) -> BadNotSupported")]
|
||||
[InlineData(1001, StatusBadOutOfRange, "exc 0x02 (Illegal Data Address) -> BadOutOfRange")]
|
||||
[InlineData(1002, StatusBadOutOfRange, "exc 0x03 (Illegal Data Value) -> BadOutOfRange")]
|
||||
[InlineData(1003, StatusBadDeviceFailure, "exc 0x04 (Server Failure) -> BadDeviceFailure")]
|
||||
[InlineData(1004, StatusBadDeviceFailure, "exc 0x05 (Acknowledge / long op) -> BadDeviceFailure")]
|
||||
[InlineData(1005, StatusBadDeviceFailure, "exc 0x06 (Server Busy) -> BadDeviceFailure")]
|
||||
[InlineData(1006, StatusBadCommunicationError, "exc 0x0A (Gateway Path Unavailable) -> BadCommunicationError")]
|
||||
[InlineData(1007, StatusBadCommunicationError, "exc 0x0B (Gateway Target No Response) -> BadCommunicationError")]
|
||||
public async Task FC03_read_at_injection_address_surfaces_expected_status(
|
||||
int address, uint expectedStatus, string scenario)
|
||||
{
|
||||
SkipUnlessInjectorLive();
|
||||
var results = await ReadSingleAsync(address, $"Injected_{address}");
|
||||
results[0].StatusCode.ShouldBe(expectedStatus, scenario);
|
||||
}
|
||||
|
||||
[Fact]
|
||||
public async Task FC03_read_at_non_injected_address_returns_Good()
|
||||
{
|
||||
// Sanity: HR[0..31] are seeded with address-as-value in the profile. A read at
|
||||
// one of those addresses must come back Good (0) — otherwise the injector is
|
||||
// misbehaving and every other assertion in this class is uninformative.
|
||||
SkipUnlessInjectorLive();
|
||||
var results = await ReadSingleAsync(address: 5, tagName: "Healthy_5");
|
||||
results[0].StatusCode.ShouldBe(StatusGood);
|
||||
results[0].Value.ShouldBe((ushort)5);
|
||||
}
|
||||
|
||||
[Theory]
|
||||
[InlineData(2000, StatusBadDeviceFailure, "exc 0x04 on FC06 -> BadDeviceFailure (CPU in PROGRAM mode)")]
|
||||
[InlineData(2001, StatusBadDeviceFailure, "exc 0x06 on FC06 -> BadDeviceFailure (Server Busy)")]
|
||||
public async Task FC06_write_at_injection_address_surfaces_expected_status(
|
||||
int address, uint expectedStatus, string scenario)
|
||||
{
|
||||
SkipUnlessInjectorLive();
|
||||
var tag = $"InjectedWrite_{address}";
|
||||
var opts = new ModbusDriverOptions
|
||||
{
|
||||
Host = sim.Host,
|
||||
Port = sim.Port,
|
||||
UnitId = 1,
|
||||
Timeout = TimeSpan.FromSeconds(2),
|
||||
Tags =
|
||||
[
|
||||
new ModbusTagDefinition(tag,
|
||||
ModbusRegion.HoldingRegisters, Address: (ushort)address,
|
||||
DataType: ModbusDataType.UInt16, Writable: true),
|
||||
],
|
||||
Probe = new ModbusProbeOptions { Enabled = false },
|
||||
};
|
||||
await using var driver = new ModbusDriver(opts, driverInstanceId: "modbus-exc-write");
|
||||
await driver.InitializeAsync("{}", TestContext.Current.CancellationToken);
|
||||
|
||||
var writes = await driver.WriteAsync(
|
||||
[new WriteRequest(tag, (ushort)42)],
|
||||
TestContext.Current.CancellationToken);
|
||||
writes[0].StatusCode.ShouldBe(expectedStatus, scenario);
|
||||
}
|
||||
}
|
||||