Plan — alarms over the mxaccessgw gateway
Coordinated epic across two repos:
- lmxopcua (this repo): c:\Users\dohertj2\Desktop\lmxopcua\
- mxaccessgw: c:\Users\dohertj2\Desktop\mxaccessgw\
Why
PR 7.2 (2026-04-30, commit ae7106d) retired the in-process v1 Galaxy stack
(Driver.Galaxy.Host / .Proxy / .Shared + OtOpcUaGalaxyHost Windows
service) and migrated Galaxy access to the in-process GalaxyDriver over
mxaccessgw's gRPC. In doing so, three v1 capabilities regressed:
- Native MxAccess alarm-event metadata. v1's GalaxyAlarmTracker surfaced rich alarm transitions (operator comment, original raise time, ack time, alarm category, native severity). The current architecture reconstructs Part 9 transitions by subscribing to four sub-attribute value updates (InAlarm, Acked, Priority, Description), which is fine for raise/clear but loses everything else.
- Native MxAccess Acknowledge semantics. v1 called the MxAccess ack API directly from GalaxyAlarmTracker. Today, OPC UA acks are written into the AckMsgWriteRef sub-attribute: semantically valid, but a round-trip through the value path that loses operator-comment fidelity.
- Alarm-historian write-back path for non-Galaxy alarm sources. v1's GalaxyHistorianWriter implemented IAlarmHistorianWriter and forwarded scripted-alarm transitions (and any future non-Galaxy alarm source: AB CIP ALMD, OpcUaClient A&E, etc.) back to AVEVA Historian via aahClientManaged. PR 7.2 deleted it. Phase7Composer.ResolveHistorianSink now finds no writer and falls back to NullAlarmHistorianSink, so scripted-alarm transitions queue locally and are silently discarded. Galaxy-native alarms (with $Alarm* extensions) reach AVEVA Historian via System Platform's own HistorizeToAveva toggle on the Galaxy template; that path was never broken and is not in scope for this epic.
gateway.md (mxaccessgw, line 8) explicitly commits the gateway to "full
MXAccess parity… preserve MXAccess behavior first… native MXAccess event
families." Today's gateway proto exposes only data-change families. Closing
the alarm regression and fulfilling that parity statement are the same task.
Goals
- Restore all three regressed capabilities to feature parity with v1.
- Keep the v2 architectural split — gateway owns MxAccess transport; lmxopcua owns OPC UA Part 9 semantics, ACL/role enforcement, and multi-source aggregation (driver-native + scripted + sub-attribute).
- Preserve the value-driven sub-attribute path as a fallback for Galaxy templates that don't carry $Alarm* extensions.
- Land the work as a sequence of small, independently-reviewable PRs that alternate between repos in dependency order.
Non-goals
- Reimplementing the Part 9 state machine inside mxaccessgw. The gateway stays UA-agnostic.
- Reworking the LDAP role-grant or OPC UA AlarmAck ACL surface. Those already exist and route through Server/Alarms/IAlarmAcknowledger.
- Adding alarm support to non-Galaxy drivers (AbCip / FOCAS / OpcUaClient already have their own IAlarmSource implementations; Modbus / S7 / AbLegacy / TwinCAT don't have a native alarm bus and are out of scope).
- Altering Galaxy template conventions or $Alarm* extensions in the customer's Galaxy.
Before → after
Today (post-PR 7.2):
MxAccess COM (gateway worker)
│ data-change events only on the MxEvent stream
▼
GalaxyDriver (no IAlarmSource)
│ IWritable / ISubscribable / ITagDiscovery only
▼
DriverNodeManager
├─ subscribes to four $Alarm* sub-attributes per condition
├─ AlarmConditionService rebuilds Part 9 transitions from value updates
└─ DriverWritableAcknowledger writes AckMsgWriteRef on ack
Phase7Composer.ResolveHistorianSink → NullAlarmHistorianSink
(scripted-alarm transitions queue → silently discarded)
After this epic:
MxAccess COM (gateway worker)
│ data-change ──┐
│ alarm-transition │
│ write-complete ├─► single MxEvent stream (new family added)
▼ ▼
GalaxyDriver : ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable,
IHostConnectivityProbe, IAlarmSource ← restored
├─ EventPump dispatches OnAlarmTransition family → IAlarmSource.OnAlarmEvent
├─ AcknowledgeAsync → gateway RPC AcknowledgeAlarm
└─ QueryActiveAlarmsAsync → gateway RPC QueryActiveAlarms (ConditionRefresh)
DriverNodeManager
├─ rich alarm events from IAlarmSource.OnAlarmEvent → AlarmConditionService
├─ value-driven sub-attribute path STILL WORKS for templates without $Alarm
├─ DriverWritableAcknowledger preserved as fallback for the value path
└─ ScriptedAlarmEngine output continues to feed AlarmConditionService
Phase7Composer.ResolveHistorianSink → SidecarAlarmHistorianWriter
├─ scripted-alarm transitions → SqliteStoreAndForwardSink
└─ drain worker → sidecar IPC WriteAlarmEvents → AVEVA Historian
Architecture decisions
D1 — Where the Part 9 state machine runs. Stays in lmxopcua's
AlarmConditionService. Gateway is UA-agnostic. ScriptedAlarmEngine produces
Part 9 transitions with no MxAccess origin; the aggregator must live where all
sources converge.
D2 — Where authz on Acknowledge runs. Stays in lmxopcua. The OPC UA
AlarmConditionState.OnAcknowledge delegate already checks the session's
roles for AlarmAck against the LDAP/role-grant ACL. The gateway should
never be reachable in a way that bypasses that check.
D3 — How rich alarm events reach OPC UA clients. New MxEventFamily
on the existing StreamEvents RPC (no second stream). Adds latency
parity with data-change events, reuses the bounded-channel + worker-side
delivery semantics already documented in gateway.md.
D4 — Sub-attribute fallback path stays. Some Galaxy templates won't
have $Alarm* extensions yet; the existing value-driven path remains the
only way to surface alarms for those templates. Both paths feed
AlarmConditionService. Driver-native events take precedence when both
are present (more authoritative, lower latency).
D5 — Where the historian writer lives. In the Wonderware historian
sidecar, not in the gateway. The sidecar already owns aahClientManaged,
already has a WriteAlarmEvents IPC slot defined in Ipc/Contracts.cs, and
already dispatches to an IAlarmEventWriter interface — it's just unwired
in Program.cs:57. The gateway is for MxAccess (live data + Galaxy
hierarchy); the historian sidecar is for aahClientManaged (time-series +
alarms historian). Two different SDKs, two different concerns; keep the
split. Bonus: completing the sidecar's write path also gives it a clearer
long-term role — once the REST-API migration in histsdk\instructions.md
takes over reads, write-back keeps the sidecar relevant rather than
retiring it as a read-only relic. Galaxy-native alarms bypass this
entirely — System Platform's own HistorizeToAveva toggle on the
Galaxy template publishes them directly. The sidecar write path is
exclusively for non-Galaxy producers (today: scripted alarms; future: AB
CIP ALMD or any other lmxopcua-side alarm source the customer wants
unified into AVEVA Historian).
Track A — mxaccessgw changes
All four PRs land in c:\Users\dohertj2\Desktop\mxaccessgw\.
PR A.1 — proto: add alarm-transition event family + ack/query RPCs
Files (src\MxGateway.Contracts\Protos\mxaccess_gateway.proto):
- Extend MxEventFamily (line 403):

      MX_EVENT_FAMILY_ON_ALARM_TRANSITION = 5;

- Extend the MxEvent.body oneof (line 395) with:

      OnAlarmTransitionEvent on_alarm_transition = 24;

- New message OnAlarmTransitionEvent after the existing event-family bodies (line 425+). Carry the full MxAccess alarm payload: alarm name, source object reference, alarm-type-name (e.g. "AnalogLimitAlarm.HiHi"), transition kind enum (Raise/Acknowledge/Clear), severity (raw numeric; keep the MxAccess scale, since mapping to OPC UA 0-1000 happens server-side in lmxopcua), original_raise_timestamp, transition_timestamp, optional operator_user, optional operator_comment, alarm category string, alarm description. Mirror the field set documented in v1's GalaxyAlarmTracker.
- New RPCs on the MxAccessGateway service (line 11):

      rpc AcknowledgeAlarm(AcknowledgeAlarmRequest) returns (AcknowledgeAlarmReply);
      rpc QueryActiveAlarms(QueryActiveAlarmsRequest) returns (stream ActiveAlarmSnapshot);

  AcknowledgeAlarmRequest carries session_id, alarm_full_reference, comment, user_principal. Reply carries MxStatusProxy. QueryActiveAlarmsRequest carries session_id and an optional alarm_filter_prefix (for ConditionRefresh on a sub-tree). ActiveAlarmSnapshot carries the same fields as OnAlarmTransitionEvent plus a current_state enum (Active/ActiveAcked/Inactive).
Tests (MxGateway.Tests — proto/codegen sanity):
- Round-trip Serialize→Deserialize for the new messages, covering both the all-fields-populated and empty-optional-fields cases.
- MxEvent.body oneof selection guard: supplying multiple bodies is rejected.
Out of scope: worker-side wiring (PR A.2), gateway-side dispatch (PR A.3). PR A.1 is a pure contract-surface change; nothing functional yet.
PR A.2 — worker: subscribe to MxAccess alarm event source
Files (src\MxGateway.Worker\ — net48/x86):
The MxAccess Toolkit exposes alarm subscription separately from data
subscription. Per AVEVA's MXAccess C++ Toolkit reference (canonical doc
referenced from gateway.md), alarm events arrive through the
IAlarmEventSink interface registered against the MxAccess Alarms
collection of an open session, OR via the MxAccess "alarm provider"
subscription pattern (depends on Toolkit version on the worker host —
verify against the version actually deployed in the worker bin during
PR A.2).
- Worker subscribes to MxAccess alarms once per session, with a single sink that fans out into the same bounded channel the data-change pump uses (MxGateway.Worker\Eventing\EventChannel.cs or whatever the worker currently calls its sink; verify name during the PR).
- Sink translates each MxAccess alarm event into a WorkerEvent proto (defined in mxaccess_worker.proto) carrying the new OnAlarmTransitionEvent body. Reuses the existing worker_sequence counter so ordering is preserved across families.
- Worker honours the same backpressure rules as data-change events: newest-dropped on full channel, single dropped-counter metric per family.
Tests (MxGateway.Worker.Tests):
- Fake IAlarmEventSink source emits canned transitions; assert the worker forwards each as the right WorkerEvent shape.
- Cancellation test: closing the session unsubscribes from MxAccess alarms cleanly (no leaked sinks if the worker is recycled mid-session).
Out of scope: any gateway-side dispatch, any RPC handler — PR A.2 is worker-internal.
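The backpressure rule for PR A.2 (newest-dropped on a full channel, one dropped-counter per family, one shared worker_sequence) can be sketched as follows. This is a minimal Python illustration of the behaviour, not the worker's C# code: the class name `BoundedEventChannel`, its methods, and the family strings are hypothetical. It also assumes that a dropped event still consumes a sequence number so a consumer can see the gap; the plan does not say which way that goes.

```python
from collections import Counter, deque
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class WorkerEvent:
    worker_sequence: int  # one counter shared by every family
    family: str           # e.g. "data_change" or "alarm_transition"
    payload: object


class BoundedEventChannel:
    """Hypothetical stand-in for the worker's bounded event channel."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._queue = deque()
        self._sequence = count(1)
        self.dropped = Counter()  # dropped-event count keyed by family

    def publish(self, family: str, payload: object) -> bool:
        """Enqueue an event; when the channel is full, drop the incoming one."""
        # Assumption: the sequence number is taken before the capacity check,
        # so drops show up as gaps in worker_sequence.
        event = WorkerEvent(next(self._sequence), family, payload)
        if len(self._queue) >= self._capacity:
            self.dropped[family] += 1
            return False
        self._queue.append(event)
        return True

    def drain(self) -> list:
        """Remove and return everything currently queued, oldest first."""
        events = list(self._queue)
        self._queue.clear()
        return events
```

Because both families share one queue and one counter, a burst of data-change events can cause alarm transitions to be dropped; the per-family counter is what makes that visible.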
PR A.3 — gateway: dispatch OnAlarmTransition + implement AcknowledgeAlarm
Files (src\MxGateway.Server\):
- The session-level event multiplexer (Sessions\SessionEventStream.cs or equivalent; verify name during PR) recognizes the new WorkerEvent body and forwards it as an MxEvent with family MX_EVENT_FAMILY_ON_ALARM_TRANSITION to the gRPC StreamEvents consumer.
- New RPC handler AcknowledgeAlarm builds an MxAccess WorkerCommand carrying an AlarmAcknowledgeCommand (new in mxaccess_worker.proto under PR A.1). Forwarded to the worker; reply mapped to AcknowledgeAlarmReply with the MxAccess MxStatus proxy populated.
- AuthN: same API-key + scope check as existing RPCs. Add a new scope invoke:alarm-ack (mirrors invoke:write granularity); existing keys without it return PERMISSION_DENIED.
Tests (MxGateway.Tests, MxGateway.IntegrationTests):
- Unit: dispatch test. Fake worker emits an AlarmTransition event; assert the gateway forwards it on the live StreamEvents channel of every subscribed session.
- Integration: end-to-end against the real worker (requires the parity rig setup; see docs\v2\Galaxy.ParityRig.md in lmxopcua for the MxAccess-installed dev box prerequisites). Trigger a real Galaxy alarm, assert the gateway emits OnAlarmTransition. Acknowledge via the new RPC, assert the alarm transitions to ActiveAcked and an Acknowledge transition event is emitted back.
- AuthN: existing key without invoke:alarm-ack scope rejected.
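The scope rule in PR A.3 amounts to a per-RPC lookup followed by a membership test. A minimal Python sketch, assuming a hypothetical `REQUIRED_SCOPE` table and `authorize` helper; the `Write` RPC name is an illustrative stand-in for an existing RPC, and only the scope strings and PERMISSION_DENIED come from the plan:

```python
from enum import Enum


class StatusCode(Enum):
    OK = "OK"
    PERMISSION_DENIED = "PERMISSION_DENIED"


# Scope required per RPC. "Write" is an assumed name for an existing RPC;
# the invoke:alarm-ack entry is the scope this PR introduces.
REQUIRED_SCOPE = {
    "Write": "invoke:write",
    "AcknowledgeAlarm": "invoke:alarm-ack",
}


def authorize(rpc_name: str, key_scopes: frozenset) -> StatusCode:
    """Allow the call only when the API key carries the RPC's required scope."""
    required = REQUIRED_SCOPE[rpc_name]
    if required in key_scopes:
        return StatusCode.OK
    return StatusCode.PERMISSION_DENIED
```

A key issued before this PR, holding only invoke:write, keeps working for writes and is rejected for acknowledgements, which is the case the AuthN test exercises.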
PR A.4 — gateway: ConditionRefresh snapshot via QueryActiveAlarms
Files (src\MxGateway.Server\, src\MxGateway.Worker\):
- Worker exposes a QueryActiveAlarmsCommand that walks the session's active-alarm collection and streams snapshots back through the existing command-reply channel. The MxAccess Toolkit's Alarms.GetActive() (verify exact API name during PR) is the underlying call.
- Gateway RPC QueryActiveAlarms opens a server-streaming reply and batches snapshots through.
- AuthN: new scope invoke:alarm-query (separate from ack so a read-only client can refresh without ack rights).
Tests:
- Worker-test: synthetic active set of 0 / 1 / 100 alarms; assert pagination respects worker channel capacity.
- Integration: against the parity rig, assert a ConditionRefresh after reconnect returns every alarm currently Active or ActiveAcked in the Galaxy.
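The snapshot walk for PR A.4 can be sketched as a filter plus fixed-size batching. This Python sketch assumes that Inactive alarms are skipped (consistent with the integration test, but not stated as a rule in the plan) and that the batch size stands in for the worker channel capacity; the function name and the default of 32 are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class ActiveAlarmSnapshot:
    alarm_full_reference: str
    current_state: str  # "Active", "ActiveAcked", or "Inactive"


def query_active_alarms(
    alarms: Iterable[ActiveAlarmSnapshot],
    alarm_filter_prefix: Optional[str] = None,
    batch_size: int = 32,  # assumed stand-in for the channel capacity
) -> Iterator[list]:
    """Yield batches of active snapshots, optionally limited to a sub-tree."""
    batch = []
    for alarm in alarms:
        if alarm.current_state == "Inactive":
            continue  # assumption: ConditionRefresh only reports active alarms
        if alarm_filter_prefix and not alarm.alarm_full_reference.startswith(
            alarm_filter_prefix
        ):
            continue
        batch.append(alarm)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

With 100 active alarms and a batch size of 32, the stream carries batches of 32, 32, 32 and 4, which is the shape the 0 / 1 / 100 worker test would check.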
Sequencing within Track A: A.1 → A.2 → A.3 → A.4. A.1 is
mechanical; A.2 + A.3 are the load-bearing changes that unlock lmxopcua
side. A.4 can ship after lmxopcua starts consuming A.3 output. The
historian-write capability moved to Track C below — the gateway
intentionally stays out of aahClientManaged.
Track B — lmxopcua changes
All five PRs land in c:\Users\dohertj2\Desktop\lmxopcua\. Each B-PR
depends on a specific A-PR — see the sequencing matrix below.
PR B.1 — EventPump: dispatch OnAlarmTransition family
Depends on: A.1 (proto), A.3 (gateway dispatching the new family).
Files:
- src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs:160. The current Dispatch(MxEvent ev) returns early for any non-OnDataChange family. Add a branch:

      switch (ev.Family)
      {
          case MxEventFamily.OnDataChange:      DispatchDataChange(ev);      break;
          case MxEventFamily.OnAlarmTransition: DispatchAlarmTransition(ev); break;
          default: return;
      }

- New DispatchAlarmTransition translates the proto event into an AlarmEventArgs (existing type from Core.Abstractions) and raises an internal event the driver subscribes to.
- New MxAccessSeverityMapper in Driver.Galaxy\Runtime\ maps the MxAccess raw severity into the AlarmSeverity enum + the OPC UA numeric severity (250 / 500 / 700 / 900 ladder per v1's AlarmTracking.md).
Tests (tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\):
- EventPumpAlarmTests: feed three synthetic MxEvents (raise / ack / clear); assert each fires OnAlarmEvent on the driver with correct payload.
- Severity-mapping table tests: every documented MxAccess severity level → expected (AlarmSeverity, OPC UA numeric) tuple.
PR B.2 — GalaxyDriver re-implements IAlarmSource
Depends on: A.3 (AcknowledgeAlarm RPC available), B.1 (event
dispatch).
Files:
- src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs:28. Extend the class declaration:

      public sealed class GalaxyDriver
          : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable,
            IRediscoverable, IHostConnectivityProbe, IAlarmSource, IDisposable

- Implement the four IAlarmSource members:
  - SubscribeAlarmsAsync: no-op returning a sentinel handle. The driver is already subscribed for data; alarm events arrive on the same event stream once the gateway emits the new family. (Same pattern AbCip uses today; see Driver.AbCip\AbCipDriver.cs:208.)
  - UnsubscribeAlarmsAsync: no-op.
  - OnAlarmEvent: wired to the EventPump branch added in B.1.
  - AcknowledgeAsync: calls the new gateway RPC via the IGalaxyAlarmAcknowledger abstraction (new file, mirrors the IGalaxyDataWriter pattern), with GatewayGalaxyAlarmAcknowledger as the production implementation in Runtime\. Resilience wrapping via AlarmSurfaceInvoker per existing pattern.
- DriverInstanceFactory for Galaxy registers IGalaxyAlarmAcknowledger alongside the existing data writer.
Tests:
- Subscribe-noop returns a non-null handle; unsubscribe accepts it.
- Acknowledge: fake IGalaxyAlarmAcknowledger records the call; assert the request shape and resilience-pipeline routing.
- End-to-end test in Driver.Galaxy.Tests: fake gateway emits a raise-then-ack event sequence; assert the driver fires OnAlarmEvent twice with matching alarm-id correlation.
PR B.3 — DriverNodeManager: route to driver-native when present
Depends on: B.2.
Files:
- src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs: when registering an AlarmConditionState for a Galaxy variable, check whether the driver is IAlarmSource. If yes, prefer the OnAlarmEvent-driven path; the value-driven sub-attribute path becomes the secondary path that handles transitions the driver-native stream missed (network blip, gateway restart, gw missing the $Alarm* extension on this template).
- Server\Alarms\AlarmConditionService already accepts events from multiple sources; the only addition is a DriverEventOrigin enum on internal transitions so the dedup logic prefers the richer driver-native record over a stale sub-attribute synthesis.
- IAlarmAcknowledger resolution in DriverNodeManager: prefer the driver's IAlarmSource.AcknowledgeAsync over DriverWritableAcknowledger when both are available. Keep DriverWritableAcknowledger as the fallback for templates without $Alarm* extensions.
Tests:
- Two-source-fan-in test: same alarm condition receives both a driver-native ack event and a sub-attribute value update for the same transition; assert no duplicate Part 9 transition fires.
- Acknowledger routing: driver implements IAlarmSource → ack-via-RPC; driver implements only IWritable → ack-via-write (existing path).
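One possible shape for the two-source dedup in PR B.3, sketched in Python. `DriverEventOrigin` is named in the plan, but its members, the `TransitionDeduper` class, and the matching rule are assumptions. In particular, this sketch treats two consecutive transitions of the same kind on one condition as the same transition; a real implementation would probably also compare timestamps.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class DriverEventOrigin(IntEnum):
    SUB_ATTRIBUTE = 1  # synthesized from sub-attribute value updates
    DRIVER_NATIVE = 2  # rich event from IAlarmSource.OnAlarmEvent


@dataclass(frozen=True)
class Transition:
    condition_id: str
    kind: str  # "Raise", "Acknowledge", or "Clear"
    origin: DriverEventOrigin
    operator_comment: Optional[str] = None


class TransitionDeduper:
    """Hypothetical fan-in guard: one Part 9 transition per real transition."""

    def __init__(self) -> None:
        self._last = {}  # condition_id -> last accepted Transition

    def accept(self, transition: Transition) -> bool:
        """Return True when a new Part 9 transition should fire."""
        last = self._last.get(transition.condition_id)
        if last is not None and last.kind == transition.kind:
            # Same transition arriving via the other path: do not fire again,
            # but keep the richer driver-native record if it arrived second.
            if transition.origin > last.origin:
                self._last[transition.condition_id] = transition
            return False
        self._last[transition.condition_id] = transition
        return True

    def current(self, condition_id: str) -> Optional[Transition]:
        return self._last.get(condition_id)
```

Whichever path delivers the ack first causes the single Part 9 transition; if the sub-attribute update wins the race, the later driver-native event still upgrades the stored record, so the operator comment is not lost.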
PR B.4 — IAlarmHistorianWriter via the historian sidecar IPC
Depends on: C.2 (sidecar wires its IAlarmEventWriter). See Track C
for the sidecar-side work; B.4 is the lmxopcua-side consumer.
Files:
- New src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client\SidecarAlarmHistorianWriter.cs implementing IAlarmHistorianWriter. Sends batches over the existing named-pipe IPC using the already-defined WriteAlarmEventsRequest / WriteAlarmEventsReply contracts at Ipc\Contracts.cs:153. No protocol changes: the slot is wired today on the contract side; only the production behaviour and the consumer on this side need to land.
- Server\Phase7\Phase7Composer.ResolveHistorianSink already scans for registered IAlarmHistorianWriter instances. Register the new sidecar-backed writer at server bootstrap when the historian sidecar is enabled (appsettings.json Historian:Wonderware:Enabled = true). SqliteStoreAndForwardSink then boots with a real writer attached, and the NullAlarmHistorianSink fallback no longer applies on installs that have the sidecar deployed.
Tests:
- SidecarAlarmHistorianWriter against a fake PipeServer: single record, batch, per-row failure modes (Ack / RetryPlease / PermanentFail) mapped from the sidecar's PerEventOk[] reply.
- Phase7Composer end-to-end: start the server with the historian sidecar enabled; assert ResolveHistorianSink picks SqliteStoreAndForwardSink with the new sidecar writer attached.
Note on producer scope: This path historizes non-Galaxy alarms
only. Galaxy-native alarms (with $Alarm* extensions) reach AVEVA
Historian directly via System Platform's HistorizeToAveva toggle on
the alarm primitive, with no involvement from us. Today the only live
producer feeding SqliteStoreAndForwardSink is
Phase7EngineComposer.RouteToHistorianAsync for scripted alarms; future
producers (AB CIP ALMD, FOCAS CNC alarms if a customer wants unified
storage) plug into the same path.
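The per-row outcome handling that the store-and-forward path relies on can be sketched as a single drain step. This is a Python illustration under assumptions: `drain_once`, the in-memory queue, and the `write_batch` callable are hypothetical stand-ins for SqliteStoreAndForwardSink, its SQLite-backed queue, and the sidecar-backed writer; only the three outcome names come from the plan.

```python
from collections import deque
from enum import Enum


class HistorianWriteOutcome(Enum):
    ACK = "Ack"
    RETRY_PLEASE = "RetryPlease"
    PERMANENT_FAIL = "PermanentFail"


def drain_once(queue: deque, dead_letters: list, write_batch, batch_size: int = 100) -> int:
    """Send one batch to the writer and apply each row's outcome.

    write_batch(events) must return one outcome per event, in input order.
    Returns the number of acknowledged events.
    """
    batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    if not batch:
        return 0
    outcomes = write_batch(batch)
    if len(outcomes) != len(batch):
        raise ValueError("writer must return one outcome per event")
    acked, retry = 0, []
    for event, outcome in zip(batch, outcomes):
        if outcome is HistorianWriteOutcome.ACK:
            acked += 1
        elif outcome is HistorianWriteOutcome.RETRY_PLEASE:
            retry.append(event)
        else:
            dead_letters.append(event)  # retained for inspection, not resent
    # Put retryable rows back at the front, preserving their original order.
    queue.extendleft(reversed(retry))
    return acked
```

A partially failed batch therefore never blocks the rows that succeeded: acknowledged rows leave the queue, retryable rows go back to the front, and permanent failures move to dead-letter retention.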
PR B.5 — docs + memory housekeeping
Depends on: B.1 / B.2 / B.3 / B.4 all green on the parity rig + D.1 (deployment refresh) verified on the dev rig.
Files:
- docs\drivers\Galaxy.md: current text says the driver implements five capability interfaces; update to seven (IAlarmSource, IAlarmHistorianWriter-via-companion).
- docs\AlarmTracking.md: promote a fresh top-level doc that describes the v2-final architecture (driver-native primary path + sub-attribute fallback + scripted-alarm aggregation). Cross-link from docs\README.md. The v1 archive stays as historical record.
- docs\v1\AlarmTracking.md: extend the existing historical banner with "Restored to functional parity in this epic — see docs\AlarmTracking.md for current state."
- Memory entries (C:\Users\dohertj2\.claude\projects\…\memory\):
  - Update project_galaxy_via_mxgateway.md: add the alarm path restoration.
  - Update project_server_history_alarm_subsystems.md: note that Phase7Composer.ResolveHistorianSink now finds a writer on Galaxy installs.
- Update docs\plans\alarms-over-gateway.md (this file): banner the doc "✅ Completed YYYY-MM-DD — historical record." matching the existing v2-mxgw plan retirement convention.
Track C — historian sidecar wires the dormant write path
The Wonderware historian sidecar at
src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\ is a separately
deployable Windows service (NSSM-wrapped) that already loads
aahClientManaged x64 and serves a named-pipe IPC for read operations.
The WriteAlarmEvents IPC slot is defined but unwired (Program.cs:57
constructs HistorianFrameHandler without an alarmWriter). Track C
completes that slot. Two PRs in the sidecar + one consumer-side PR
(B.4) in lmxopcua finishes the path.
PR C.1 — sidecar: AahClientManagedAlarmEventWriter
Files (src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Backend\):
- New AahClientManagedAlarmEventWriter.cs implementing the existing IAlarmEventWriter interface (defined in Ipc\HistorianFrameHandler.cs:242).
- Implementation calls aahClientManaged's alarm-event write API, the same path v1's GalaxyHistorianWriter used. Use the existing HistorianClusterEndpointPicker for multi-node routing so write failures fail over the same way reads do.
- Batch size + retry behaviour mirrors v1's GalaxyHistorianWriter per-row outcome reporting (HistorianWriteOutcome enum: Ack / PermanentFail / RetryPlease). Map MxStatus codes onto outcomes.
- Reuses HistorianDataSource's existing connection-pool / health gating. No new TCP work needed; the same session that serves reads can issue writes too.
Tests (tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\):
- Outcome-mapping table: every documented MxStatus on alarm-write → expected HistorianWriteOutcome.
- Batching: 1 / 100 / 1000 events through a fake aahClientManaged writer; assert per-row outcome list parallel to input order.
- Cluster failover: primary node returns BadCommunicationError; picker rotates to secondary; assert eventual success.
PR C.2 — sidecar: wire IAlarmEventWriter into Program.cs
Files (src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Program.cs):
- Build an AahClientManagedAlarmEventWriter next to the existing BuildHistorian() call.
- Pass it to HistorianFrameHandler (currently constructed at line 57 without an alarmWriter). The dispatcher already routes WriteAlarmEventsRequest through _alarmWriter when non-null (HistorianFrameHandler.cs:158-172); supplying it makes the slot functional.
- Gate behind a new env var OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED (default true when OTOPCUA_HISTORIAN_ENABLED=true). Lets a read-only deployment skip the writer registration if needed.
- Update the Install-Services.ps1 install-time env block in lmxopcua's scripts\install\ to include the new toggle.
Tests:
- Program.cs unit-test seam: assert handler is constructed with alarm writer when enabled and without when disabled.
- Live integration (parity rig): write a synthetic alarm event through the IPC; query it back via ReadEvents; assert round-trip fidelity.
Sequencing within Track C: C.1 → C.2.
C.2's lmxopcua-side consumer is PR B.4 in Track B, which depends on C.2 being deployed.
Track D — deployment refresh
The dev box at DESKTOP-6JL3KKO runs three live services from
C:\publish\ (installed in the session that produced commit
ea04547's install scripts). Once Tracks A / B / C are merged, the
deployed binaries need to be refreshed so the running services pick
up the new alarm path. Track D is one PR — pure ops, no code change.
PR D.1 — refresh C:\publish + restart services
Depends on: A.4 + B.4 + C.2 merged (every code-change PR landed).
Order matters — services must stop in reverse-dependency order
(OtOpcUa → OtOpcUaWonderwareHistorian → MxAccessGw) and start in
forward-dependency order (MxAccessGw → OtOpcUaWonderwareHistorian
→ OtOpcUa). Touching binaries while a dependent service holds them
locked produces the publish-time MSB3027 file-lock error caught
during the original install (see commit 80104ca).
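The ordering rule can be stated compactly: stop the list reversed, restage, then start the list as written. A small Python sketch, assuming a hypothetical `refresh_plan` helper; only the three service names and their forward order come from the plan, and the real automation is the PowerShell script this PR adds.

```python
# Forward-dependency order as stated in the plan for PR D.1.
FORWARD_ORDER = ("MxAccessGw", "OtOpcUaWonderwareHistorian", "OtOpcUa")


def refresh_plan(services=FORWARD_ORDER):
    """Return (action, target) steps: stop dependents first, start dependencies first."""
    steps = [("stop", name) for name in reversed(services)]
    steps.append(("restage", "binaries"))  # no service holds a file lock here
    steps.extend(("start", name) for name in services)
    return steps
```

Deriving both sequences from one list keeps the stop and start orders from drifting apart if a fourth service is ever added.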
Steps (run as a single PowerShell session on the deploy host):
- Stop in reverse order:

      nssm stop OtOpcUa
      nssm stop OtOpcUaWonderwareHistorian
      nssm stop MxAccessGw
      Start-Sleep -Seconds 3
      Get-Process MxGateway.Server, MxGateway.Worker, OtOpcUa.Server, `
          OtOpcUa.Driver.Historian.Wonderware -ErrorAction SilentlyContinue |
          Stop-Process -Force

- Refresh mxaccessgw binaries (Track A output):

      $gwSrc = "C:\Users\dohertj2\Desktop\mxaccessgw"
      dotnet build "$gwSrc\src\MxGateway.Worker" -c Release
      dotnet build "$gwSrc\src\MxGateway.Server" -c Release
      Copy-Item -Recurse -Force `
          "$gwSrc\src\MxGateway.Server\bin\Release\net10.0\*" `
          "C:\publish\mxaccessgw\Server\"
      Copy-Item -Recurse -Force `
          "$gwSrc\src\MxGateway.Worker\bin\x86\Release\net48\*" `
          "C:\publish\mxaccessgw\Worker\"

- Refresh OtOpcUa + historian sidecar binaries (Tracks B + C output):

      $repo = "C:\Users\dohertj2\Desktop\lmxopcua"
      dotnet publish "$repo\src\ZB.MOM.WW.OtOpcUa.Server" `
          -c Release -o "C:\publish\lmxopcua"
      dotnet publish "$repo\src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware" `
          -c Release -o "C:\publish\lmxopcua\WonderwareHistorian"

- Update service env block if Track C added the new toggle:

      # Pull existing env, append OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED=true
      # (default-on per C.2 design, but explicit assignment lets us flip false
      # for read-only deployments without re-installing)
      nssm set OtOpcUaWonderwareHistorian AppEnvironmentExtra `
          (((nssm get OtOpcUaWonderwareHistorian AppEnvironmentExtra) `
          + "`r`nOTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED=true"))

- Start in forward order:

      nssm start MxAccessGw
      Start-Sleep -Seconds 4
      nssm start OtOpcUaWonderwareHistorian
      Start-Sleep -Seconds 4
      nssm start OtOpcUa
      Start-Sleep -Seconds 8

- Smoke verification:

      foreach ($s in 'MxAccessGw','OtOpcUaWonderwareHistorian','OtOpcUa') {
          (Get-Service $s).Status
      }
      foreach ($p in 5120, 4840, 4841) {
          Get-NetTCPConnection -LocalPort $p -State Listen `
              -ErrorAction SilentlyContinue
      }
      Get-Content "C:\publish\lmxopcua\logs\otopcua-*.log" -Tail 20
      Get-Content "C:\publish\mxaccessgw\stdout.log" -Tail 20
      Get-Content "C:\ProgramData\OtOpcUa\historian-wonderware-*.log" -Tail 10

  Pass criterion: all three services Running; ports 5120 + 4840 listening; sidecar log shows "Wonderware historian sidecar serving — pipe=OtOpcUaWonderwareHistorian"; OtOpcUa log shows "OPC UA server started — endpoint=opc.tcp://0.0.0.0:4840/OtOpcUa" and a new line "IAlarmHistorianWriter resolved: Sidecar" (added in B.4).
- Functional verification: fire one alarm of each kind and assert it propagates:
  - Galaxy-native: raise the OtOpcUaParityTest_001.Counter $Alarm* extension via Galaxy's alarm-fire mechanism; assert an OPC UA Part 9 transition reaches a connected otopcua-cli alarms subscriber with rich payload (operator-comment field non-null, original-raise-timestamp present). This validates Track A + B.1 + B.2 + B.3.
  - Scripted: author a one-line scripted alarm in the Admin UI against any always-true predicate; assert the transition lands in AVEVA Historian via aaHistClientTrend query (or Driver.Historian.Wonderware.IntegrationTests with a query for the alarm event). Validates Track C + B.4.
  - Sub-attribute fallback: disable IAlarmSource on the GalaxyDriver via the test seam (B.3 will introduce one); fire an alarm; assert Part 9 transition still raised by the value-driven path. Validates the fallback wasn't broken.
Files:
- scripts\install\Refresh-Services.ps1 (new; automates the above)
- docs\v2\dev-environment.md: add the refresh script to the dev workflow section.
Tests: smoke run on the dev rig (DESKTOP-6JL3KKO) producing
docs\plans\artifacts\d1-rollout-YYYY-MM-DD.md with the captured log
tails + smoke-test assertions. Captured artifact lands as part of the
PR.
Rollback: the refresh script keeps a timestamped backup of the
existing C:\publish\mxaccessgw\ and C:\publish\lmxopcua\ trees
before overwriting (mirrored to C:\publish\.backup-YYYY-MM-DD\).
Rollback is a stop / restore-from-backup / start sequence; no service
re-install needed since the NSSM service definitions don't change.
Production deploy: out of scope for D.1; the dev rig is the only deployment in scope at this point. A separate PR-or-runbook lands the production refresh once the dev rig has soaked for the documented duration (parity-rig validation gate; see "Test gates" below).
Sequencing matrix
Track A (mxaccessgw) Track B (lmxopcua) Track C (sidecar)
───────────────────────── ───────────────────────── ─────────────────────────
A.1 proto (waits) C.1 AahClientManagedAlarmEventWriter
│ │ no cross-repo dep
├──────────────────────────► B.1 EventPump branch │
A.2 worker subscription │ uses proto types only │
│ │ unit-testable │
│ C.2 Program.cs wires writer
A.3 gateway dispatch + ack RPC ──►B.2 GalaxyDriver : IAlarmSource │
│ │ │
│ ──►B.3 DriverNodeManager routing │
│ │
A.4 ConditionRefresh │ │
│ │
B.4 SidecarAlarmHistorianWriter
(depends on C.2 deployed)
▼
Track D (deployment)
─────────────────────────
D.1 Refresh C:\publish + restart services
(depends on A.4 + B.4 + C.2 merged)
▼
──►B.5 docs + memory + completion banner
A.1 + B.1 + C.1 can all land in parallel — none have cross-repo runtime dependencies. B.1's tests use proto types without needing a running gateway. C.1 is purely sidecar-internal. The gateway-side dispatch (A.3) gates B.2; the sidecar-side wiring (C.2) gates B.4. D.1 (deployment refresh) gates B.5 (docs) — the docs sweep records the as-deployed state, so the deploy must be live first.
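The matrix can be checked mechanically by encoding it as a dependency graph. The Python sketch below uses the standard-library `graphlib`; the `DEPENDS_ON` table is a transcription of the "Depends on" lines and the matrix, with two assumptions: B.1 is given only its compile-time dependency on A.1 (its dependency on A.3 is treated as runtime-only, since its tests need proto types but no running gateway), and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# PR -> PRs that must merge first (transcribed from the plan; see assumptions).
DEPENDS_ON = {
    "A.1": set(),
    "A.2": {"A.1"},
    "A.3": {"A.2"},
    "A.4": {"A.3"},
    "B.1": {"A.1"},  # assumption: merge-order dependency only
    "B.2": {"A.3", "B.1"},
    "B.3": {"B.2"},
    "C.1": set(),
    "C.2": {"C.1"},
    "B.4": {"C.2"},
    "D.1": {"A.4", "B.4", "C.2"},
    "B.5": {"B.1", "B.2", "B.3", "B.4", "D.1"},
}


def landing_order(deps=DEPENDS_ON):
    """Return one valid merge order; raises CycleError if the plan is circular."""
    return list(TopologicalSorter(deps).static_order())


def ready_now(done, deps=DEPENDS_ON):
    """PRs not yet merged whose dependencies have all merged."""
    return sorted(pr for pr, needs in deps.items() if pr not in done and needs <= done)
```

Under this encoding `ready_now(set())` returns A.1 and C.1, and B.1 joins them as soon as the A.1 proto lands; B.5 always sorts last because every other PR is upstream of it.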
Test gates
Per PR: unit tests pass + build green + analyzer clean (Roslyn
OTOPCUA0001 still wraps every alarm-capability call through
AlarmSurfaceInvoker).
End-of-epic gate: re-run the parity rig (docs\v2\Galaxy.ParityRig.md)
with these scenarios added:
- Native alarm raise: Galaxy $Alarm* raise with operator-time metadata appears as an OPC UA Part 9 transition with full payload (no longer reconstructed from sub-attribute writes).
- Native ack: OPC UA client acks; assert the gateway records the ack against MxAccess directly (not via sub-attribute write); operator comment present in the resulting Acknowledged transition.
- ConditionRefresh after reconnect: disconnect the GalaxyDriver, raise three alarms in Galaxy, reconnect; assert all three appear in the next ConditionRefresh.
- Historian write-back: fire a scripted alarm; assert it arrives in AVEVA Historian via the sidecar write path (use the existing Historian sidecar's read API to query it back).
- Sub-attribute fallback still works: disable IAlarmSource on the GalaxyDriver via test seam, fire a sub-attribute value change; assert Part 9 transition still raised.
Soak target: 24h × 1k tags (light) — same parity-rig harness but extended to also subscribe to alarms. Pass criterion: zero dropped alarm transitions, zero state-machine inversions, zero unhandled exceptions in the AlarmSurfaceInvoker pipeline.
Risks and mitigations
| Risk | Mitigation |
|---|---|
| MxAccess Toolkit alarm subscription API differs across installed AVEVA versions | PR A.2 verifies against the worker-host's installed Toolkit version; documents the exact API used. Pin the worker DLL set per major MxAccess version if needed. |
| Worker-side alarm subscription leaks between sessions if cleanup is wrong | PR A.2 includes a session-recycle test that asserts no IAlarmEventSink instances remain registered after Close. |
| Gateway adds a new auth scope (invoke:alarm-ack); existing keys lack it | PR A.3 ships with a one-time bootstrap migration: keys with invoke:write get the new scope auto-granted on the dev rig and parity rig. Production keys are reissued via apikey rotate-key (existing CLI). |
| Two simultaneous alarm sources (driver-native + sub-attribute) double-fire transitions | PR B.3 dedup is the load-bearing design. End-to-end test #1 covers it explicitly. |
| Historian write-back batch fails mid-batch — partial success | The existing SqliteStoreAndForwardSink.HistorianWriteOutcome per-row enum + dead-letter retention already handles this; PR A.5 just exposes the same outcome shape over gRPC. |
| Sidecar starts honouring the WriteAlarmEvents slot — old lmxopcua-side consumers can now reach a previously inert path | The slot returns Success=false, Error="not configured" today; flipping to live writes means a build that speculatively sent the frame would suddenly start producing real historian rows. Inventory of any such caller is empty — WriteAlarmEvents was never invoked from the lmxopcua side; Phase7EngineComposer.RouteToHistorianAsync queues into SqliteStoreAndForwardSink and the drain worker is gated on IAlarmHistorianWriter registration which only the new B.4 path provides. So enabling C.2 without B.4 is safe. |
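The auth-scope mitigation in the table above amounts to a small, idempotent migration. The following is a minimal sketch of that rule, assuming a key store that can be viewed as a mapping from key id to a set of scope strings. That shape and the function name are hypothetical. Only the two scope names come from this plan.

```python
GRANTING_SCOPE = "invoke:write"   # existing scope named in the plan
NEW_SCOPE = "invoke:alarm-ack"    # new scope named in the plan


def migrate_scopes(keys):
    """Grant NEW_SCOPE to every key that already holds GRANTING_SCOPE.

    `keys` is a hypothetical in-memory view: {key_id: set_of_scopes}.
    Returns the ids of keys that were changed, so a second run returns
    an empty list (the migration is idempotent).
    """
    changed = []
    for key_id, scopes in keys.items():
        if GRANTING_SCOPE in scopes and NEW_SCOPE not in scopes:
            scopes.add(NEW_SCOPE)
            changed.append(key_id)
    return changed
```

Keys without invoke:write are left alone, which matches the intent that only callers already trusted to write can acknowledge alarms. Per the table, this applies to the dev rig and parity rig only. Production keys go through apikey rotate-key instead.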
Roll-out
Track A lands first onto mxaccessgw/main, deployed to the parity rig.
Track B lands onto lmxopcua/master once A.3 is live on the rig — earlier
Track B PRs can target a feature branch (feat/alarms-over-gateway) and
merge to master after the rig is fully green.
Back-out
Each PR is individually revertable. The cleanest back-out point is at
the gateway-side enum extension: removing MX_EVENT_FAMILY_ON_ALARM_TRANSITION
from the proto means EventPump silently drops alarm events again and
GalaxyDriver's OnAlarmEvent never fires — but the sub-attribute fallback
path still produces functional alarms, so the OPC UA surface degrades to
v2-current behaviour without breaking. PR B.4 is the only one with a
non-trivial back-out (re-adding the deleted sidecar IPC slot if a revert is
needed); land B.4 last, and only after the end-of-epic gate is green.
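The degradation described above can be pictured as a dispatcher that ignores event families it does not know about. This is a rough model only, not the real EventPump: the class, its fields, and the `MX_EVENT_FAMILY_ON_DATA_CHANGE` name are hypothetical. Only `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` is taken from this plan.

```python
ALARM_FAMILY = "MX_EVENT_FAMILY_ON_ALARM_TRANSITION"  # named in the plan
DATA_FAMILY = "MX_EVENT_FAMILY_ON_DATA_CHANGE"        # hypothetical name


class EventPumpSketch:
    """Toy dispatcher: unknown event families are dropped without error."""

    def __init__(self, known_families):
        self.known = set(known_families)
        self.alarm_events = []   # would feed the driver's OnAlarmEvent
        self.value_updates = []  # would feed the sub-attribute fallback path
        self.dropped = 0

    def dispatch(self, family, payload):
        if family not in self.known:
            self.dropped += 1    # silent drop, no exception raised
            return
        if family == ALARM_FAMILY:
            self.alarm_events.append(payload)
        else:
            self.value_updates.append(payload)


# After the revert the pump no longer knows the alarm family.
pump = EventPumpSketch({DATA_FAMILY})
pump.dispatch(ALARM_FAMILY, {"ref": "Tank1.HiAlarm"})
pump.dispatch(DATA_FAMILY, {"ref": "Tank1.HiAlarm.InAlarm", "value": True})
```

In this model the native alarm event is discarded while the InAlarm value update still arrives, which is the input the sub-attribute fallback needs to keep producing alarms.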
Out of scope (explicit)
- Other alarm sources beyond Galaxy. AbCip / FOCAS / OpcUaClient drivers already implement IAlarmSource; they're untouched.
- Modbus / S7 / AbLegacy / TwinCAT alarms. None of those protocols has a native alarm bus. Alarms on those drivers, if needed, ship via the scripted-alarm path.
- Multi-Galaxy ack routing. Today's gateway model is one Galaxy per session; if a deployment splits across galaxies, each gets its own GalaxyDriver and they don't cross-talk. No change.
- OPC UA Part 9 advanced features beyond the current scope — shelving, subscribed-to-events-only, branch-state for re-trigger semantics. Future epic if a customer asks.
- Insight / cloud Historian write-back path. Track A.5 targets the
on-prem AVEVA Historian via aahClientManaged. The cloud variant
would mirror the same gateway RPC over the REST API discussed in
docs/histsdk — separate epic.
File inventory (touched)
mxaccessgw (Track A):
- src\MxGateway.Contracts\Protos\mxaccess_gateway.proto (A.1)
- src\MxGateway.Contracts\Protos\mxaccess_worker.proto (A.2, A.4)
- src\MxGateway.Worker\…\Eventing\ (A.2, A.3, A.4)
- src\MxGateway.Worker\…\Commands\ (A.3, A.4)
- src\MxGateway.Server\Sessions\SessionEventStream.cs (A.3)
- src\MxGateway.Server\Rpc\ (A.3, A.4)
- src\MxGateway.Server\Auth\Scopes.cs (A.3, A.4)
- MxGateway.Tests, MxGateway.Worker.Tests, MxGateway.IntegrationTests
lmxopcua — Galaxy driver + server (Track B):
- src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs (B.1)
- src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\MxAccessSeverityMapper.cs (new — B.1)
- src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\IGalaxyAlarmAcknowledger.cs (new — B.2)
- src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayGalaxyAlarmAcknowledger.cs (new — B.2)
- src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs (B.2)
- src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriverFactory.cs (B.2)
- src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs (B.3)
- src\ZB.MOM.WW.OtOpcUa.Server\Alarms\AlarmConditionService.cs (B.3)
- src\ZB.MOM.WW.OtOpcUa.Server\Phase7\Phase7Composer.cs (B.4)
- src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client\SidecarAlarmHistorianWriter.cs (new — B.4)
- tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\ (B.1, B.2)
- tests\ZB.MOM.WW.OtOpcUa.Server.Tests\Alarms\ (B.3)
- tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.Tests\ (B.4 — new tests)
- docs\drivers\Galaxy.md (B.5)
- docs\AlarmTracking.md (new — B.5)
- docs\v1\AlarmTracking.md (B.5 — banner update)
- docs\plans\alarms-over-gateway.md (B.5 — completion banner)
lmxopcua — Wonderware historian sidecar (Track C):
- src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Backend\AahClientManagedAlarmEventWriter.cs (new — C.1)
- src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Program.cs (C.2 — wire writer)
- scripts\install\Install-Services.ps1 (C.2 — env-var toggle for write-enable)
- tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\ (C.1 — outcome mapping + batch + cluster failover)
lmxopcua — deployment refresh (Track D):
- scripts\install\Refresh-Services.ps1 (new — D.1)
- docs\v2\dev-environment.md (D.1 — document the refresh workflow)
- docs\plans\artifacts\d1-rollout-YYYY-MM-DD.md (new — D.1 captured smoke run)
Total: ~10 source files added/modified in mxaccessgw; ~14 in lmxopcua proper; ~3 in the historian sidecar; ~2 deployment scripts; ~12 test files across all repos. The epic should land in 4-6 weeks of focused work given the parity-rig dependency for end-to-end validation, plus a short final-week ops slot for D.1.