Commit Graph

12 Commits

Author SHA1 Message Date
Joseph Doherty 2697af31d1 feat(opcua,host): #81 ServiceLevel SDK publisher
SdkServiceLevelPublisher writes Server.ServiceLevel through the SDK's
ServerObjectState — the standard OPC UA non-transparent-redundancy signal
clients use to pick a primary. Writes are guarded by DiagnosticsLock so
concurrent SDK diagnostics scans don't fight with our updates.

DeferredServiceLevelPublisher mirrors the DeferredAddressSpaceSink late-
binding pattern: Akka actors resolve IServiceLevelPublisher at construction,
hosted service swaps the SDK publisher in after StandardServer.Start. Host
Program.cs registers DeferredServiceLevelPublisher as the singleton bound
to IServiceLevelPublisher; OtOpcUaServerHostedService gets it injected and
fills it once IServerInternal is available.

Tests boot a real StandardServer on a free port (cross-platform), call
Publish, then verify ServerObject.ServiceLevel.Value reflects the write.
5 new tests; OpcUaServer suite now 45/45 green (was 40, +5).

Closes #81 residual. Unblocks Task 60 (OPC UA dual-endpoint + ServiceLevel
tests).
2026-05-26 10:37:42 -04:00
Joseph Doherty 52997ee164 feat(observability): F13d Prometheus + OpenTelemetry instrumentation
v2-ci / build (push) Failing after 38s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (push) Has been skipped
OtOpcUaTelemetry (Commons/Observability) centralizes the project's Meter
+ ActivitySource so all instrumentation points emit through a single
named surface. Counters cover the hot paths:

  otopcua.deploy.applied               (outcome=ack|reject)
  otopcua.deploy.apply.duration        (s, histogram)
  otopcua.driver.lifecycle             (event=spawn|spawn_stub|stop|fault)
  otopcua.virtualtag.eval              (outcome=ok|fail|skip)
  otopcua.scriptedalarm.transition     (state=activated|acknowledged|cleared)
  otopcua.opcua.sink.write             (kind=value|alarm|rebuild)
  otopcua.redundancy.service_level_change (level=byte)

Plus two ActivitySource spans:

  otopcua.deploy.apply                 wraps DriverHostActor.ApplyAndAck
  otopcua.opcua.address_space_rebuild  wraps OpcUaPublishActor.HandleRebuild

Instruments are no-op until a listener attaches, so tests + dev hosts
pay nothing for unread telemetry.

Host Program.cs gains AddOtOpcUaObservability() (binds the OtOpcUa Meter
+ ActivitySource to OpenTelemetry, attaches a Prometheus exporter) and
MapOtOpcUaMetrics() (mounts /metrics scrape endpoint). Driver-side
internals + ASP.NET request metrics deliberately stay off — the scrape
payload is scoped to OtOpcUa signals only.

Tests use MeterListener + ActivityListener to verify
VirtualTagActor.eval, OpcUaPublishActor.AttributeValueUpdate, and
RebuildAddressSpace actually emit on the central instruments. Runtime
suite is 72 / 72 green (+3).

Closes #105. Path A (F13b/c/d) complete; next batch options: #85 UNS
folder hierarchy in SDK, or F8b/F9b production engine bindings.
2026-05-26 10:29:40 -04:00
Joseph Doherty 50787823d3 feat(host,runtime): #108 Host DI bindings — OPC UA server + deferred sink
v2-ci / build (push) Failing after 45s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (push) Has been skipped
Wires the OPC UA SDK into the fused Host's lifecycle on driver-role
nodes + spawns OpcUaPublishActor with the proper sink/publisher/dbFactory/
applier resolution. The full read+write data path is now live in
production: Deploy → DriverHost → OpcUaPublish → SDK NodeManager →
subscribed OPC UA clients.

DeferredAddressSpaceSink (Commons.OpcUa):
  - Thread-safe wrapper IOpcUaAddressSpaceSink that delegates to an
    inner sink swapped in at runtime. Needed because Akka actors
    resolve the sink at construction time, but the production sink
    (SdkAddressSpaceSink wrapping OtOpcUaNodeManager) only exists
    after the SDK StandardServer has started.
  - Defaults to NullOpcUaAddressSpaceSink so calls before swap are
    safe; SetSink(null) reverts (for graceful shutdown).

OtOpcUaServerHostedService (Host.OpcUa):
  - IHostedService that owns the OPC UA SDK lifecycle. Reads
    OpcUaApplicationHostOptions from the 'OpcUa' config section,
    creates an OtOpcUaSdkServer, boots it through OpcUaApplicationHost,
    then swaps a real SdkAddressSpaceSink into the DeferredAddressSpaceSink
    singleton.
  - SDK boot failure is logged + non-fatal — the rest of the host
    (admin UI, driver actors) keeps running. Stop reverts to null sink.

WithOtOpcUaRuntimeActors (Runtime):
  - Now spawns OpcUaPublishActor (new actor) + threads its ActorRef
    into DriverHostActor's Props so successful applies trigger the
    address-space rebuild pipeline.
  - Phase7Applier is constructed here from the resolved sink + a
    logger; OpcUaPublishActor takes both.
  - Prepends the opcua-synchronized-dispatcher HOCON so the extension
    is self-contained — consumers (Host, tests) don't need to redeclare
    the dispatcher block.
  - New OpcUaPublishActorKey + OpcUaPublishActorName for actor-registry
    resolution.
  - AddOtOpcUaRuntime now also TryAddSingleton's NullOpcUaAddressSpaceSink
    + NullServiceLevelPublisher so admin-only nodes (or tests that
    don't bind the Deferred sink) stay safe.

Host.Program.cs (driver-role only):
  - Binds DeferredAddressSpaceSink as singleton + as IOpcUaAddressSpaceSink
  - AddHostedService<OtOpcUaServerHostedService>()

Tests: OpcUaServer 24 -> 28 (+4 DeferredAddressSpaceSink unit tests),
Runtime 69 -> 69 (existing ServiceCollectionExtensionsTests extended
to verify the new mux + publish actor registration).

All 6 v2 test suites green: 177 tests passing.

Closes #108. Engine-wiring is now production-bound end-to-end on
driver-role nodes — Deploy reaches real OPC UA Variable nodes that
subscribed clients see.
2026-05-26 10:02:15 -04:00
Joseph Doherty f427dc4f26 feat(runtime): #112 ScriptedAlarmActor state persistence via IAlarmActorStateStore
v2-ci / build (push) Failing after 42s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (push) Has been skipped
ScriptedAlarmActor now survives actor restart: PreStart loads from
the configured store + restores in-memory state; every Transition()
fires a fire-and-forget save. ActiveState still re-derives from the
evaluator on first tick (Phase 7 decision #14), but Acked state +
lastAckUser persist verbatim so operators don't re-ack across an
outage.

Three pieces:
- IAlarmActorStateStore seam in Commons.Engines, with the
  AlarmActorStateSnapshot record (alarmId / state / lastTransitionUtc
  / lastAckUser) and NullAlarmActorStateStore default.
- EfAlarmActorStateStore in Runtime.ScriptedAlarms — production
  adapter over the existing ScriptedAlarmState table in ConfigDb.
  Maps the actor's 3-state enum to the table's AckedState column
  (Active⇒Unacknowledged, Acknowledged⇒Acknowledged, Inactive⇒
  Acknowledged). Concurrency conflicts are logged + dropped — the
  next transition writes again.
- ScriptedAlarmActor PreStart load (async, piped back as
  StateRestored) + Transition save. New Props overload takes the
  store; default is NullAlarmActorStateStore so tests stay quiet.

Tests: Runtime 52 -> 57 (+5):
- Transition writes Active then Acknowledged snapshots with
  lastAckUser populated
- PreStart with persisted Active state restores so a subsequent
  AcknowledgeAlarm fires (not ignored as it would be from Inactive)
- Empty store boots Inactive (AcknowledgeAlarm correctly ignored)
- EfAlarmActorStateStore Save + Load round-trips via in-memory EF
- Load for unknown alarmId returns null

All 6 v2 test suites green: 157 tests passing.

Closes #112. F9 (#80) remaining residual is predicate binding to
Core.ScriptedAlarms.ScriptedAlarmEngine — split as F9b in tasks JSON.
2026-05-26 09:34:37 -04:00
Joseph Doherty a1325299ce feat(runtime): F10 OpcUaPublishActor sink seams + redundancy-driven ServiceLevel
OpcUaPublishActor now routes through pluggable seams instead of just
incrementing a counter:

- IOpcUaAddressSpaceSink (Commons.OpcUa) — WriteValue / WriteAlarmState
  / RebuildAddressSpace. OpcUaQuality enum moved here from the actor's
  nested type so producers don't have to reference the actor itself.
- IServiceLevelPublisher — Publish(byte). NullServiceLevelPublisher
  retains the last level for inspection.
- The actor subscribes to the redundancy-state DPS topic in PreStart
  and maps the local node's NodeRedundancyState to a coarse
  ServiceLevel (Primary+leader=240, Primary=200, Secondary=100,
  Detached=0). This keeps the local SDK's ServiceLevel node honest
  without round-tripping back through the admin-singleton calculator.
- ServiceLevelChanged dedupes identical levels so the SDK doesn't see
  redundant writes.
- Sink + publisher exceptions are caught and logged; the actor never
  crashes its own dispatcher.
- PropsForTests gets optional sink/publisher/localNode params and
  skips the DPS subscribe so unit tests stay on a vanilla TestKit
  cluster.

Production binding to a real SDK NodeManager + Variable nodes is the
remaining residual — split as F10b. Task 60 still blocked on F10b.

Tests: Runtime 40 -> 46 (+6):
- AttributeValueUpdate routes to sink
- AlarmStateUpdate routes to sink
- RebuildAddressSpace calls sink.Rebuild
- ServiceLevelChanged dedupes
- RedundancyStateChanged for primary-leader publishes 240
- RedundancyStateChanged for secondary publishes 100

All 6 v2 test suites green: 132 tests passing.
2026-05-26 09:10:55 -04:00
Joseph Doherty 14fb2b05ed feat(runtime): F8/F9 engine evaluator seams + DPS fan-out
VirtualTagActor and ScriptedAlarmActor now route through pluggable
evaluator interfaces and fan out to the cluster's live-tail topics
shipped in F15.3:

- IVirtualTagEvaluator + NullVirtualTagEvaluator in Commons.Engines.
  VirtualTagActor calls evaluator on every DependencyValueChanged,
  dedupes unchanged values, forwards EvaluationResult to its parent,
  and publishes ScriptLogEntry Warning to the script-logs DPS topic
  whenever the evaluator fails.

- IScriptedAlarmEvaluator + NullScriptedAlarmEvaluator. ScriptedAlarmActor
  takes an AlarmConfig (id/name/equipment-path/severity/predicate) and
  publishes both an AlarmTransitionEvent (alerts topic) and a
  ScriptLogEntry (script-logs topic) at every transition. Manual
  ConditionMet/Acknowledge/Cleared still flow through the same
  Transition() so callers without engine bindings still drive the
  state machine; the legacy single-string Props() overload routes
  through a default AlarmConfig.

The Null* defaults keep the actors safe when no engine is bound —
unconfigured nodes never spuriously alarm. Production binding to
Core.VirtualTags.VirtualTagEngine and Core.ScriptedAlarms is the
remaining residual (F8b/F9b — split in tasks JSON).

Tests: Runtime 34 -> 40 (+6):
- VirtualTagActorTests x3 (evaluator drives EvaluationResult,
  unchanged-value dedup, failure publishes Warning ScriptLogEntry)
- ScriptedAlarmActorTests x3 (engine threshold drives Activated +
  Cleared on alerts topic, manual Acknowledge attribution).

All 6 v2 test suites green: 126 tests passing.
2026-05-26 09:05:04 -04:00
Joseph Doherty 59858129cb feat(adminui): F15.3 closes F15 — live alerts/script-log, CSV import, Monaco editor
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been cancelled
v2-ci / build (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been cancelled
v2-ci / integration (push) Has been cancelled
Final F15 batch wires up the SignalR-backed live pages, ports the bulk
equipment importer, and progressively enhances the Script source editor
with Monaco.

Message contracts:
- Commons.Messages.Alerts.AlarmTransitionEvent — fires on every alarm
  state transition; published on the `alerts` DPS topic by future
  ScriptedAlarmActor (F9) emits.
- Commons.Messages.Logging.ScriptLogEntry — one log line emitted by a
  hosted script; published on the `script-logs` DPS topic by future
  VirtualTagActor (F8) + ScriptedAlarmActor (F9) emits.
  (Folder named "Logging" to dodge .gitignore's "logs/" rule.)

SignalR plumbing:
- AlertHub gains MethodName + bridge actor (AlertSignalRBridge)
- ScriptLogHub introduced; ScriptLogSignalRBridge follows the same
  DPS-subscribe → IHubContext fan-out pattern as FleetStatusSignalRBridge
- WithOtOpcUaSignalRBridges now spawns all three bridges
- MapOtOpcUaHubs maps /hubs/script-log alongside the existing hubs

Pages:
- /alerts                      live alarm tail, 200-row capacity
- /script-log                  live script-log tail with level + script
                               filter, 500-row capacity
- /clusters/{id}/equipment/import — CSV bulk Equipment add with preview
                                    (Name/MachineCode/UnsLineId/Driver +
                                    optional ZTag/SAPID/Manufacturer/Model;
                                    skips rows whose MachineCode already
                                    exists in the fleet)
- ScriptEdit progressively enhanced with Monaco editor via JSInterop —
  the textarea remains Blazor's source of truth and Monaco syncs into it
  on every keystroke so @bind keeps working; falls back gracefully if
  the CDN is unreachable.

MainLayout nav gains a "Live" section (Deployments, Alerts, Alarms
historian) and a "Scripts" link under Scripting. ClusterEquipment
surfaces the new Import CSV button.

Tally: F15 ships ~42 razor pages + 3 SignalR hubs + 3 bridge actors.
Microsoft.AspNetCore.SignalR.Client added (was already in central PM).

All 104 v2 tests remain green.
2026-05-26 08:39:17 -04:00
Joseph Doherty 8f32b89fb9 feat(adminui): FleetDiagnosticsClient real Akka ActorSelection round-trip (F17)
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been cancelled
v2-ci / build (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been cancelled
v2-ci / integration (push) Has been cancelled
- New Commons.Messages.Fleet.GetDiagnostics request record.
- DriverHostActor handles GetDiagnostics in all three states (Steady, Applying,
  Stale); replies with a NodeDiagnosticsSnapshot built from _currentRevision
  + the local NodeId. Drivers list is empty until F7 wires the per-instance
  children.
- FleetDiagnosticsClient now resolves the target via ActorSelection at
  akka.tcp://{system}@{nodeId}/user/driver-host and Asks with a 3s timeout.
  On timeout/peer-down it returns an empty snapshot so the UI degrades
  gracefully rather than throwing.

Two new integration tests in Host.IntegrationTests:
- GetDiagnostics_returns_snapshot_with_target_NodeId verifies the
  cross-node Ask/Reply works.
- GetDiagnostics_after_deploy_reports_current_revision exercises the
  end-to-end path: AdminOps starts a deployment, both DriverHostActors
  apply, then diagnostics reports the new revision on both nodes.

All 98 v2 tests pass (was 96 + 2 new).
2026-05-26 06:58:11 -04:00
Joseph Doherty 136234e7f2 feat(commons): add cluster/admin/diagnostics client interfaces 2026-05-26 04:27:19 -04:00
Joseph Doherty 5d3a5a40d7 feat(commons): add deploy/admin/audit/redundancy/fleet message contracts 2026-05-26 04:27:18 -04:00
Joseph Doherty fee4a8c008 feat(commons): add correlation/execution/node/deployment/revisionhash types 2026-05-26 04:26:01 -04:00
Joseph Doherty 30a2104fa5 feat(scaffold): introduce 8 v2 component projects
Adds the empty project skeletons that subsequent v2 tasks fill in:

  src/Core/ZB.MOM.WW.OtOpcUa.Commons      (types, interfaces, message contracts)
  src/Core/ZB.MOM.WW.OtOpcUa.Cluster      (Akka.Hosting + cluster wiring)
  src/Server/ZB.MOM.WW.OtOpcUa.Security   (cookie+JWT auth, LDAP)
  src/Server/ZB.MOM.WW.OtOpcUa.ControlPlane (admin-role cluster singletons)
  src/Server/ZB.MOM.WW.OtOpcUa.Runtime    (per-node driver actors)
  src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer (OPC UA SDK application host)
  src/Server/ZB.MOM.WW.OtOpcUa.AdminUI    (Razor class library)
  src/Server/ZB.MOM.WW.OtOpcUa.Host       (single fused web binary)

Each project sets TreatWarningsAsErrors=true in its own csproj (per the
Directory.Build.props deviation note in the previous commit). NuGetAuditSuppress
entries cover transitive vulnerability advisories the new strictness surfaces:

  - GHSA-g94r-2vxg-569j (OpenTelemetry.Api 1.9.0 via Akka.Cluster.Hosting/Tools)
  - GHSA-h958-fxgg-g7w3 (Opc.Ua.Core 1.5.374.126 via OpcUaServer)
  - GHSA-37gx-xxp4-5rgx + GHSA-w3x6-4m5h-cxqf (legacy advisories already accepted)

OpcUaServer pins OPCFoundation.NetStandard.Opc.Ua.Configuration to 1.5.374.126
via VersionOverride to match Opc.Ua.Server's transitive Opc.Ua.Core (same
constraint as the legacy Server project).

Runtime does NOT project-reference any concrete Driver.* assemblies; drivers
load reflectively at runtime (Phase 6). Runtime gets the IDriver contract
through Core.Abstractions instead.

Host's Microsoft.Extensions.Hosting.WindowsServices is conditional on the
Windows OS so the project builds on macOS dev machines.

Build verification: dotnet build -> 438 warnings (all pre-existing xUnit1051
in legacy Server.Tests/Admin.Tests), 0 errors. Closes Task 9 (build green
smoke check, no separate commit).
2026-05-26 03:44:56 -04:00