Joseph Doherty 99354bfaf2 fix(core-scripted-alarms): resolve Low code-review findings (Core.ScriptedAlarms-003,006,008,010,011; -009 documented)

- Core.ScriptedAlarms-003: emit OnEvent OUTSIDE _evalGate by collecting
  pending emissions during the gate-held section and flushing them after
  release; eliminates re-entrancy deadlock the docs already promised.
- Core.ScriptedAlarms-006: track every fire-and-forget Reevaluate /
  ShelvingCheck task in _inFlight; Dispose drains the set so the engine
  no longer races store writes against teardown.
- Core.ScriptedAlarms-008: store comments as ImmutableList<AlarmComment>
  so AppendComment is O(log n) instead of O(n).
- Core.ScriptedAlarms-010: document the deliberate input-quality
  asymmetry (Uncertain drives the predicate, renders {?} in the message)
  in docs/ScriptedAlarms.md and on MessageTemplate.Resolve remarks.
- Core.ScriptedAlarms-011: propagate the no-op reason through
  TransitionResult.NoOp(state, reason) and log it from
  ScriptedAlarmEngine.ApplyAsync.
- Core.ScriptedAlarms-009 (Won't Fix per recommendation): documented the
  per-evaluation dictionary allocation in docs/v2/Galaxy.Performance.md
  with a mitigation path if a future soak surfaces pressure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-23 07:23:31 -04:00

Galaxy backend performance

This document covers the performance surface of the in-process GalaxyDriver (the v2 mxgw backend) — the ActivitySource it emits, the metrics on its EventPump, the soak scenario that validates it, and the tuning knobs you can reach for when the dev parity rig surfaces a hot spot.

Tracing surface (PR 6.1)

The driver emits spans on the ZB.MOM.WW.OtOpcUa.Driver.Galaxy ActivitySource. It takes no package dependency on OpenTelemetry; the host process picks the listener (OTLP exporter, dotnet-trace, Application Insights). With OpenTelemetry, register the source by calling AddSource(...) on the TracerProviderBuilder in the host's tracing pipeline.

| Span | Source | Tags |
|---|---|---|
| `galaxy.subscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count`, `galaxy.buffered_interval_ms`, `galaxy.success_count` |
| `galaxy.unsubscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count` |
| `galaxy.stream_events` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.event_count` (set on stream end) |
| `galaxy.write` | `TracedGalaxyDataWriter` | `galaxy.client`, `galaxy.tag_count`, `galaxy.secured_write_count`, `galaxy.success_count` |
| `galaxy.get_hierarchy` | `TracedGalaxyHierarchySource` | `galaxy.client`, `galaxy.object_count` |

The stream-events span deliberately covers the entire stream lifetime rather than per-event spans — at 50k tags / 1Hz the per-event volume would dominate the trace pipeline. Per-event visibility flows through the metrics surface instead.

Metrics surface (PR 6.2)

EventPump publishes three counters on the ZB.MOM.WW.OtOpcUa.Driver.Galaxy meter, each tagged with galaxy.client so multi-driver hosts can split by source:

| Counter | Unit | Meaning |
|---|---|---|
| `galaxy.events.received` | `{event}` | MxEvents read from the gateway StreamEvents stream |
| `galaxy.events.dispatched` | `{event}` | MxEvents that made it through the bounded channel into `OnDataChange` |
| `galaxy.events.dropped` | `{event}` | MxEvents discarded because the bounded channel was full (newest-dropped) |

The invariant is received = dispatched + dropped + (in-flight in the channel). Watch the dropped counter — it is the leading indicator of listener back-pressure. A non-zero dropped rate means a downstream consumer (DriverNodeManager → UA notification queue → client) is slower than the gw event stream; investigate that consumer before raising EventPump channel capacity.

Bounded channel design

The pump runs two background tasks:

- Producer: reads from IGalaxySubscriber.StreamEventsAsync, increments events.received, and TryWrites into a bounded Channel<MxEvent>. When the channel is full, the producer counts the drop and keeps reading the gw stream so that back-pressure does not propagate upstream (which would stall the gw worker and cascade to all driver instances sharing that worker).
- Consumer: reads from the channel, fans out via SubscriptionRegistry, and increments events.dispatched.

Default channel capacity is 50_000 (one second of headroom at 50k tags / 1Hz). Override it via the EventPump constructor's channelCapacity parameter. The public-facing wiring path in GalaxyDriver.EnsureEventPumpStarted does not yet expose this through GalaxyDriverOptions because no parity scenario has needed it; add the option when soak data shows a need for it.
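The producer/consumer split and the counter invariant can be modelled in a few lines. The following is a minimal Python sketch, not the driver's C# implementation: `EventPumpSketch` and its method names are hypothetical, and a `queue.Queue` stands in for the bounded `Channel<MxEvent>`.

```python
import queue


class EventPumpSketch:
    """Toy model of the bounded-channel pump (hypothetical, not the driver's code)."""

    def __init__(self, channel_capacity: int = 50_000) -> None:
        self.channel: queue.Queue = queue.Queue(maxsize=channel_capacity)
        self.received = 0
        self.dispatched = 0
        self.dropped = 0

    def produce(self, event: object) -> None:
        # Producer side: count the event, then attempt a non-blocking write.
        self.received += 1
        try:
            self.channel.put_nowait(event)
        except queue.Full:
            # Channel full: discard the newest event and keep reading upstream,
            # so the stream reader is never blocked by a slow consumer.
            self.dropped += 1

    def consume_one(self, on_data_change) -> bool:
        # Consumer side: take one event and fan it out; False when the channel is empty.
        try:
            event = self.channel.get_nowait()
        except queue.Empty:
            return False
        on_data_change(event)
        self.dispatched += 1
        return True

    def invariant_holds(self) -> bool:
        # received = dispatched + dropped + (in-flight in the channel)
        in_flight = self.channel.qsize()
        return self.received == self.dispatched + self.dropped + in_flight


pump = EventPumpSketch(channel_capacity=3)
for i in range(5):  # five events into a three-slot channel
    pump.produce(i)
seen = []
pump.consume_one(seen.append)
```

With a three-slot channel and five events, the last two are counted as dropped, one is dispatched, and two remain in flight, so the invariant still balances. The real pump runs the two sides as concurrent background tasks; this sketch drives them by hand to keep the arithmetic visible.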

Buffered update interval (PR 6.3)

MxAccess.PublishingIntervalMs (default 1000) flows through both subscribe paths:

- GalaxyDriver.SubscribeAsync: the caller's publishingInterval wins when non-zero (the server's UA subscription publishingInterval drives this in production). When the caller passes TimeSpan.Zero, the configured option is the fallback.
- PerPlatformProbeWatcher: the watcher passes the configured value through SubscribeBulkAsync so probe ScanState changes publish at the deployment's chosen cadence.
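The fallback rule for the first path reduces to a single comparison. This is a minimal Python sketch of that rule only; the function name `effective_publishing_interval` is hypothetical, and `timedelta` stands in for the .NET `TimeSpan`.

```python
from datetime import timedelta


def effective_publishing_interval(caller_interval: timedelta,
                                  configured_ms: int = 1000) -> timedelta:
    """Pick the interval to subscribe with (hypothetical helper)."""
    # A non-zero caller interval wins; zero falls back to the configured
    # MxAccess.PublishingIntervalMs value.
    if caller_interval != timedelta(0):
        return caller_interval
    return timedelta(milliseconds=configured_ms)


from_caller = effective_publishing_interval(timedelta(milliseconds=250))
from_config = effective_publishing_interval(timedelta(0))
```

Here `from_caller` keeps the 250 ms the caller asked for, while `from_config` resolves to the 1000 ms default.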

A session-level SetBufferedUpdateInterval RPC exists in the gw protocol but the .NET client doesn't expose a typed helper yet — adjusting an existing subscription's interval mid-flight is a follow-up. Today's path subscribes once at the right interval, which covers the common case.

Soak scenario (PR 6.4)

SoakScenarioTests.Soak_HoldsSubscription_AndKeepsEventStreamFlowing in Driver.Galaxy.ParityTests is the long-running validation. It subscribes a configurable tag count (default 50_000), holds the subscription for a configurable duration (default 24h), polls the three counters every minute, and asserts:

- events.received continues to grow (gw stream isn't stuck)
- events.dropped / events.received stays under the configured ceiling (default 0.5%)
- process working-set doesn't grow more than 1 GB above baseline (leak guard)

Always skipped unless the operator opts in:

```shell
# Full 24h × 50k soak (production validation)
OTOPCUA_SOAK_RUN=1 dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/

# Compressed CI-friendly run (10min × 1k tags, 1% drop ceiling)
OTOPCUA_SOAK_RUN=1 OTOPCUA_SOAK_MINUTES=10 OTOPCUA_SOAK_TAGS=1000 \
  OTOPCUA_SOAK_DROP_PCT=1.0 \
  dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
```

The scenario writes a per-minute CSV-style row to stdout (soak,<minutes>,received=…,dispatched=…,dropped=…,ws_mb=…) so an operator can grep the test runner output mid-run.
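An operator who captures those rows can apply the same three checks offline. The sketch below is a hedged Python illustration, not part of the test project: `parse_soak_row` and `check_sample` are hypothetical names, the numeric form of each field is an assumption (the row format above elides the values), 1 GB is taken as 1024 MB, and the sample rows are invented.

```python
def parse_soak_row(line: str) -> dict:
    """Parse one 'soak,<minutes>,key=value,...' row into a dict of floats."""
    parts = line.strip().split(",")
    if len(parts) < 2 or parts[0] != "soak":
        raise ValueError(f"not a soak row: {line!r}")
    row = {"minutes": float(parts[1])}
    for field in parts[2:]:
        key, _, value = field.partition("=")
        row[key] = float(value)  # assumption: every value is numeric
    return row


def check_sample(prev: dict, cur: dict, baseline_ws_mb: float,
                 drop_ceiling_pct: float = 0.5,
                 max_ws_growth_mb: float = 1024.0) -> list:
    """Return the violated soak conditions; an empty list means healthy."""
    problems = []
    # 1. The received counter must keep growing between polls.
    if cur["received"] <= prev["received"]:
        problems.append("events.received stopped growing")
    # 2. dropped / received must stay under the ceiling.
    if cur["received"] > 0:
        pct = 100.0 * cur["dropped"] / cur["received"]
        if pct >= drop_ceiling_pct:
            problems.append("drop ratio at or above ceiling")
    # 3. Working set must stay within the leak guard above baseline.
    if cur["ws_mb"] - baseline_ws_mb > max_ws_growth_mb:
        problems.append("working set grew past the leak guard")
    return problems


# Invented sample rows, for illustration only.
prev = parse_soak_row("soak,9,received=540000,dispatched=539000,dropped=1000,ws_mb=410")
cur = parse_soak_row("soak,10,received=600000,dispatched=598800,dropped=1200,ws_mb=412")
problems = check_sample(prev, cur, baseline_ws_mb=400.0)
```

For these invented rows the drop ratio is 0.2%, under the 0.5% default, so `problems` comes back empty.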

Tuned defaults (PR 6.5)

| Option | Default | Source | Notes |
|---|---|---|---|
| `Gateway.ConnectTimeoutSeconds` | 10 | unchanged | Cold-start network paths fit comfortably; soak never observed >2s |
| `Gateway.DefaultCallTimeoutSeconds` | 30 | bumped from 5 in PR 6.5 | A 50k-tag `SubscribeBulk` can exceed 5s under MxAccess COM apartment lock contention; 30s leaves headroom while still failing fast on a wedged worker |
| `Gateway.StreamTimeoutSeconds` | 0 (unlimited) | unchanged | The stream must run for the lifetime of the driver |
| `MxAccess.PublishingIntervalMs` | 1000 | unchanged | Matches the legacy `LMXProxyServer` cadence; deployments needing tighter health visibility can dial down |
| `Reconnect.InitialBackoffMs` | 500 | unchanged | First retry shouldn't dogpile a recovering gw |
| `Reconnect.MaxBackoffMs` | 30_000 | unchanged | 30s ceiling so a long-down gw doesn't sit in 5+ min backoff |
| `Repository.DiscoverPageSize` | 5000 | unchanged | One Galaxy page round-trip per ~5k objects; soak hadn't surfaced pressure |
| `EventPump` channel capacity | 50_000 | unchanged | One second of headroom at 50k tags / 1Hz |

The unchanged rows are not "definitely correct"; they are "no live data argues for changing them." Re-run the soak scenario after every substantive driver change, and revise this table when the data calls for it.
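To see how the two Reconnect values bound the retry delays, the sketch below prints a capped schedule. The growth policy is an assumption: this document gives only the initial and maximum backoff, so the doubling factor and the function name `backoff_schedule_ms` are hypothetical.

```python
def backoff_schedule_ms(initial_ms: int = 500, max_ms: int = 30_000,
                        attempts: int = 8, factor: float = 2.0) -> list:
    """Capped exponential backoff delays (doubling is an assumed policy)."""
    delays = []
    delay = float(initial_ms)
    for _ in range(attempts):
        # Never wait longer than the configured ceiling.
        delays.append(int(min(delay, max_ms)))
        delay *= factor
    return delays


schedule = backoff_schedule_ms()
```

Under the assumed doubling, the delay reaches the 30s ceiling on the seventh attempt and stays there.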

Where to look first when something's slow

- Slow Discover? Inspect galaxy.get_hierarchy span duration and galaxy.object_count. The gw walks the Galaxy DB serially; slow Discovers usually mean slow ZB SQL.
- Subscribe pile-up? galaxy.subscribe_bulk span duration correlates with galaxy.tag_count. If duration ÷ tag_count starts climbing, the gw worker is probably under apartment-lock pressure.
- Events stalled? Watch galaxy.events.received. A flat line means the gw stream is wedged; kick the reconnect supervisor by forcing a ReinitializeAsync.
- Dropped events? Non-zero galaxy.events.dropped means a slow downstream consumer. Profile OnDataChange handlers in DriverNodeManager before bumping the channel capacity.
- Memory growing? Confirm with the soak scenario's working-set leak guard. Likely culprits: lingering subscription handles in SubscriptionRegistry, or a downstream consumer retaining DataValueSnapshot references past their useful life.

Scripted-alarm engine — known hot-path allocations

ScriptedAlarmEngine.BuildReadCache allocates a fresh Dictionary<string, DataValueSnapshot> and AlarmPredicateContext on every predicate evaluation — i.e. once per upstream tag change per referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. (Core.ScriptedAlarms-009)

The allocations are deliberate for now: predicate evaluation is already serialised under _evalGate, so a single reused scratch dictionary would be safe, but the per-call dictionary keeps the evaluation surface immutable and trivially safe against future refactors. If a future scripted-alarm soak surfaces allocation pressure on this path, the mitigation is a per-alarm scratch buffer cleared between evaluations. Record that decision in this document before changing the engine.
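The trade-off between the two approaches can be sketched as follows. This is a Python illustration of the idea only, not the engine's C# code: `evaluate_fresh`, `ScratchEvaluator`, and the tag names are hypothetical, and plain floats stand in for DataValueSnapshot.

```python
def evaluate_fresh(tags, read, predicate) -> bool:
    # Current behaviour as described above: a new dictionary per evaluation.
    cache = {tag: read(tag) for tag in tags}
    return predicate(cache)


class ScratchEvaluator:
    """Mitigation sketch: one reusable buffer per alarm, cleared between runs.

    Safe only while evaluations are serialised and the predicate does not
    keep a reference to the cache after it returns.
    """

    def __init__(self) -> None:
        self._scratch = {}

    def evaluate(self, tags, read, predicate) -> bool:
        self._scratch.clear()  # reuse the same dict instead of allocating
        for tag in tags:
            self._scratch[tag] = read(tag)
        return predicate(self._scratch)


# Hypothetical tag values, for illustration only.
reads = {"Tank.Level": 91.0, "Tank.Limit": 90.0}
predicate = lambda cache: cache["Tank.Level"] > cache["Tank.Limit"]

fresh_result = evaluate_fresh(reads.keys(), reads.get, predicate)
scratch = ScratchEvaluator()
scratch_result = scratch.evaluate(reads.keys(), reads.get, predicate)
```

Both paths return the same result. The difference is the hazard the per-call dictionary avoids: with the scratch buffer, any code that holds on to the cache past the evaluation will see it change on the next run.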
