Core.ScriptedAlarms-009 resolution: replace the per-call Dictionary +
AlarmPredicateContext allocation with a per-alarm reusable AlarmScratch
held in _scratchByAlarmId, refilled in place under _evalGate on each
evaluation. The hot path no longer allocates per upstream tag change.

Why this matters:
On a busy line where many tags feeding many alarms change frequently,
the old BuildReadCache allocated a fresh dictionary + context on every
predicate evaluation, producing a steady stream of short-lived
allocations the GC eventually has to reclaim. With the reuse, the
dictionary and context are allocated once per alarm (on first
evaluation) and refilled in place across every subsequent re-eval.

Implementation:
- New private AlarmScratch class holds the reusable
  Dictionary<string, DataValueSnapshot> read cache (pre-sized to the
  alarm's Inputs.Count) and the AlarmPredicateContext that wraps it by
  reference. The context observes refilled values without being
  re-created.
- ConcurrentDictionary<string, AlarmScratch> _scratchByAlarmId on the
  engine, cleared in LoadAsync alongside _alarms so a config-publish
  drops the prior generation's scratch (Inputs / Logger may change).
- EvaluatePredicateToStateAsync looks up scratch via GetOrAdd, calls
  the new RefillReadCache(Dictionary, IReadOnlySet) helper to clear +
  repopulate the dictionary in place, then runs the predicate against
  the reused context.
- BuildReadCache removed.

Safety:
Reuse is serialised under _evalGate, which guarantees no two threads
ever observe the same scratch in a half-refilled state. The
AlarmPredicateContext is bound to the scratch dictionary by reference,
so the predicate's ctx.GetTag(path) sees the freshly-refilled values
rather than a stale snapshot.

Verification:
- All 66 ScriptedAlarms tests pass (was 63; the three new regression
  tests lock the reuse contract).
- All 56 VirtualTags tests still pass (unchanged).
- All 104 Core.Scripting tests still pass (unchanged).

New tests in ScriptedAlarmEngineTests:
- Reevaluation_reuses_the_same_read_cache_dictionary: asserts
  ReferenceEquals(scratch_before, scratch_after) across two
  evaluations of the same alarm.
- Reevaluation_reuses_the_same_predicate_context: same, for the
  context.
- LoadAsync_drops_the_prior_generations_scratch: asserts a config
  publish wipes the prior scratch (so a stale Logger / Inputs can't
  leak into the new generation).

Internal test hooks TryGetScratchReadCacheForTest /
TryGetScratchContextForTest added via the existing
InternalsVisibleTo for the tests project. Kept internal; not part of
the public engine surface.

Docs:
- docs/v2/Galaxy.Performance.md "Scripted-alarm engine" section
  rewritten as "hot-path allocation reuse" documenting the new
  contract + reuse safety reasoning + the three regression tests.
- code-reviews/Core.ScriptedAlarms/findings.md -009 flipped
  Won't Fix → Resolved.
- code-reviews/README.md regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Galaxy backend performance

This document covers the performance surface of the in-process
`GalaxyDriver` (the v2 mxgw backend): the ActivitySource it emits spans
on, the metrics on its EventPump, the soak scenario that validates it,
and the tuning knobs you can reach for when the dev parity rig surfaces
a hot spot.

## Tracing surface (PR 6.1)

The driver emits spans on the `ZB.MOM.WW.OtOpcUa.Driver.Galaxy`
ActivitySource. The driver has no package dependency on OpenTelemetry;
the host process picks the listener (OTLP exporter, dotnet-trace,
Application Insights). When the host uses OpenTelemetry, wire the source
in via `TracerProviderBuilder.AddSource(...)` (namespace
`OpenTelemetry.Trace`) in the host's tracing pipeline.

| Span | Source | Tags |
|------|--------|------|
| `galaxy.subscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count`, `galaxy.buffered_interval_ms`, `galaxy.success_count` |
| `galaxy.unsubscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count` |
| `galaxy.stream_events` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.event_count` (set on stream end) |
| `galaxy.write` | `TracedGalaxyDataWriter` | `galaxy.client`, `galaxy.tag_count`, `galaxy.secured_write_count`, `galaxy.success_count` |
| `galaxy.get_hierarchy` | `TracedGalaxyHierarchySource` | `galaxy.client`, `galaxy.object_count` |

The stream-events span deliberately covers the *entire* stream lifetime
rather than emitting per-event spans: at 50k tags / 1Hz the per-event
volume would dominate the trace pipeline. Per-event visibility flows
through the metrics surface instead.

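Because the spans come from a plain `System.Diagnostics.ActivitySource`, a host can observe them without any OpenTelemetry package. The following is a minimal sketch rather than the driver's own code: it registers an `ActivityListener` for the source name above and then emits one stand-in `galaxy.subscribe_bulk` span so the listener has something to report. The stand-in source and the tag value are assumptions made for illustration.

```csharp
using System;
using System.Diagnostics;

const string SourceName = "ZB.MOM.WW.OtOpcUa.Driver.Galaxy";
var stoppedSpans = 0;

// Listen only to the Galaxy driver's source and record every span in full.
using var listener = new ActivityListener
{
    ShouldListenTo = source => source.Name == SourceName,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = activity =>
    {
        stoppedSpans++;
        var tagCount = activity.GetTagItem("galaxy.tag_count");
        Console.WriteLine(
            $"{activity.OperationName}: {activity.Duration.TotalMilliseconds:F1} ms, galaxy.tag_count={tagCount}");
    },
};
ActivitySource.AddActivityListener(listener);

// Stand-in for the driver (assumption): a real host would not create this source itself.
using var standIn = new ActivitySource(SourceName);
using (var span = standIn.StartActivity("galaxy.subscribe_bulk"))
{
    span?.SetTag("galaxy.tag_count", 3); // made-up value
}
```

The listener callback runs synchronously when the span is disposed, so keep it cheap; anything heavier belongs in an exporter.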
## Metrics surface (PR 6.2)

`EventPump` publishes three counters on the
`ZB.MOM.WW.OtOpcUa.Driver.Galaxy` meter, each tagged with
`galaxy.client` so multi-driver hosts can split by source:

| Counter | Unit | Meaning |
|---------|------|---------|
| `galaxy.events.received` | `{event}` | MxEvents read from the gateway StreamEvents stream |
| `galaxy.events.dispatched` | `{event}` | MxEvents that made it through the bounded channel into `OnDataChange` |
| `galaxy.events.dropped` | `{event}` | MxEvents discarded because the bounded channel was full (newest-dropped) |

The invariant is `received = dispatched + dropped + (in-flight in the
channel)`. Watch the dropped counter: it is the leading indicator of
listener back-pressure. A non-zero dropped rate means a downstream
consumer (DriverNodeManager → UA notification queue → client) is
slower than the gw event stream; investigate that consumer before
raising `EventPump` channel capacity.

### Bounded channel design

The pump runs two background tasks:

1. **Producer**: reads from `IGalaxySubscriber.StreamEventsAsync`,
   increments `events.received`, and `TryWrite`s into a bounded
   `Channel<MxEvent>`. When the channel is full, the producer counts
   the drop and continues reading the gw stream so back-pressure does
   not propagate upstream (which would stall the gw worker and cascade
   to *all* driver instances sharing that worker).
2. **Consumer**: reads from the channel, fans out via
   `SubscriptionRegistry`, and increments `events.dispatched`.

Default channel capacity is 50_000 (one second of headroom at 50k
tags / 1Hz). Override it via the `EventPump` constructor's
`channelCapacity` parameter; the public-facing wiring path in
`GalaxyDriver.EnsureEventPumpStarted` does not yet expose this through
`GalaxyDriverOptions` because no parity scenario has needed it. Add the
option when soak data calls for it.

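The drop-on-full behaviour described above can be sketched with `System.Threading.Channels`. This is an illustration under stated assumptions, not the `EventPump` source: events are plain `int`s, the capacity is 4, and the producer and consumer run one after the other instead of as two background tasks, so the counts are deterministic.

```csharp
using System;
using System.Threading.Channels;

// FullMode.Wait makes TryWrite return false on a full channel instead of evicting an item,
// which gives the "newest-dropped" behaviour: the event that did not fit is the one lost.
var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(4)
{
    FullMode = BoundedChannelFullMode.Wait,
    SingleReader = true,
    SingleWriter = true,
});

long received = 0, dispatched = 0, dropped = 0;

// Producer: never awaits on a full channel, so the upstream stream keeps draining.
for (var i = 0; i < 10; i++)
{
    received++;
    if (!channel.Writer.TryWrite(i))
    {
        dropped++;
    }
}
channel.Writer.Complete();

// Consumer: drains whatever made it into the channel.
var lastDispatched = -1;
await foreach (var evt in channel.Reader.ReadAllAsync())
{
    lastDispatched = evt;
    dispatched++;
}

// prints: received=10 dispatched=4 dropped=6 last=3
Console.WriteLine($"received={received} dispatched={dispatched} dropped={dropped} last={lastDispatched}");
```

Once the channel is drained nothing is in flight, so `received = dispatched + dropped` holds exactly, which is the invariant from the metrics section.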
## Buffered update interval (PR 6.3)

`MxAccess.PublishingIntervalMs` (default 1000) flows through both
subscribe paths:

- `GalaxyDriver.SubscribeAsync`: the caller's `publishingInterval`
  wins when non-zero (the server's UA subscription publishingInterval
  drives this in production). When the caller passes
  `TimeSpan.Zero`, the configured option is the fallback.
- `PerPlatformProbeWatcher`: the watcher passes the configured value
  through `SubscribeBulkAsync` so probe `ScanState` changes publish at
  the deployment's chosen cadence.

A session-level `SetBufferedUpdateInterval` RPC exists in the gw
protocol but the .NET client doesn't expose a typed helper yet, so
adjusting an existing subscription's interval mid-flight is a
follow-up. Today's path subscribes once at the right interval, which
covers the common case.

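The fallback rule for `GalaxyDriver.SubscribeAsync` amounts to a single conditional. A minimal sketch, assuming a hypothetical helper named `ResolvePublishingInterval` (the driver's actual method is not shown in this document):

```csharp
using System;

// Hypothetical helper: the caller's interval wins when non-zero,
// otherwise the configured MxAccess.PublishingIntervalMs value is used.
static TimeSpan ResolvePublishingInterval(TimeSpan callerInterval, int configuredMs) =>
    callerInterval != TimeSpan.Zero
        ? callerInterval
        : TimeSpan.FromMilliseconds(configuredMs);

var fromCaller = ResolvePublishingInterval(TimeSpan.FromMilliseconds(250), 1000);
var fromConfig = ResolvePublishingInterval(TimeSpan.Zero, 1000);

Console.WriteLine(fromCaller.TotalMilliseconds); // 250
Console.WriteLine(fromConfig.TotalMilliseconds); // 1000
```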
## Soak scenario (PR 6.4)

`SoakScenarioTests.Soak_HoldsSubscription_AndKeepsEventStreamFlowing`
in `Driver.Galaxy.ParityTests` is the long-running validation. It
subscribes a configurable tag count (default 50_000), holds the
subscription for a configurable duration (default 24h), polls the
three counters every minute, and asserts:

- `events.received` continues to grow (gw stream isn't stuck)
- `events.dropped / events.received` stays under the configured
  ceiling (default 0.5%)
- process working-set doesn't grow more than 1 GB above baseline
  (leak guard)

The test is always skipped unless the operator opts in:

```bash
# Full 24h × 50k soak (production validation)
OTOPCUA_SOAK_RUN=1 dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/

# Compressed CI-friendly run (10min × 1k tags, 1% drop ceiling)
OTOPCUA_SOAK_RUN=1 OTOPCUA_SOAK_MINUTES=10 OTOPCUA_SOAK_TAGS=1000 \
  OTOPCUA_SOAK_DROP_PCT=1.0 \
  dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
```

The scenario writes a per-minute CSV-style row to stdout
(`soak,<minutes>,received=…,dispatched=…,dropped=…,ws_mb=…`) so an
operator can grep the test runner output mid-run.

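The three per-minute assertions reduce to simple arithmetic on two consecutive counter snapshots. The sketch below is an illustration only: the snapshot values and the baseline are made up, and the real test's types and method names are not shown in this document.

```csharp
using System;

// Hypothetical snapshots; field names mirror the stdout row, values are invented.
var previous = (Received: 1_000_000L, Dropped: 1_000L, WorkingSetMb: 800.0);
var current = (Received: 1_050_000L, Dropped: 1_200L, WorkingSetMb: 950.0);

const double baselineWorkingSetMb = 700.0; // assumed baseline taken at soak start
const double dropCeilingPct = 0.5;         // default ceiling from the text
const double maxGrowthMb = 1024.0;         // 1 GB leak guard

// 1. The gw stream is still delivering events.
bool streamAlive = current.Received > previous.Received;

// 2. The cumulative drop ratio stays under the ceiling.
double dropPct = 100.0 * current.Dropped / current.Received;
bool dropsOk = dropPct < dropCeilingPct;

// 3. Working set has not grown more than 1 GB above the baseline.
bool memoryOk = current.WorkingSetMb - baselineWorkingSetMb <= maxGrowthMb;

Console.WriteLine($"alive={streamAlive} dropsOk={dropsOk} memoryOk={memoryOk}");
```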
## Tuned defaults (PR 6.5)

| Option | Default | Change | Notes |
|--------|---------|--------|-------|
| `Gateway.ConnectTimeoutSeconds` | 10 | unchanged | Cold-start network paths fit comfortably; soak never observed >2s |
| `Gateway.DefaultCallTimeoutSeconds` | 30 | **bumped from 5** in PR 6.5 | A 50k-tag `SubscribeBulk` can exceed 5s under MxAccess COM apartment lock contention; 30s leaves headroom while still failing fast on a wedged worker |
| `Gateway.StreamTimeoutSeconds` | 0 (unlimited) | unchanged | The stream must run for the lifetime of the driver |
| `MxAccess.PublishingIntervalMs` | 1000 | unchanged | Matches the legacy `LMXProxyServer` cadence; deployments needing tighter health visibility can dial down |
| `Reconnect.InitialBackoffMs` | 500 | unchanged | First retry shouldn't dogpile a recovering gw |
| `Reconnect.MaxBackoffMs` | 30_000 | unchanged | 30s ceiling so a long-down gw doesn't sit in 5+ min backoff |
| `Repository.DiscoverPageSize` | 5000 | unchanged | One Galaxy page round-trip per ~5k objects; soak has not surfaced pressure |
| `EventPump` channel capacity | 50_000 | unchanged | One second of headroom at 50k tags / 1Hz |

The unchanged rows are not "definitely correct"; they are "no live
data argues for changing them." Re-run the soak scenario after every
substantive driver change, and revise this table when the data calls
for it.

## Where to look first when something's slow

1. **Slow `Discover`?** Inspect `galaxy.get_hierarchy` span duration
   and `galaxy.object_count`. The gw walks the Galaxy DB serially;
   slow Discovers usually mean slow ZB SQL queries.
2. **Subscribe pile-up?** `galaxy.subscribe_bulk` span duration
   correlates with `galaxy.tag_count`. If duration ÷ tag_count starts
   climbing, the gw worker is probably under apartment-lock pressure.
3. **Events stalled?** Watch `galaxy.events.received`. A flat line
   means the gw stream is wedged; kick the reconnect supervisor by
   forcing a `ReinitializeAsync`.
4. **Dropped events?** Non-zero `galaxy.events.dropped` means a slow
   downstream consumer. Profile `OnDataChange` handlers in
   `DriverNodeManager` before bumping the channel capacity.
5. **Memory growing?** Confirm with the soak scenario's working-set
   leak guard. Likely culprits: lingering subscription handles in
   `SubscriptionRegistry`, or a downstream consumer retaining
   `DataValueSnapshot` references past their useful life.

## Scripted-alarm engine: hot-path allocation reuse

`ScriptedAlarmEngine` keeps a per-alarm reusable evaluation scratch in `_scratchByAlarmId`: the read-cache `Dictionary<string, DataValueSnapshot>` and the `AlarmPredicateContext` are allocated once per alarm (on first evaluation) and refilled in place across every subsequent predicate evaluation. The hot path no longer allocates a fresh dictionary + context per upstream tag change. (Core.ScriptedAlarms-009)

Safety: reuse is serialised under `_evalGate`, so two threads can never observe the same scratch in a half-refilled state. The context wraps the read-cache by reference, so the predicate's `ctx.GetTag(path)` calls observe the refilled values without the context being re-created. `LoadAsync` clears `_scratchByAlarmId` alongside `_alarms` so a config-publish drops the prior generation's scratch (a new generation may carry different `Inputs` / `Logger`).

Regression tests in `ScriptedAlarmEngineTests` lock the reuse contract:

- `Reevaluation_reuses_the_same_read_cache_dictionary`: asserts dictionary instance identity across two evaluations.
- `Reevaluation_reuses_the_same_predicate_context`: same, for the context.
- `LoadAsync_drops_the_prior_generations_scratch`: asserts a publish resets the scratch.
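The reuse pattern can be sketched in a few lines. This is a simplified illustration, not the engine's code: tag values are plain `double`s where the engine uses `DataValueSnapshot`, a `lock` stands in for `_evalGate` (whose real type is not stated here), and the `Evaluate` helper, alarm id, and tag paths are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var scratchByAlarmId = new ConcurrentDictionary<string, AlarmScratch>();
var evalGate = new object();
var inputs = new[] { "Line1.Temp", "Line1.Pressure" };

// Evaluates one alarm against the current upstream values, reusing its scratch.
bool Evaluate(string alarmId, IReadOnlyDictionary<string, double> upstream,
              Func<AlarmPredicateContext, bool> predicate)
{
    lock (evalGate) // serialises refills so no caller sees a half-filled cache
    {
        var scratch = scratchByAlarmId.GetOrAdd(alarmId, _ => new AlarmScratch(inputs.Length));
        scratch.ReadCache.Clear(); // refill in place; the instance is kept
        foreach (var path in inputs)
        {
            scratch.ReadCache[path] = upstream[path];
        }
        return predicate(scratch.Context);
    }
}

var first = Evaluate("hi-temp",
    new Dictionary<string, double> { ["Line1.Temp"] = 70, ["Line1.Pressure"] = 2 },
    ctx => ctx.GetTag("Line1.Temp") > 80);
var before = scratchByAlarmId["hi-temp"];

var second = Evaluate("hi-temp",
    new Dictionary<string, double> { ["Line1.Temp"] = 95, ["Line1.Pressure"] = 2 },
    ctx => ctx.GetTag("Line1.Temp") > 80);
var after = scratchByAlarmId["hi-temp"];

// prints: first=False second=True reused=True
Console.WriteLine($"first={first} second={second} reused={object.ReferenceEquals(before, after)}");

// Context bound to the cache by reference: it sees refilled values without being re-created.
sealed class AlarmPredicateContext
{
    private readonly Dictionary<string, double> _readCache;
    public AlarmPredicateContext(Dictionary<string, double> readCache) => _readCache = readCache;
    public double GetTag(string path) => _readCache[path];
}

// Allocated once per alarm; the dictionary is pre-sized to the input count.
sealed class AlarmScratch
{
    public AlarmScratch(int inputCount)
    {
        ReadCache = new Dictionary<string, double>(inputCount);
        Context = new AlarmPredicateContext(ReadCache);
    }

    public Dictionary<string, double> ReadCache { get; }
    public AlarmPredicateContext Context { get; }
}
```

Dropping a generation then comes down to calling `scratchByAlarmId.Clear()`, after which the next evaluation allocates a fresh scratch through `GetOrAdd`.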