# Galaxy backend performance

This document covers the performance surface of the in-process
`GalaxyDriver` (the v2 mxgw backend) — the ActivitySource it emits, the
metrics on its EventPump, the soak scenario that validates it, and the
tuning knobs you can reach for when the dev parity rig surfaces a hot
spot.

## Tracing surface (PR 6.1)

The driver emits spans on the `ZB.MOM.WW.OtOpcUa.Driver.Galaxy`
ActivitySource. It takes no package dependency on OpenTelemetry; the
host process picks the listener (OTLP exporter, dotnet-trace,
Application Insights). To export through OpenTelemetry, register the
source name with `TracerProviderBuilder.AddSource(...)` (namespace
`OpenTelemetry.Trace`) in the host's tracing pipeline.

| Span | Source | Tags |
|------|--------|------|
| `galaxy.subscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count`, `galaxy.buffered_interval_ms`, `galaxy.success_count` |
| `galaxy.unsubscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count` |
| `galaxy.stream_events` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.event_count` (set on stream end) |
| `galaxy.write` | `TracedGalaxyDataWriter` | `galaxy.client`, `galaxy.tag_count`, `galaxy.secured_write_count`, `galaxy.success_count` |
| `galaxy.get_hierarchy` | `TracedGalaxyHierarchySource` | `galaxy.client`, `galaxy.object_count` |

The stream-events span deliberately covers the *entire* stream lifetime
rather than per-event spans — at 50k tags / 1Hz the per-event volume
would dominate the trace pipeline. Per-event visibility flows through
the metrics surface instead.
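
Because the driver only depends on `System.Diagnostics`, a host can also observe these spans without OpenTelemetry at all. The following is a minimal sketch, not the project's wiring: `GalaxySpanListener` and its `sink` callback are hypothetical names, and only the source name comes from this document.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper: attaches a plain ActivityListener to the Galaxy source.
public static class GalaxySpanListener
{
    // Source name as documented above.
    public const string SourceName = "ZB.MOM.WW.OtOpcUa.Driver.Galaxy";

    // Reports "name duration tags" to the sink for every span that stops.
    // Dispose the returned listener to detach it.
    public static ActivityListener Attach(Action<string> sink)
    {
        var listener = new ActivityListener
        {
            // Ignore every other ActivitySource in the process.
            ShouldListenTo = source => source.Name == SourceName,
            // AllData ensures tags set by the driver are recorded.
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
            ActivityStopped = activity =>
            {
                var tags = new List<string>();
                foreach (var tag in activity.TagObjects)
                    tags.Add($"{tag.Key}={tag.Value}");
                sink($"{activity.OperationName} " +
                     $"{activity.Duration.TotalMilliseconds:F1}ms " +
                     string.Join(" ", tags));
            }
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }
}
```

Calling `GalaxySpanListener.Attach(Console.WriteLine)` at host start-up is enough to see span names, durations, and tags on stdout during a dev parity run.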

## Metrics surface (PR 6.2)

`EventPump` publishes three counters on the
`ZB.MOM.WW.OtOpcUa.Driver.Galaxy` meter, each tagged with
`galaxy.client` so multi-driver hosts can split by source:

| Counter | Unit | Meaning |
|---------|------|---------|
| `galaxy.events.received` | `{event}` | MxEvents read from the gateway StreamEvents stream |
| `galaxy.events.dispatched` | `{event}` | MxEvents that made it through the bounded channel into `OnDataChange` |
| `galaxy.events.dropped` | `{event}` | MxEvents discarded because the bounded channel was full (newest-dropped) |

The invariant is `received = dispatched + dropped + (in-flight in the
channel)`. Watch the dropped counter — it is the leading indicator of
listener back-pressure. A non-zero dropped rate means a downstream
consumer (DriverNodeManager → UA notification queue → client) is
slower than the gw event stream; investigate that consumer before
raising `EventPump` channel capacity.
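
To watch the counters without a metrics exporter, a `MeterListener` can accumulate them in-process. This is a sketch under assumptions: `GalaxyCounterWatcher` is a hypothetical name, the counters are assumed to be `Counter<long>`, and totals are summed across all `galaxy.client` tag values rather than split per client.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics.Metrics;

// Hypothetical helper: accumulates the three EventPump counters in-process.
public sealed class GalaxyCounterWatcher : IDisposable
{
    // Meter name as documented above.
    public const string MeterName = "ZB.MOM.WW.OtOpcUa.Driver.Galaxy";

    private readonly MeterListener _listener = new();
    private readonly ConcurrentDictionary<string, long> _totals = new();

    public GalaxyCounterWatcher()
    {
        // Subscribe only to instruments published on the Galaxy meter.
        _listener.InstrumentPublished = (instrument, listener) =>
        {
            if (instrument.Meter.Name == MeterName)
                listener.EnableMeasurementEvents(instrument);
        };
        // ASSUMPTION: the counters are Counter<long>; other types are ignored.
        _listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) =>
                _totals.AddOrUpdate(instrument.Name, value, (_, old) => old + value));
        _listener.Start();
    }

    public long Total(string counterName) =>
        _totals.TryGetValue(counterName, out var v) ? v : 0;

    // Fraction of received events that were dropped; 0 when nothing was received.
    public double DropRatio()
    {
        long received = Total("galaxy.events.received");
        return received == 0
            ? 0.0
            : (double)Total("galaxy.events.dropped") / received;
    }

    // Per the invariant: received - dispatched - dropped = in-flight in the channel.
    public long InFlight() =>
        Total("galaxy.events.received")
        - Total("galaxy.events.dispatched")
        - Total("galaxy.events.dropped");

    public void Dispose() => _listener.Dispose();
}
```

A steadily rising `DropRatio()` while `InFlight()` sits near the channel capacity would be consistent with the slow-consumer case described above.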

### Bounded channel design

The pump runs two background tasks:

1. **Producer** — reads from `IGalaxySubscriber.StreamEventsAsync`,
   increments `events.received`, and `TryWrite`s into a bounded
   `Channel<MxEvent>`. When the channel is full, the producer counts
   the drop and continues reading the gw stream so back-pressure does
   not propagate upstream (which would stall the gw worker and cascade
   to *all* driver instances sharing that worker).
2. **Consumer** — reads from the channel, fans out via
   `SubscriptionRegistry`, increments `events.dispatched`.

Default channel capacity is 50_000, which is one second of headroom at
50k tags / 1Hz (50,000 events per second). Override it via the
`EventPump` constructor's `channelCapacity` parameter. The
public-facing wiring path in `GalaxyDriver.EnsureEventPumpStarted` does
not yet expose this through `GalaxyDriverOptions` because no parity
scenario has needed it. Add the option when soak data shows a need for
it.
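
The producer/consumer split above can be sketched with `System.Threading.Channels`. This is an illustration of the drop-newest pattern, not the actual `EventPump` source: `DropNewestPump<T>` and its members are hypothetical names. The sketch relies on `BoundedChannelFullMode.Wait`, under which `TryWrite` returns `false` on a full channel, so the producer can count the drop and keep reading.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for the pump; T plays the role of MxEvent.
public sealed class DropNewestPump<T>
{
    private readonly Channel<T> _channel;
    private long _received, _dispatched, _dropped;

    public DropNewestPump(int capacity)
    {
        _channel = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            // In Wait mode TryWrite fails (returns false) when the channel is full.
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = true,
            SingleWriter = true,
        });
    }

    public long Received => Interlocked.Read(ref _received);
    public long Dispatched => Interlocked.Read(ref _dispatched);
    public long Dropped => Interlocked.Read(ref _dropped);

    // Producer: never waits for channel space, so the source stream is never stalled.
    public async Task ProduceAsync(IAsyncEnumerable<T> source, CancellationToken ct = default)
    {
        try
        {
            await foreach (var item in source.WithCancellation(ct))
            {
                Interlocked.Increment(ref _received);
                if (!_channel.Writer.TryWrite(item))
                    Interlocked.Increment(ref _dropped); // newest event is discarded
            }
        }
        finally
        {
            // Let the consumer drain what is buffered and then finish.
            _channel.Writer.TryComplete();
        }
    }

    // Consumer: hands each buffered event to the callback, then counts it.
    public async Task ConsumeAsync(Action<T> onDataChange, CancellationToken ct = default)
    {
        await foreach (var item in _channel.Reader.ReadAllAsync(ct))
        {
            onDataChange(item);
            Interlocked.Increment(ref _dispatched);
        }
    }
}
```

Once both tasks have finished, `Received == Dispatched + Dropped` holds; while they run, the difference is whatever is still buffered in the channel.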

## Buffered update interval (PR 6.3)

`MxAccess.PublishingIntervalMs` (default 1000) flows through both
subscribe paths:

- `GalaxyDriver.SubscribeAsync` — the caller's `publishingInterval`
  wins when non-zero (the server's UA subscription publishingInterval
  drives this in production). When the caller passes
  `TimeSpan.Zero`, the configured option is the fallback.
- `PerPlatformProbeWatcher` — the watcher passes the configured value
  through `SubscribeBulkAsync` so probe `ScanState` changes publish at
  the deployment's chosen cadence.
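
The precedence rule for `GalaxyDriver.SubscribeAsync` reduces to a one-line fallback. The helper below is a hypothetical illustration (the name `BufferedInterval.Resolve` is not from the codebase); only the rule itself and the 1000 ms default come from this document.

```csharp
using System;

// Hypothetical helper illustrating the interval precedence rule.
public static class BufferedInterval
{
    // The caller's interval wins when non-zero; otherwise fall back to the
    // configured MxAccess.PublishingIntervalMs (default 1000).
    public static TimeSpan Resolve(TimeSpan callerInterval, int configuredPublishingIntervalMs = 1000) =>
        callerInterval != TimeSpan.Zero
            ? callerInterval
            : TimeSpan.FromMilliseconds(configuredPublishingIntervalMs);
}
```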

A session-level `SetBufferedUpdateInterval` RPC exists in the gw
protocol but the .NET client doesn't expose a typed helper yet —
adjusting an existing subscription's interval mid-flight is a
follow-up. Today's path subscribes once at the right interval, which
covers the common case.

## Soak scenario (PR 6.4)

`SoakScenarioTests.Soak_HoldsSubscription_AndKeepsEventStreamFlowing`
in `Driver.Galaxy.ParityTests` is the long-running validation. It
subscribes a configurable tag count (default 50_000), holds the
subscription for a configurable duration (default 24h), polls the
three counters every minute, and asserts:

- `events.received` continues to grow (gw stream isn't stuck)
- `events.dropped / events.received` stays under the configured
  ceiling (default 0.5%)
- process working-set doesn't grow more than 1 GB above baseline
  (leak guard)
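
The three assertions can be expressed as a pure check over successive samples. This is a sketch, not the test's actual code: `SoakSample` and `SoakChecks` are hypothetical names, the drop ratio is assumed to be computed on cumulative totals, the ceiling is assumed to be inclusive, and "1 GB" is taken as 2^30 bytes.

```csharp
using System.Collections.Generic;

// Hypothetical snapshot of the values polled each minute.
public sealed record SoakSample(long Received, long Dropped, long WorkingSetBytes);

public static class SoakChecks
{
    // Returns one message per violated assertion; an empty list means the poll passed.
    public static List<string> Evaluate(
        SoakSample baseline, SoakSample previous, SoakSample current,
        double dropCeilingPct = 0.5, long maxGrowthBytes = 1L << 30)
    {
        var failures = new List<string>();

        // 1. The gateway stream must still be delivering events.
        if (current.Received <= previous.Received)
            failures.Add("events.received stopped growing");

        // 2. Cumulative drop percentage must not exceed the ceiling (ASSUMPTION: inclusive).
        double dropPct = current.Received == 0
            ? 0.0
            : 100.0 * current.Dropped / current.Received;
        if (dropPct > dropCeilingPct)
            failures.Add($"drop ratio {dropPct:F3}% exceeds {dropCeilingPct}%");

        // 3. Leak guard: working set must stay within the allowed growth over baseline.
        long growth = current.WorkingSetBytes - baseline.WorkingSetBytes;
        if (growth > maxGrowthBytes)
            failures.Add($"working set grew by {growth} bytes");

        return failures;
    }
}
```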

The scenario is always skipped unless the operator opts in:

```bash
# Full 24h × 50k soak (production validation)
OTOPCUA_SOAK_RUN=1 dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/

# Compressed CI-friendly run (10min × 1k tags, 1% drop ceiling)
OTOPCUA_SOAK_RUN=1 OTOPCUA_SOAK_MINUTES=10 OTOPCUA_SOAK_TAGS=1000 \
  OTOPCUA_SOAK_DROP_PCT=1.0 \
  dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
```

The scenario writes a per-minute CSV-style row to stdout
(`soak,<minutes>,received=…,dispatched=…,dropped=…,ws_mb=…`) so an
operator can grep the test runner output mid-run.
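
For post-processing captured runner output, the row format is simple enough to parse directly. The parser below is a hypothetical sketch (`SoakRow` is not part of the test project); it assumes every value is numeric and formatted with the invariant culture, which this document does not state.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical parser for "soak,<minutes>,key=value,..." rows.
public static class SoakRow
{
    // Returns false for any line that is not a well-formed soak row.
    public static bool TryParse(string line, out Dictionary<string, double> fields)
    {
        fields = new Dictionary<string, double>();
        var parts = line.Trim().Split(',');
        if (parts.Length < 2 || parts[0] != "soak")
            return false;

        // Second column is the elapsed minutes.
        if (!double.TryParse(parts[1], NumberStyles.Float,
                CultureInfo.InvariantCulture, out var minutes))
            return false;
        fields["minutes"] = minutes;

        // Remaining columns are key=value pairs (received, dispatched, dropped, ws_mb).
        for (int i = 2; i < parts.Length; i++)
        {
            var kv = parts[i].Split('=', 2);
            if (kv.Length != 2 ||
                !double.TryParse(kv[1], NumberStyles.Float,
                    CultureInfo.InvariantCulture, out var value))
                return false;
            fields[kv[0]] = value;
        }
        return true;
    }
}
```

From a parsed row, `received - dispatched - dropped` gives the number of events in flight in the channel at that poll.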

## Tuned defaults (PR 6.5)

| Option | Default | Source | Notes |
|--------|---------|--------|-------|
| `Gateway.ConnectTimeoutSeconds` | 10 | unchanged | Cold-start network paths fit comfortably; soak never observed >2s |
| `Gateway.DefaultCallTimeoutSeconds` | 30 | **bumped from 5** in PR 6.5 | A 50k-tag `SubscribeBulk` can exceed 5s under MxAccess COM apartment lock contention; 30s leaves headroom while still failing fast on a wedged worker |
| `Gateway.StreamTimeoutSeconds` | 0 (unlimited) | unchanged | The stream must run for the lifetime of the driver |
| `MxAccess.PublishingIntervalMs` | 1000 | unchanged | Matches the legacy `LMXProxyServer` cadence; deployments needing tighter health visibility can dial down |
| `Reconnect.InitialBackoffMs` | 500 | unchanged | First retry shouldn't dogpile a recovering gw |
| `Reconnect.MaxBackoffMs` | 30_000 | unchanged | 30s ceiling so a long-down gw doesn't sit in 5+ min backoff |
| `Repository.DiscoverPageSize` | 5000 | unchanged | One Galaxy page round-trip per ~5k objects; soak has not surfaced pressure |
| `EventPump` channel capacity | 50_000 | unchanged | One second of headroom at 50k tags / 1Hz |

The unchanged rows are not "definitely correct"; they are "no live
data argues for changing them." Re-run the soak scenario after every
substantive driver change, and revise this table when the data argues
for a change.

## Where to look first when something's slow

1. **Slow `Discover`?** Inspect `galaxy.get_hierarchy` span duration
   and `galaxy.object_count`. The gw walks the Galaxy DB serially;
   slow Discovers usually mean a slow ZB SQL.
2. **Subscribe pile-up?** `galaxy.subscribe_bulk` span duration
   correlates with `galaxy.tag_count`. If duration ÷ tag_count starts
   climbing, the gw worker is probably under apartment-lock pressure.
3. **Events stalled?** Watch `galaxy.events.received`. Flat-lined
   means the gw stream is wedged — kick the reconnect supervisor by
   forcing a `ReinitializeAsync`.
4. **Dropped events?** Non-zero `galaxy.events.dropped` means a slow
   downstream consumer. Profile `OnDataChange` handlers in
   `DriverNodeManager` before bumping the channel capacity.
5. **Memory growing?** Confirm with the soak scenario's working-set
   leak guard. Likely culprits: lingering subscription handles in
   `SubscriptionRegistry`, or a downstream consumer retaining
   `DataValueSnapshot` references past their useful life.

## Scripted-alarm engine — hot-path allocation reuse

`ScriptedAlarmEngine` keeps a reusable evaluation scratch per alarm in `_scratchByAlarmId`. The read-cache `Dictionary<string, DataValueSnapshot>` and the `AlarmPredicateContext` are allocated once per alarm (on first evaluation) and refilled in place on every subsequent predicate evaluation, so the hot path no longer allocates a fresh dictionary and context per upstream tag change. (Core.ScriptedAlarms-009)

Safety: reuse is serialised under `_evalGate`, so two threads can never observe the same scratch in a half-refilled state. The context wraps the read-cache by reference, so refilling the dictionary is what the predicate's `ctx.GetTag(path)` calls observe. `LoadAsync` clears `_scratchByAlarmId` alongside `_alarms` so a config-publish drops the prior generation's scratch (a new generation may carry different `Inputs` / `Logger`).

Regression tests in `ScriptedAlarmEngineTests` lock the reuse contract:
- `Reevaluation_reuses_the_same_read_cache_dictionary` — asserts dictionary instance identity across two evaluations.
- `Reevaluation_reuses_the_same_predicate_context` — same, for the context.
- `LoadAsync_drops_the_prior_generations_scratch` — asserts a publish resets the scratch.
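
The reuse pattern can be shown in a simplified form. This is a sketch, not the engine's code: `ScratchReuseSketch`, its nested `Context`, and `Load` are hypothetical stand-ins, tag values are plain `double`s instead of `DataValueSnapshot`, and a `lock` stands in for `_evalGate` (whose actual type is not stated here).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified model of per-alarm scratch reuse.
public sealed class ScratchReuseSketch
{
    // Stand-in for AlarmPredicateContext: holds the read-cache by reference.
    public sealed class Context
    {
        private readonly Dictionary<string, double> _cache;
        public Context(Dictionary<string, double> cache) => _cache = cache;

        // Returns NaN for tags that are not in the cache.
        public double GetTag(string path) =>
            _cache.TryGetValue(path, out var v) ? v : double.NaN;
    }

    private sealed class Scratch
    {
        public readonly Dictionary<string, double> ReadCache = new();
        public readonly Context Ctx;
        public Scratch() => Ctx = new Context(ReadCache);
    }

    private readonly object _evalGate = new();
    private readonly Dictionary<string, Scratch> _scratchByAlarmId = new();

    public bool Evaluate(string alarmId,
        IReadOnlyDictionary<string, double> inputs, Func<Context, bool> predicate)
    {
        lock (_evalGate) // serialises refill and evaluation
        {
            // Allocate the scratch once per alarm, on first evaluation.
            if (!_scratchByAlarmId.TryGetValue(alarmId, out var scratch))
            {
                scratch = new Scratch();
                _scratchByAlarmId[alarmId] = scratch;
            }

            // Refill in place; the context sees the new values through its reference.
            scratch.ReadCache.Clear();
            foreach (var kv in inputs)
                scratch.ReadCache[kv.Key] = kv.Value;

            return predicate(scratch.Ctx);
        }
    }

    // Models LoadAsync: a new generation drops all prior scratch.
    public void Load()
    {
        lock (_evalGate) _scratchByAlarmId.Clear();
    }
}
```

Two evaluations of the same alarm hand the predicate the same `Context` instance; after `Load()`, the next evaluation allocates a new one, which mirrors what the three regression tests assert.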