diff --git a/docs/v2/Galaxy.Performance.md b/docs/v2/Galaxy.Performance.md
new file mode 100644
index 0000000..33ef145
--- /dev/null
+++ b/docs/v2/Galaxy.Performance.md
@@ -0,0 +1,152 @@
# Galaxy backend performance

This document covers the performance surface of the in-process
`GalaxyDriver` (the v2 mxgw backend) — the ActivitySource it emits, the
metrics on its EventPump, the soak scenario that validates it, and the
tuning knobs you can reach for when the dev parity rig surfaces a hot
spot.

## Tracing surface (PR 6.1)

The driver emits spans on the `ZB.MOM.WW.OtOpcUa.Driver.Galaxy`
ActivitySource. It takes no package dependency on OpenTelemetry — the
host process picks the listener (OTLP exporter, dotnet-trace,
Application Insights). Wire it via `OpenTelemetry.Trace.AddSource(...)`
in the host's tracing pipeline.

| Span | Source | Tags |
|------|--------|------|
| `galaxy.subscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count`, `galaxy.buffered_interval_ms`, `galaxy.success_count` |
| `galaxy.unsubscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count` |
| `galaxy.stream_events` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.event_count` (set on stream end) |
| `galaxy.write` | `TracedGalaxyDataWriter` | `galaxy.client`, `galaxy.tag_count`, `galaxy.secured_write_count`, `galaxy.success_count` |
| `galaxy.get_hierarchy` | `TracedGalaxyHierarchySource` | `galaxy.client`, `galaxy.object_count` |

The stream-events span deliberately covers the *entire* stream lifetime
rather than emitting per-event spans — at 50k tags / 1Hz the per-event
volume would dominate the trace pipeline. Per-event visibility flows
through the metrics surface instead.

## Metrics surface (PR 6.2)

`EventPump` publishes three counters on the
`ZB.MOM.WW.OtOpcUa.Driver.Galaxy` meter, each tagged with
`galaxy.client` so multi-driver hosts can split by source:

| Counter | Unit | Meaning |
|---------|------|---------|
| `galaxy.events.received` | `{event}` | MxEvents read from the gateway StreamEvents stream |
| `galaxy.events.dispatched` | `{event}` | MxEvents that made it through the bounded channel into `OnDataChange` |
| `galaxy.events.dropped` | `{event}` | MxEvents discarded because the bounded channel was full (newest-dropped) |

The invariant is `received = dispatched + dropped + (in-flight in the
channel)`. Watch the dropped counter — it is the leading indicator of
listener back-pressure. A non-zero dropped rate means a downstream
consumer (DriverNodeManager → UA notification queue → client) is
slower than the gw event stream; investigate that consumer before
raising the `EventPump` channel capacity.

### Bounded channel design

The pump runs two background tasks (sketched below):

1. **Producer** — reads from `IGalaxySubscriber.StreamEventsAsync`,
   increments `events.received`, and `TryWrite`s into a bounded
   `Channel`. When the channel is full, the producer counts the drop
   and continues reading the gw stream so back-pressure does not
   propagate upstream (which would stall the gw worker and cascade to
   *all* driver instances sharing that worker).
2. **Consumer** — reads from the channel, fans out via
   `SubscriptionRegistry`, increments `events.dispatched`.
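
A minimal sketch of that producer/consumer pair, using
`System.Threading.Channels` and `System.Diagnostics.Metrics`. The
counter names, the `galaxy.client` tag, the drop-newest policy, and
the 50_000 default come from this section; `MxEvent`, the member
names, and the constructor shape are placeholders rather than the real
`EventPump` source:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Illustrative sketch only — MxEvent and the member names are placeholders.
sealed class EventPumpSketch
{
    static readonly Meter Meter = new("ZB.MOM.WW.OtOpcUa.Driver.Galaxy");
    readonly Counter<long> _received   = Meter.CreateCounter<long>("galaxy.events.received", unit: "{event}");
    readonly Counter<long> _dispatched = Meter.CreateCounter<long>("galaxy.events.dispatched", unit: "{event}");
    readonly Counter<long> _dropped    = Meter.CreateCounter<long>("galaxy.events.dropped", unit: "{event}");

    readonly Channel<MxEvent> _channel;
    readonly KeyValuePair<string, object?> _clientTag;

    public EventPumpSketch(string clientName, int channelCapacity = 50_000)
    {
        _clientTag = new("galaxy.client", clientName);
        _channel = Channel.CreateBounded<MxEvent>(new BoundedChannelOptions(channelCapacity)
        {
            SingleWriter = true,   // one producer task
            SingleReader = true,   // one consumer task
        });
    }

    // Producer: never awaits the channel, so a full channel cannot stall the gw stream.
    async Task ProduceAsync(IAsyncEnumerable<MxEvent> gwStream, CancellationToken ct)
    {
        await foreach (var evt in gwStream.WithCancellation(ct))
        {
            _received.Add(1, _clientTag);
            if (!_channel.Writer.TryWrite(evt))
                _dropped.Add(1, _clientTag);   // channel full: drop the newest event, keep reading
        }
        _channel.Writer.Complete();
    }

    // Consumer: drains the channel and fans out to subscription callbacks.
    async Task ConsumeAsync(Action<MxEvent> onDataChange, CancellationToken ct)
    {
        await foreach (var evt in _channel.Reader.ReadAllAsync(ct))
        {
            onDataChange(evt);
            _dispatched.Add(1, _clientTag);
        }
    }
}

// Placeholder for the real gateway event payload type.
sealed record MxEvent;
```

With `FullMode` left at its default (`Wait`), `TryWrite` simply
returns `false` when the channel is full, which is what makes the
newest-dropped accounting a one-line check in the producer.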

Default channel capacity is 50_000 (one second of headroom at 50k
tags / 1Hz). Override it via the `EventPump` constructor's
`channelCapacity` parameter; the public-facing wiring path in
`GalaxyDriver.EnsureEventPumpStarted` does not yet expose this through
`GalaxyDriverOptions` because no parity scenario has needed it. Add it
when soak data does.

## Buffered update interval (PR 6.3)

`MxAccess.PublishingIntervalMs` (default 1000 ms) flows through both
subscribe paths:

- `GalaxyDriver.SubscribeAsync` — the caller's `publishingInterval`
  wins when non-zero (the server's UA subscription publishingInterval
  drives this in production). When the caller passes
  `TimeSpan.Zero`, the configured option is the fallback.
- `PerPlatformProbeWatcher` — the watcher passes the configured value
  through `SubscribeBulkAsync` so probe `ScanState` changes publish at
  the deployment's chosen cadence.

A session-level `SetBufferedUpdateInterval` RPC exists in the gw
protocol, but the .NET client doesn't expose a typed helper yet —
adjusting an existing subscription's interval mid-flight is a
follow-up. Today's path subscribes once at the right interval, which
covers the common case.

## Soak scenario (PR 6.4)

`SoakScenarioTests.Soak_HoldsSubscription_AndKeepsEventStreamFlowing`
in `Driver.Galaxy.ParityTests` is the long-running validation. It
subscribes a configurable tag count (default 50_000), holds the
subscription for a configurable duration (default 24h), polls the
three counters every minute, and asserts:

- `events.received` continues to grow (gw stream isn't stuck)
- `events.dropped / events.received` stays under the configured
  ceiling (default 0.5%)
- process working-set doesn't grow more than 1 GB above baseline
  (leak guard)

The test is always skipped unless the operator opts in:

```bash
# Full 24h × 50k soak (production validation)
OTOPCUA_SOAK_RUN=1 dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/

# Compressed CI-friendly run (10min × 1k tags, 1% drop ceiling)
OTOPCUA_SOAK_RUN=1 OTOPCUA_SOAK_MINUTES=10 OTOPCUA_SOAK_TAGS=1000 \
  OTOPCUA_SOAK_DROP_PCT=1.0 \
  dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
```

The scenario writes a per-minute CSV-style row to stdout
(`soak,,received=…,dispatched=…,dropped=…,ws_mb=…`) so an
operator can grep the test runner output mid-run.
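
The shape of that per-minute loop, as a minimal sketch using the
compressed-run values from the second command above. Reading the
counters through a `MeterListener`, the variable names, and plain
exceptions in place of the test framework's assertions are
illustrative assumptions; the second field of the printed row (elided
as `,,` above) is stubbed here with a minute index:

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading;
using System.Threading.Tasks;

// Running totals of the three EventPump counters, fed by a MeterListener.
long received = 0, dispatched = 0, dropped = 0;

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "ZB.MOM.WW.OtOpcUa.Driver.Galaxy")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, delta, tags, state) =>
{
    switch (instrument.Name)
    {
        case "galaxy.events.received":   Interlocked.Add(ref received, delta);   break;
        case "galaxy.events.dispatched": Interlocked.Add(ref dispatched, delta); break;
        case "galaxy.events.dropped":    Interlocked.Add(ref dropped, delta);    break;
    }
});
listener.Start();

int soakMinutes = 10;                      // OTOPCUA_SOAK_MINUTES
double dropCeiling = 0.01;                 // OTOPCUA_SOAK_DROP_PCT = 1.0
long baselineWs = Environment.WorkingSet;
long lastReceived = 0;

for (var minute = 1; minute <= soakMinutes; minute++)
{
    await Task.Delay(TimeSpan.FromMinutes(1));

    long r = Interlocked.Read(ref received);
    long d = Interlocked.Read(ref dropped);
    long wsMb = Environment.WorkingSet / (1024 * 1024);
    Console.WriteLine(
        $"soak,{minute},received={r},dispatched={Interlocked.Read(ref dispatched)},dropped={d},ws_mb={wsMb}");

    // The three soak invariants.
    if (r <= lastReceived)
        throw new Exception("gw stream stalled: received stopped growing");
    if (r > 0 && (double)d / r > dropCeiling)
        throw new Exception("drop ratio above the configured ceiling");
    if (Environment.WorkingSet - baselineWs > 1L << 30)
        throw new Exception("working set grew more than 1 GB above baseline");

    lastReceived = r;
}
```
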
+ +## Tuned defaults (PR 6.5) + +| Option | Default | Source | Notes | +|--------|---------|--------|-------| +| `Gateway.ConnectTimeoutSeconds` | 10 | unchanged | Cold-start network paths fit comfortably; soak never observed >2s | +| `Gateway.DefaultCallTimeoutSeconds` | 30 | **bumped from 5** in PR 6.5 | A 50k-tag `SubscribeBulk` can exceed 5s under MxAccess COM apartment lock contention; 30s leaves headroom while still failing fast on a wedged worker | +| `Gateway.StreamTimeoutSeconds` | 0 (unlimited) | unchanged | The stream must run for the lifetime of the driver | +| `MxAccess.PublishingIntervalMs` | 1000 | unchanged | Matches the legacy `LMXProxyServer` cadence; deployments needing tighter health visibility can dial down | +| `Reconnect.InitialBackoffMs` | 500 | unchanged | First retry shouldn't dogpile a recovering gw | +| `Reconnect.MaxBackoffMs` | 30_000 | unchanged | 30s ceiling so a long-down gw doesn't sit in 5+ min backoff | +| `Repository.DiscoverPageSize` | 5000 | unchanged | One Galaxy page round-trip per ~5k objects; soak hadn't surfaced pressure | +| `EventPump` channel capacity | 50_000 | unchanged | One second of headroom at 50k tags / 1Hz | + +The unchanged rows are not "definitely correct" — they are "no live +data argues for changing them." Re-run the soak scenario after every +substantive driver change, and revise this table when the data does. + +## Where to look first when something's slow + +1. **Slow `Discover`?** Inspect `galaxy.get_hierarchy` span duration + and `galaxy.object_count`. The gw walks the Galaxy DB serially; + slow Discovers usually mean a slow ZB SQL. +2. **Subscribe pile-up?** `galaxy.subscribe_bulk` span duration + correlates with `galaxy.tag_count`. If duration ÷ tag_count starts + climbing, the gw worker is probably under apartment-lock pressure. +3. **Events stalled?** Watch `galaxy.events.received`. Flat-lined + means the gw stream is wedged — kick the reconnect supervisor by + forcing a `ReinitializeAsync`. +4. **Dropped events?** Non-zero `galaxy.events.dropped` means a slow + downstream consumer. Profile `OnDataChange` handlers in + `DriverNodeManager` before bumping the channel capacity. +5. **Memory growing?** Confirm with the soak scenario's working-set + leak guard. Likely culprits: lingering subscription handles in + `SubscriptionRegistry`, or a downstream consumer retaining + `DataValueSnapshot` references past their useful life.
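
Every check above reads either a `galaxy.*` span or a
`galaxy.events.*` counter, so both surfaces need to be wired into the
host's exporter before something goes slow. A minimal wiring sketch —
the `Sdk.CreateTracerProviderBuilder` / `Sdk.CreateMeterProviderBuilder`
bootstrap and the OTLP exporter are assumptions about the host (any
OpenTelemetry-compatible pipeline works); only the source/meter name
comes from the driver:

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

// Subscribe to the driver's ActivitySource (spans) and Meter (counters).
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("ZB.MOM.WW.OtOpcUa.Driver.Galaxy")   // galaxy.* spans
    .AddOtlpExporter()
    .Build();

using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("ZB.MOM.WW.OtOpcUa.Driver.Galaxy")    // galaxy.events.* counters
    .AddOtlpExporter()
    .Build();
```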