# Galaxy backend performance

This document covers the performance surface of the in-process `GalaxyDriver` (the v2 mxgw backend) — the ActivitySource it emits, the metrics on its EventPump, the soak scenario that validates it, and the tuning knobs you can reach for when the dev parity rig surfaces a hot spot.

## Tracing surface (PR 6.1)

The driver emits spans on the `ZB.MOM.WW.OtOpcUa.Driver.Galaxy` ActivitySource. No package dependency on OpenTelemetry — the host process picks the listener (OTLP exporter, dotnet-trace, Application Insights). Wire it via `OpenTelemetry.Trace.AddSource(...)` in the host's tracing pipeline.

| Span | Source | Tags |
|------|--------|------|
| `galaxy.subscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count`, `galaxy.buffered_interval_ms`, `galaxy.success_count` |
| `galaxy.unsubscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count` |
| `galaxy.stream_events` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.event_count` (set on stream end) |
| `galaxy.write` | `TracedGalaxyDataWriter` | `galaxy.client`, `galaxy.tag_count`, `galaxy.secured_write_count`, `galaxy.success_count` |
| `galaxy.get_hierarchy` | `TracedGalaxyHierarchySource` | `galaxy.client`, `galaxy.object_count` |

The stream-events span deliberately covers the *entire* stream lifetime rather than per-event spans — at 50k tags / 1Hz the per-event volume would dominate the trace pipeline. Per-event visibility flows through the metrics surface instead.
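A minimal sketch of the host-side wiring, assuming the host uses the OpenTelemetry .NET SDK; the OTLP exporter here is just one example listener, and only the source name comes from this document:

```csharp
using OpenTelemetry;
using OpenTelemetry.Trace;

// The host process opts in to the driver's spans; the driver itself carries
// no OpenTelemetry package dependency.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("ZB.MOM.WW.OtOpcUa.Driver.Galaxy") // the driver's ActivitySource
    .AddOtlpExporter()                            // or dotnet-trace / App Insights instead
    .Build();
```

Swapping the exporter is the host's choice; nothing in the driver changes either way.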
## Metrics surface (PR 6.2)

`EventPump` publishes three counters on the `ZB.MOM.WW.OtOpcUa.Driver.Galaxy` meter, each tagged with `galaxy.client` so multi-driver hosts can split by source:

| Counter | Unit | Meaning |
|---------|------|---------|
| `galaxy.events.received` | `{event}` | MxEvents read from the gateway StreamEvents stream |
| `galaxy.events.dispatched` | `{event}` | MxEvents that made it through the bounded channel into `OnDataChange` |
| `galaxy.events.dropped` | `{event}` | MxEvents discarded because the bounded channel was full (newest-dropped) |

The invariant is `received = dispatched + dropped + (in-flight in the channel)`. Watch the dropped counter — it is the leading indicator of listener back-pressure. A non-zero dropped rate means a downstream consumer (DriverNodeManager → UA notification queue → client) is slower than the gw event stream; investigate that consumer before raising `EventPump` channel capacity.

### Bounded channel design

The pump runs two background tasks:

1. **Producer** — reads from `IGalaxySubscriber.StreamEventsAsync`, increments `events.received`, and `TryWrite`s into a bounded `Channel`. When the channel is full, the producer counts the drop and continues reading the gw stream so back-pressure does not propagate upstream (which would stall the gw worker and cascade to *all* driver instances sharing that worker).
2. **Consumer** — reads from the channel, fans out via `SubscriptionRegistry`, increments `events.dispatched`.

Default channel capacity is 50_000 (one second of headroom at 50k tags / 1Hz). Override via the `EventPump` constructor's `channelCapacity` parameter; the public-facing wiring path in `GalaxyDriver.EnsureEventPumpStarted` does not yet expose this through `GalaxyDriverOptions` because no parity scenario has needed it. Add it when soak data does.
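A simplified sketch of the producer half, built on `System.Threading.Channels` and `System.Diagnostics.Metrics`; `MxEvent` and the parameter names are stand-ins, not the real `EventPump` internals:

```csharp
using System.Diagnostics.Metrics;
using System.Threading.Channels;

sealed record MxEvent; // placeholder for the gateway event type

static class PumpSketch
{
    // Reads the gw stream, counts every event, and drops (rather than blocks)
    // when the bounded channel is full, so back-pressure stays local.
    public static async Task ProduceAsync(
        IAsyncEnumerable<MxEvent> stream,   // e.g. from IGalaxySubscriber.StreamEventsAsync
        ChannelWriter<MxEvent> writer,      // bounded channel, default FullMode (Wait)
        Counter<long> received,
        Counter<long> dropped,
        CancellationToken ct)
    {
        await foreach (var evt in stream.WithCancellation(ct))
        {
            received.Add(1);
            // TryWrite never blocks: on a full bounded channel it returns
            // false, so the gw stream keeps draining while the drop is counted.
            if (!writer.TryWrite(evt))
                dropped.Add(1);
        }
        writer.Complete();
    }
}
```

The design choice worth noting: the default `BoundedChannelFullMode.Wait` plus `TryWrite` makes the drop explicit and countable, whereas the channel's own drop modes would discard silently.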
## Buffered update interval (PR 6.3)

`MxAccess.PublishingIntervalMs` (default 1000) flows through both subscribe paths:

- `GalaxyDriver.SubscribeAsync` — the caller's `publishingInterval` wins when non-zero (the server's UA subscription publishingInterval drives this in production). When the caller passes `TimeSpan.Zero`, the configured option is the fallback.
- `PerPlatformProbeWatcher` — the watcher passes the configured value through `SubscribeBulkAsync` so probe `ScanState` changes publish at the deployment's chosen cadence.

A session-level `SetBufferedUpdateInterval` RPC exists in the gw protocol but the .NET client doesn't expose a typed helper yet — adjusting an existing subscription's interval mid-flight is a follow-up. Today's path subscribes once at the right interval, which covers the common case.

## Soak scenario (PR 6.4)

`SoakScenarioTests.Soak_HoldsSubscription_AndKeepsEventStreamFlowing` in `Driver.Galaxy.ParityTests` is the long-running validation. It subscribes to a configurable tag count (default 50_000), holds the subscription for a configurable duration (default 24h), polls the three counters every minute, and asserts:

- `events.received` continues to grow (the gw stream isn't stuck)
- `events.dropped / events.received` stays under the configured ceiling (default 0.5%)
- process working-set doesn't grow more than 1 GB above baseline (leak guard)

The scenario is always skipped unless the operator opts in:

```bash
# Full 24h × 50k soak (production validation)
OTOPCUA_SOAK_RUN=1 dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/

# Compressed CI-friendly run (10min × 1k tags, 1% drop ceiling)
OTOPCUA_SOAK_RUN=1 OTOPCUA_SOAK_MINUTES=10 OTOPCUA_SOAK_TAGS=1000 \
OTOPCUA_SOAK_DROP_PCT=1.0 \
dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
```

The scenario writes a per-minute CSV-style row to stdout (`soak,,received=…,dispatched=…,dropped=…,ws_mb=…`) so an operator can grep the test runner output mid-run.
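An in-process harness that polls the three counters each minute can observe them with a `MeterListener` from `System.Diagnostics.Metrics`. A sketch under that assumption; the accumulation logic is illustrative, not the soak test's actual code:

```csharp
using System.Collections.Concurrent;
using System.Diagnostics.Metrics;

// Running totals per counter name, e.g. "galaxy.events.received".
var totals = new ConcurrentDictionary<string, long>();

var listener = new MeterListener
{
    // Subscribe only to the driver's meter; ignore everything else in-process.
    InstrumentPublished = (instrument, l) =>
    {
        if (instrument.Meter.Name == "ZB.MOM.WW.OtOpcUa.Driver.Galaxy")
            l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
    totals.AddOrUpdate(instrument.Name, value, (_, current) => current + value));
listener.Start();

// Once a minute the harness can read totals["galaxy.events.received"],
// totals["galaxy.events.dispatched"], totals["galaxy.events.dropped"]
// and evaluate the drop-ratio and growth assertions.
```

This keeps the polling entirely in-process, so the soak run needs no exporter or external metrics backend.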
## Tuned defaults (PR 6.5)

| Option | Default | Source | Notes |
|--------|---------|--------|-------|
| `Gateway.ConnectTimeoutSeconds` | 10 | unchanged | Cold-start network paths fit comfortably; soak never observed >2s |
| `Gateway.DefaultCallTimeoutSeconds` | 30 | **bumped from 5** in PR 6.5 | A 50k-tag `SubscribeBulk` can exceed 5s under MxAccess COM apartment lock contention; 30s leaves headroom while still failing fast on a wedged worker |
| `Gateway.StreamTimeoutSeconds` | 0 (unlimited) | unchanged | The stream must run for the lifetime of the driver |
| `MxAccess.PublishingIntervalMs` | 1000 | unchanged | Matches the legacy `LMXProxyServer` cadence; deployments needing tighter health visibility can dial down |
| `Reconnect.InitialBackoffMs` | 500 | unchanged | First retry shouldn't dogpile a recovering gw |
| `Reconnect.MaxBackoffMs` | 30_000 | unchanged | 30s ceiling so a long-down gw doesn't sit in 5+ min backoff |
| `Repository.DiscoverPageSize` | 5000 | unchanged | One Galaxy page round-trip per ~5k objects; soak hasn't surfaced pressure |
| `EventPump` channel capacity | 50_000 | unchanged | One second of headroom at 50k tags / 1Hz |

The unchanged rows are not "definitely correct" — they are "no live data argues for changing them." Re-run the soak scenario after every substantive driver change, and revise this table when the data does.

## Where to look first when something's slow

1. **Slow `Discover`?** Inspect `galaxy.get_hierarchy` span duration and `galaxy.object_count`. The gw walks the Galaxy DB serially; slow Discovers usually mean a slow ZB SQL.
2. **Subscribe pile-up?** `galaxy.subscribe_bulk` span duration correlates with `galaxy.tag_count`. If duration ÷ tag_count starts climbing, the gw worker is probably under apartment-lock pressure.
3. **Events stalled?** Watch `galaxy.events.received`. A flat-lined counter means the gw stream is wedged — kick the reconnect supervisor by forcing a `ReinitializeAsync`.
4. **Dropped events?** Non-zero `galaxy.events.dropped` means a slow downstream consumer. Profile `OnDataChange` handlers in `DriverNodeManager` before bumping the channel capacity.
5. **Memory growing?** Confirm with the soak scenario's working-set leak guard. Likely culprits: lingering subscription handles in `SubscriptionRegistry`, or a downstream consumer retaining `DataValueSnapshot` references past their useful life.
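When working through steps 3 and 4, the three counters also give a backlog estimate by pure arithmetic from the invariant stated in the metrics section; the helper name is illustrative:

```csharp
// received = dispatched + dropped + in-flight (the metrics-surface invariant),
// so the current channel backlog falls out by subtraction.
static long ChannelBacklog(long received, long dispatched, long dropped) =>
    received - dispatched - dropped;

// A backlog pinned near capacity (default 50_000) alongside a rising dropped
// counter implicates the downstream consumer; a flat received counter
// implicates the gw stream itself.
```

This is cheap enough to compute from any metrics backend that exposes the raw counter values.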