PR 6.W — Galaxy.Performance.md

Documents the perf surfaces shipped in Phase 6:

- Tracing surface (PR 6.1) — table of every span the driver emits +
  rationale for stream-level (not per-event) coverage.
- Metrics surface (PR 6.2) — three EventPump counters, tagging scheme,
  the bounded-channel design, and the received = dispatched + dropped +
  in-flight invariant.
- Buffered update interval (PR 6.3) — how MxAccess.PublishingIntervalMs
  flows through both subscribe paths and what's still pending on the gw
  side (typed SetBufferedUpdateInterval helper).
- Soak scenario (PR 6.4) — env-var-gated 24h × 50k validation with the
  CI-compressed override recipe.
- Tuned defaults (PR 6.5) — table of every default with source + notes;
  rows marked "unchanged" carry the explicit "no live data argues for
  changing this" caveat.

Closes with a "where to look first when something's slow" runbook
section so on-call doesn't have to re-derive the trace+metric
correlation map from primary docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Galaxy backend performance

This document covers the performance surface of the in-process
`GalaxyDriver` (the v2 mxgw backend) — the ActivitySource it emits, the
metrics on its EventPump, the soak scenario that validates it, and the
tuning knobs you can reach for when the dev parity rig surfaces a hot
spot.
## Tracing surface (PR 6.1)

The driver emits spans on the `ZB.MOM.WW.OtOpcUa.Driver.Galaxy`
ActivitySource. There is no package dependency on OpenTelemetry — the
host process picks the listener (OTLP exporter, dotnet-trace,
Application Insights). Wire it via `OpenTelemetry.Trace.AddSource(...)`
in the host's tracing pipeline.
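
A minimal host-side wiring sketch (assumes the standard OpenTelemetry
.NET SDK packages; the OTLP exporter is just one listener choice):

```csharp
using OpenTelemetry;
using OpenTelemetry.Trace;

// Subscribe the host's tracer provider to the driver's ActivitySource.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("ZB.MOM.WW.OtOpcUa.Driver.Galaxy")
    .AddOtlpExporter() // OpenTelemetry.Exporter.OpenTelemetryProtocol package
    .Build();
```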

| Span | Source | Tags |
|------|--------|------|
| `galaxy.subscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count`, `galaxy.buffered_interval_ms`, `galaxy.success_count` |
| `galaxy.unsubscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count` |
| `galaxy.stream_events` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.event_count` (set on stream end) |
| `galaxy.write` | `TracedGalaxyDataWriter` | `galaxy.client`, `galaxy.tag_count`, `galaxy.secured_write_count`, `galaxy.success_count` |
| `galaxy.get_hierarchy` | `TracedGalaxyHierarchySource` | `galaxy.client`, `galaxy.object_count` |

The stream-events span deliberately covers the *entire* stream lifetime
rather than per-event spans — at 50k tags / 1Hz the per-event volume
would dominate the trace pipeline. Per-event visibility flows through
the metrics surface instead.
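
The shape of that stream span, as a sketch (not the
`TracedGalaxySubscriber` source; the wrapper name and parameters are
illustrative):

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class StreamTracing
{
    private static readonly ActivitySource Source =
        new("ZB.MOM.WW.OtOpcUa.Driver.Galaxy");

    // One span for the whole stream lifetime; per-event work only bumps
    // a local count, recorded as a tag when the stream ends.
    public static async IAsyncEnumerable<T> Traced<T>(
        IAsyncEnumerable<T> inner,
        string clientName,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        using var activity = Source.StartActivity("galaxy.stream_events");
        activity?.SetTag("galaxy.client", clientName);
        long count = 0;
        await foreach (var item in inner.WithCancellation(ct))
        {
            count++;
            yield return item;
        }
        activity?.SetTag("galaxy.event_count", count);
    }
}
```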
## Metrics surface (PR 6.2)

`EventPump` publishes three counters on the
`ZB.MOM.WW.OtOpcUa.Driver.Galaxy` meter, each tagged with
`galaxy.client` so multi-driver hosts can split by source:

| Counter | Unit | Meaning |
|---------|------|---------|
| `galaxy.events.received` | `{event}` | MxEvents read from the gateway StreamEvents stream |
| `galaxy.events.dispatched` | `{event}` | MxEvents that made it through the bounded channel into `OnDataChange` |
| `galaxy.events.dropped` | `{event}` | MxEvents discarded because the bounded channel was full (newest-dropped) |
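
The counters follow the standard `System.Diagnostics.Metrics` shape.
A sketch of their production side (names and unit from the table above;
the client tag value is illustrative):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("ZB.MOM.WW.OtOpcUa.Driver.Galaxy");
var received   = meter.CreateCounter<long>("galaxy.events.received",   unit: "{event}");
var dispatched = meter.CreateCounter<long>("galaxy.events.dispatched", unit: "{event}");
var dropped    = meter.CreateCounter<long>("galaxy.events.dropped",    unit: "{event}");

// Every increment carries the galaxy.client tag so multi-driver hosts
// can split the series by source.
var clientTag = new KeyValuePair<string, object?>("galaxy.client", "galaxy-main");
received.Add(1, clientTag);
```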

The invariant is `received = dispatched + dropped + (in-flight in the
channel)`. Watch the dropped counter — it is the leading indicator of
listener back-pressure. A non-zero dropped rate means a downstream
consumer (DriverNodeManager → UA notification queue → client) is
slower than the gw event stream; investigate that consumer before
raising `EventPump` channel capacity.
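
For an in-process watch on that leading indicator (most hosts would
export via OpenTelemetry instead; the alert action here is a
placeholder):

```csharp
using System.Diagnostics.Metrics;

var listener = new MeterListener
{
    InstrumentPublished = (instrument, l) =>
    {
        // Only subscribe to the driver's dropped-events counter.
        if (instrument.Meter.Name == "ZB.MOM.WW.OtOpcUa.Driver.Galaxy" &&
            instrument.Name == "galaxy.events.dropped")
            l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    if (value > 0)
        Console.WriteLine($"dropped {value} event(s): check downstream consumers");
});
listener.Start();
```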
### Bounded channel design

The pump runs two background tasks (sketched after the list):

1. **Producer** — reads from `IGalaxySubscriber.StreamEventsAsync`,
   increments `events.received`, and `TryWrite`s into a bounded
   `Channel<MxEvent>`. When the channel is full, the producer counts
   the drop and continues reading the gw stream so back-pressure does
   not propagate upstream (which would stall the gw worker and cascade
   to *all* driver instances sharing that worker).
2. **Consumer** — reads from the channel, fans out via
   `SubscriptionRegistry`, increments `events.dispatched`.
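
A sketch of that shape (reusing the counters and `clientTag` from the
metrics sketch above; `MxEvent` is the driver's event type, and
`registry.Dispatch` stands in for the `SubscriptionRegistry` fan-out):

```csharp
using System.Threading.Channels;

var channel = Channel.CreateBounded<MxEvent>(new BoundedChannelOptions(50_000)
{
    SingleReader = true, // one consumer task
    SingleWriter = true, // one producer task
});

// Producer: never blocks on a full channel; it counts the drop instead,
// so back-pressure cannot stall the shared gw stream.
async Task ProduceAsync(IAsyncEnumerable<MxEvent> stream, CancellationToken ct)
{
    await foreach (var evt in stream.WithCancellation(ct))
    {
        received.Add(1, clientTag);
        if (!channel.Writer.TryWrite(evt))
            dropped.Add(1, clientTag); // channel full: newest event discarded
    }
}

// Consumer: drains the channel and fans out to registered listeners.
async Task ConsumeAsync(CancellationToken ct)
{
    await foreach (var evt in channel.Reader.ReadAllAsync(ct))
    {
        registry.Dispatch(evt); // SubscriptionRegistry fan-out (assumed call)
        dispatched.Add(1, clientTag);
    }
}
```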

Default channel capacity is 50_000 (one second of headroom at 50k
tags / 1Hz). Override via the `EventPump` constructor's
`channelCapacity` parameter; the public-facing wiring path in
`GalaxyDriver.EnsureEventPumpStarted` does not yet expose this through
`GalaxyDriverOptions` because no parity scenario has needed it. Add it
when soak data does.
## Buffered update interval (PR 6.3)

`MxAccess.PublishingIntervalMs` (default 1000) flows through both
subscribe paths:

- `GalaxyDriver.SubscribeAsync` — the caller's `publishingInterval`
  wins when non-zero (the server's UA subscription publishingInterval
  drives this in production). When the caller passes `TimeSpan.Zero`,
  the configured option is the fallback (see the sketch after this
  list).
- `PerPlatformProbeWatcher` — the watcher passes the configured value
  through `SubscribeBulkAsync` so probe `ScanState` changes publish at
  the deployment's chosen cadence.
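
The precedence rule, as a one-liner sketch (hypothetical helper name;
the real logic lives inside `GalaxyDriver.SubscribeAsync`):

```csharp
// Caller's interval wins when non-zero; otherwise fall back to the
// configured MxAccess.PublishingIntervalMs.
static TimeSpan EffectiveInterval(TimeSpan callerInterval, GalaxyDriverOptions options) =>
    callerInterval > TimeSpan.Zero
        ? callerInterval
        : TimeSpan.FromMilliseconds(options.MxAccess.PublishingIntervalMs);
```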

A session-level `SetBufferedUpdateInterval` RPC exists in the gw
protocol, but the .NET client doesn't expose a typed helper yet —
adjusting an existing subscription's interval mid-flight is a
follow-up. Today's path subscribes once at the right interval, which
covers the common case.
## Soak scenario (PR 6.4)

`SoakScenarioTests.Soak_HoldsSubscription_AndKeepsEventStreamFlowing`
in `Driver.Galaxy.ParityTests` is the long-running validation. It
subscribes a configurable tag count (default 50_000), holds the
subscription for a configurable duration (default 24h), polls the
three counters every minute, and asserts:

- `events.received` continues to grow (gw stream isn't stuck)
- `events.dropped / events.received` stays under the configured
  ceiling (default 0.5%)
- process working-set doesn't grow more than 1 GB above baseline
  (leak guard; see the sketch after this list)
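
The leak guard, sketched (variable names are illustrative; the real
assertion lives in the test):

```csharp
using System.Diagnostics;

const long OneGiB = 1L * 1024 * 1024 * 1024;

// Baseline before the subscription is established.
long baselineWs = Process.GetCurrentProcess().WorkingSet64;

// ... hold the subscription, polling counters once a minute ...

long growth = Process.GetCurrentProcess().WorkingSet64 - baselineWs;
if (growth > OneGiB)
    throw new InvalidOperationException(
        $"working set grew {growth / (1024 * 1024)} MB above baseline");
```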

The test is always skipped unless the operator opts in:

```bash
# Full 24h × 50k soak (production validation)
OTOPCUA_SOAK_RUN=1 dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/

# Compressed CI-friendly run (10min × 1k tags, 1% drop ceiling)
OTOPCUA_SOAK_RUN=1 OTOPCUA_SOAK_MINUTES=10 OTOPCUA_SOAK_TAGS=1000 \
  OTOPCUA_SOAK_DROP_PCT=1.0 \
  dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
```

The scenario writes a per-minute CSV-style row to stdout
(`soak,<minutes>,received=…,dispatched=…,dropped=…,ws_mb=…`) so an
operator can grep the test runner output mid-run.
## Tuned defaults (PR 6.5)

| Option | Default | Source | Notes |
|--------|---------|--------|-------|
| `Gateway.ConnectTimeoutSeconds` | 10 | unchanged | Cold-start network paths fit comfortably; soak never observed >2s |
| `Gateway.DefaultCallTimeoutSeconds` | 30 | **bumped from 5** in PR 6.5 | A 50k-tag `SubscribeBulk` can exceed 5s under MxAccess COM apartment lock contention; 30s leaves headroom while still failing fast on a wedged worker |
| `Gateway.StreamTimeoutSeconds` | 0 (unlimited) | unchanged | The stream must run for the lifetime of the driver |
| `MxAccess.PublishingIntervalMs` | 1000 | unchanged | Matches the legacy `LMXProxyServer` cadence; deployments needing tighter health visibility can dial down |
| `Reconnect.InitialBackoffMs` | 500 | unchanged | First retry shouldn't dogpile a recovering gw |
| `Reconnect.MaxBackoffMs` | 30_000 | unchanged | 30s ceiling so a long-down gw doesn't sit in 5+ min backoff |
| `Repository.DiscoverPageSize` | 5000 | unchanged | One Galaxy page round-trip per ~5k objects; soak hasn't surfaced pressure |
| `EventPump` channel capacity | 50_000 | unchanged | One second of headroom at 50k tags / 1Hz |

The unchanged rows are not "definitely correct" — they are "no live
data argues for changing them." Re-run the soak scenario after every
substantive driver change, and revise this table when the data does.
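
For reference, the same defaults as a hypothetical options literal (the
nested option-class names are assumptions; only the property names and
values come from the table):

```csharp
// Hypothetical shape: nested option-class names are assumed, property
// names and defaults mirror the table above.
var options = new GalaxyDriverOptions
{
    Gateway = new()
    {
        ConnectTimeoutSeconds     = 10,
        DefaultCallTimeoutSeconds = 30, // bumped from 5 in PR 6.5
        StreamTimeoutSeconds      = 0,  // unlimited: stream outlives any call
    },
    MxAccess   = new() { PublishingIntervalMs = 1000 },
    Reconnect  = new() { InitialBackoffMs = 500, MaxBackoffMs = 30_000 },
    Repository = new() { DiscoverPageSize = 5000 },
};
```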
## Where to look first when something's slow

1. **Slow `Discover`?** Inspect `galaxy.get_hierarchy` span duration
   and `galaxy.object_count`. The gw walks the Galaxy DB serially;
   slow Discovers usually mean a slow ZB SQL.
2. **Subscribe pile-up?** `galaxy.subscribe_bulk` span duration
   correlates with `galaxy.tag_count`. If duration ÷ tag_count starts
   climbing, the gw worker is probably under apartment-lock pressure.
3. **Events stalled?** Watch `galaxy.events.received`. A flat line
   means the gw stream is wedged — kick the reconnect supervisor by
   forcing a `ReinitializeAsync`.
4. **Dropped events?** Non-zero `galaxy.events.dropped` means a slow
   downstream consumer. Profile `OnDataChange` handlers in
   `DriverNodeManager` before bumping the channel capacity.
5. **Memory growing?** Confirm with the soak scenario's working-set
   leak guard. Likely culprits: lingering subscription handles in
   `SubscriptionRegistry`, or a downstream consumer retaining
   `DataValueSnapshot` references past their useful life.