lmxopcua/docs/v2/Galaxy.Performance.md
Joseph Doherty edee47d77f PR 6.W — Galaxy.Performance.md
Documents the four perf surfaces shipped in Phase 6:

- Tracing surface (PR 6.1) — table of every span the driver emits +
  rationale for stream-level (not per-event) coverage.
- Metrics surface (PR 6.2) — three EventPump counters, tagging
  scheme, the bounded-channel design, and the
  received = dispatched + dropped + in-flight invariant.
- Buffered update interval (PR 6.3) — how MxAccess.PublishingIntervalMs
  flows through both subscribe paths and what's still pending on the
  gw side (typed SetBufferedUpdateInterval helper).
- Soak scenario (PR 6.4) — env-var-gated 24h × 50k validation with
  the CI-compressed override recipe.
- Tuned defaults (PR 6.5) — table of every default with source +
  notes; rows marked "unchanged" carry the explicit "no live data
  argues for changing this" caveat.

Closes with a "where to look first when something's slow" runbook
section so on-call doesn't have to re-derive the trace+metric
correlation map from primary docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 17:04:23 -04:00

# Galaxy backend performance
This document covers the performance surface of the in-process
`GalaxyDriver` (the v2 mxgw backend) — the ActivitySource it emits, the
metrics on its EventPump, the soak scenario that validates it, and the
tuning knobs you can reach for when the dev parity rig surfaces a hot
spot.
## Tracing surface (PR 6.1)
The driver emits spans on the `ZB.MOM.WW.OtOpcUa.Driver.Galaxy`
ActivitySource. The driver itself takes no package dependency on
OpenTelemetry — the host process picks the listener (OTLP exporter,
dotnet-trace, Application Insights). Wire it up by calling
`AddSource(...)` on the `TracerProviderBuilder` in the host's tracing
pipeline.
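
A minimal sketch of that host-side wiring, assuming the plain
OpenTelemetry SDK plus the OTLP exporter package; any other listener
registered against the same source name works identically:

```csharp
using OpenTelemetry;
using OpenTelemetry.Trace;

// Host-side wiring sketch: the driver only names the ActivitySource,
// the host decides where the spans go (OTLP shown here as one option).
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("ZB.MOM.WW.OtOpcUa.Driver.Galaxy")
    .AddOtlpExporter()
    .Build();
```
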
| Span | Source | Tags |
|------|--------|------|
| `galaxy.subscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count`, `galaxy.buffered_interval_ms`, `galaxy.success_count` |
| `galaxy.unsubscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count` |
| `galaxy.stream_events` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.event_count` (set on stream end) |
| `galaxy.write` | `TracedGalaxyDataWriter` | `galaxy.client`, `galaxy.tag_count`, `galaxy.secured_write_count`, `galaxy.success_count` |
| `galaxy.get_hierarchy` | `TracedGalaxyHierarchySource` | `galaxy.client`, `galaxy.object_count` |

The stream-events span deliberately covers the *entire* stream lifetime
rather than per-event spans — at 50k tags / 1Hz the per-event volume
would dominate the trace pipeline. Per-event visibility flows through
the metrics surface instead.
## Metrics surface (PR 6.2)
`EventPump` publishes three counters on the
`ZB.MOM.WW.OtOpcUa.Driver.Galaxy` meter, each tagged with
`galaxy.client` so multi-driver hosts can split by source:

| Counter | Unit | Meaning |
|---------|------|---------|
| `galaxy.events.received` | `{event}` | MxEvents read from the gateway StreamEvents stream |
| `galaxy.events.dispatched` | `{event}` | MxEvents that made it through the bounded channel into `OnDataChange` |
| `galaxy.events.dropped` | `{event}` | MxEvents discarded because the bounded channel was full (newest-dropped) |

The invariant is `received = dispatched + dropped + (in-flight in the
channel)`. Watch the dropped counter — it is the leading indicator of
listener back-pressure. A non-zero dropped rate means a downstream
consumer (DriverNodeManager → UA notification queue → client) is
slower than the gw event stream; investigate that consumer before
raising the `EventPump` channel capacity.
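
For a quick in-process view of that indicator without any OpenTelemetry
wiring, a sketch using the BCL `MeterListener` (the meter and counter
names come from the table above; the `long` measurement type and the
console logging are assumptions):

```csharp
using System;
using System.Diagnostics.Metrics;

// Watch only the back-pressure indicator; anything non-zero is worth surfacing.
var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "ZB.MOM.WW.OtOpcUa.Driver.Galaxy" &&
        instrument.Name == "galaxy.events.dropped")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
// Assumes the counters are Counter<long>; each callback delivers a delta.
listener.SetMeasurementEventCallback<long>((instrument, delta, tags, state) =>
    Console.WriteLine($"{instrument.Name} +{delta}"));
listener.Start();
```
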
### Bounded channel design
The pump runs two background tasks:
1. **Producer** — reads from `IGalaxySubscriber.StreamEventsAsync`,
increments `events.received`, and `TryWrite`s into a bounded
`Channel<MxEvent>`. When the channel is full, the producer counts
the drop and continues reading the gw stream so back-pressure does
not propagate upstream (which would stall the gw worker and cascade
to *all* driver instances sharing that worker).
2. **Consumer** — reads from the channel, fans out via
`SubscriptionRegistry`, increments `events.dispatched`.
Default channel capacity is 50_000 (one second of headroom at 50k
tags / 1Hz). Override via the `EventPump` constructor's
`channelCapacity` parameter; the public-facing wiring path in
`GalaxyDriver.EnsureEventPumpStarted` does not yet expose this through
`GalaxyDriverOptions` because no parity scenario has needed it. Add it
when soak data does.
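
A condensed sketch of the producer half described above; `MxEvent` is
stubbed and the plain `long` counters stand in for the real gateway
event type and meter instruments:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Producer-side sketch of the drop-and-keep-reading behaviour, not the shipped EventPump.
sealed class EventPumpProducerSketch
{
    // Default FullMode is Wait, so TryWrite returns false when the channel is
    // full instead of blocking; the gw stream keeps flowing.
    private readonly Channel<MxEvent> _channel = Channel.CreateBounded<MxEvent>(50_000);
    private long _received, _dropped;

    public async Task ProduceAsync(IAsyncEnumerable<MxEvent> gwStream, CancellationToken ct)
    {
        await foreach (var evt in gwStream.WithCancellation(ct))
        {
            _received++;                             // galaxy.events.received
            if (!_channel.Writer.TryWrite(evt))
                _dropped++;                          // galaxy.events.dropped (newest discarded)
        }
    }
}

sealed record MxEvent(string TagName, object Value); // stand-in only
```
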
## Buffered update interval (PR 6.3)
`MxAccess.PublishingIntervalMs` (default 1000) flows through both
subscribe paths:
- `GalaxyDriver.SubscribeAsync` — the caller's `publishingInterval`
  wins when non-zero (the server's UA subscription publishingInterval
  drives this in production). When the caller passes `TimeSpan.Zero`,
  the configured option is the fallback (sketched after this list).
- `PerPlatformProbeWatcher` — the watcher passes the configured value
through `SubscribeBulkAsync` so probe `ScanState` changes publish at
the deployment's chosen cadence.
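
The selection in the first bullet reduces to something like the helper
below (illustrative only; the real logic lives in
`GalaxyDriver.SubscribeAsync`):

```csharp
using System;

static class BufferedIntervalSketch
{
    // Caller-supplied interval wins when non-zero; otherwise fall back to the
    // configured MxAccess.PublishingIntervalMs. Hypothetical helper, not driver code.
    public static TimeSpan Resolve(TimeSpan callerInterval, int configuredMs) =>
        callerInterval > TimeSpan.Zero
            ? callerInterval
            : TimeSpan.FromMilliseconds(configuredMs);
}
```
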
A session-level `SetBufferedUpdateInterval` RPC exists in the gw
protocol but the .NET client doesn't expose a typed helper yet —
adjusting an existing subscription's interval mid-flight is a
follow-up. Today's path subscribes once at the right interval, which
covers the common case.
## Soak scenario (PR 6.4)
`SoakScenarioTests.Soak_HoldsSubscription_AndKeepsEventStreamFlowing`
in `Driver.Galaxy.ParityTests` is the long-running validation. It
subscribes to a configurable number of tags (default 50_000), holds the
subscription for a configurable duration (default 24h), polls the
three counters every minute, and asserts:
- `events.received` continues to grow (gw stream isn't stuck)
- `events.dropped / events.received` stays under the configured
ceiling (default 0.5%)
- process working-set doesn't grow more than 1 GB above baseline
(leak guard)
Always skipped unless the operator opts in:
```bash
# Full 24h × 50k soak (production validation)
OTOPCUA_SOAK_RUN=1 dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
# Compressed CI-friendly run (10min × 1k tags, 1% drop ceiling)
OTOPCUA_SOAK_RUN=1 OTOPCUA_SOAK_MINUTES=10 OTOPCUA_SOAK_TAGS=1000 \
OTOPCUA_SOAK_DROP_PCT=1.0 \
dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
```
The scenario writes a per-minute CSV-style row to stdout
(`soak,<minutes>,received=…,dispatched=…,dropped=…,ws_mb=…`) so an
operator can grep the test runner output mid-run.
## Tuned defaults (PR 6.5)
| Option | Default | Source | Notes |
|--------|---------|--------|-------|
| `Gateway.ConnectTimeoutSeconds` | 10 | unchanged | Cold-start network paths fit comfortably; soak never observed >2s |
| `Gateway.DefaultCallTimeoutSeconds` | 30 | **bumped from 5** in PR 6.5 | A 50k-tag `SubscribeBulk` can exceed 5s under MxAccess COM apartment lock contention; 30s leaves headroom while still failing fast on a wedged worker |
| `Gateway.StreamTimeoutSeconds` | 0 (unlimited) | unchanged | The stream must run for the lifetime of the driver |
| `MxAccess.PublishingIntervalMs` | 1000 | unchanged | Matches the legacy `LMXProxyServer` cadence; deployments needing tighter health visibility can dial down |
| `Reconnect.InitialBackoffMs` | 500 | unchanged | First retry shouldn't dogpile a recovering gw |
| `Reconnect.MaxBackoffMs` | 30_000 | unchanged | 30s ceiling so a long-down gw doesn't sit in 5+ min backoff |
| `Repository.DiscoverPageSize` | 5000 | unchanged | One Galaxy page round-trip per ~5k objects; soak hasn't surfaced pressure |
| `EventPump` channel capacity | 50_000 | unchanged | One second of headroom at 50k tags / 1Hz |

The unchanged rows are not "definitely correct" — they are "no live
data argues for changing them." Re-run the soak scenario after every
substantive driver change, and revise this table when the data argues
for it.
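
For a deployment that does need to move one of these, a sketch of what
the override could look like, assuming the documented option paths hang
off `GalaxyDriverOptions` and the host registers the driver through the
standard options pattern:

```csharp
using Microsoft.Extensions.DependencyInjection;
// plus a using for the namespace that defines GalaxyDriverOptions

static class GalaxyDriverTuningSketch
{
    // Illustrative only: dial the publishing interval down for tighter health
    // visibility while leaving every other default from the table alone.
    public static IServiceCollection TuneGalaxyDriver(this IServiceCollection services) =>
        services.Configure<GalaxyDriverOptions>(o => o.MxAccess.PublishingIntervalMs = 500);
}
```
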
## Where to look first when something's slow
1. **Slow `Discover`?** Inspect `galaxy.get_hierarchy` span duration
and `galaxy.object_count`. The gw walks the Galaxy DB serially;
slow Discovers usually mean a slow ZB SQL.
2. **Subscribe pile-up?** `galaxy.subscribe_bulk` span duration
correlates with `galaxy.tag_count`. If duration ÷ tag_count starts
climbing, the gw worker is probably under apartment-lock pressure.
3. **Events stalled?** Watch `galaxy.events.received`. If it
flat-lines, the gw stream is wedged — kick the reconnect supervisor by
forcing a `ReinitializeAsync`.
4. **Dropped events?** Non-zero `galaxy.events.dropped` means a slow
downstream consumer. Profile `OnDataChange` handlers in
`DriverNodeManager` before bumping the channel capacity.
5. **Memory growing?** Confirm with the soak scenario's working-set
leak guard. Likely culprits: lingering subscription handles in
`SubscriptionRegistry`, or a downstream consumer retaining
`DataValueSnapshot` references past their useful life.