PR 6.W — Galaxy.Performance.md

Documents the perf surfaces shipped in Phase 6:

- Tracing surface (PR 6.1) — table of every span the driver emits +
  rationale for stream-level (not per-event) coverage.
- Metrics surface (PR 6.2) — three EventPump counters, tagging scheme,
  the bounded-channel design, and the received = dispatched + dropped +
  in-flight invariant.
- Buffered update interval (PR 6.3) — how MxAccess.PublishingIntervalMs
  flows through both subscribe paths and what's still pending on the gw
  side (typed SetBufferedUpdateInterval helper).
- Soak scenario (PR 6.4) — env-var-gated 24h × 50k validation with the
  CI-compressed override recipe.
- Tuned defaults (PR 6.5) — table of every default with source + notes;
  rows marked "unchanged" carry the explicit "no live data argues for
  changing this" caveat.

Closes with a "where to look first when something's slow" runbook
section so on-call doesn't have to re-derive the trace+metric
correlation map from primary docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Galaxy backend performance

This document covers the performance surface of the in-process
`GalaxyDriver` (the v2 mxgw backend) — the ActivitySource it emits, the
metrics on its EventPump, the soak scenario that validates it, and the
tuning knobs you can reach for when the dev parity rig surfaces a hot
spot.
## Tracing surface (PR 6.1)

The driver emits spans on the `ZB.MOM.WW.OtOpcUa.Driver.Galaxy`
ActivitySource. There is no package dependency on OpenTelemetry — the
host process picks the listener (OTLP exporter, dotnet-trace,
Application Insights). Wire it via `OpenTelemetry.Trace.AddSource(...)`
in the host's tracing pipeline.
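
A minimal host-side wiring sketch (assumes the standard OpenTelemetry
.NET SDK packages; the OTLP exporter is just one listener choice):

```csharp
using OpenTelemetry;
using OpenTelemetry.Trace;

// Subscribe the host's tracer provider to the driver's ActivitySource.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("ZB.MOM.WW.OtOpcUa.Driver.Galaxy")
    .AddOtlpExporter() // OpenTelemetry.Exporter.OpenTelemetryProtocol package
    .Build();
```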

| Span | Source | Tags |
|------|--------|------|
| `galaxy.subscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count`, `galaxy.buffered_interval_ms`, `galaxy.success_count` |
| `galaxy.unsubscribe_bulk` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.tag_count` |
| `galaxy.stream_events` | `TracedGalaxySubscriber` | `galaxy.client`, `galaxy.event_count` (set on stream end) |
| `galaxy.write` | `TracedGalaxyDataWriter` | `galaxy.client`, `galaxy.tag_count`, `galaxy.secured_write_count`, `galaxy.success_count` |
| `galaxy.get_hierarchy` | `TracedGalaxyHierarchySource` | `galaxy.client`, `galaxy.object_count` |

The stream-events span deliberately covers the *entire* stream lifetime
rather than per-event spans — at 50k tags / 1Hz the per-event volume
would dominate the trace pipeline. Per-event visibility flows through
the metrics surface instead.
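
The shape of that stream span, as a sketch (not the
`TracedGalaxySubscriber` source; the wrapper name and parameters are
illustrative):

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class StreamTracing
{
    private static readonly ActivitySource Source =
        new("ZB.MOM.WW.OtOpcUa.Driver.Galaxy");

    // One span for the whole stream lifetime; per-event work only bumps
    // a local count, recorded as a tag when the stream ends.
    public static async IAsyncEnumerable<T> Traced<T>(
        IAsyncEnumerable<T> inner,
        string clientName,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        using var activity = Source.StartActivity("galaxy.stream_events");
        activity?.SetTag("galaxy.client", clientName);
        long count = 0;
        await foreach (var item in inner.WithCancellation(ct))
        {
            count++;
            yield return item;
        }
        activity?.SetTag("galaxy.event_count", count);
    }
}
```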
## Metrics surface (PR 6.2)

`EventPump` publishes three counters on the
`ZB.MOM.WW.OtOpcUa.Driver.Galaxy` meter, each tagged with
`galaxy.client` so multi-driver hosts can split by source:

| Counter | Unit | Meaning |
|---------|------|---------|
| `galaxy.events.received` | `{event}` | MxEvents read from the gateway StreamEvents stream |
| `galaxy.events.dispatched` | `{event}` | MxEvents that made it through the bounded channel into `OnDataChange` |
| `galaxy.events.dropped` | `{event}` | MxEvents discarded because the bounded channel was full (newest-dropped) |
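
The counters follow the standard `System.Diagnostics.Metrics` shape.
A sketch of their production side (names and unit from the table above;
the client tag value is illustrative):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("ZB.MOM.WW.OtOpcUa.Driver.Galaxy");
var received   = meter.CreateCounter<long>("galaxy.events.received",   unit: "{event}");
var dispatched = meter.CreateCounter<long>("galaxy.events.dispatched", unit: "{event}");
var dropped    = meter.CreateCounter<long>("galaxy.events.dropped",    unit: "{event}");

// Every increment carries the galaxy.client tag so multi-driver hosts
// can split the series by source.
var clientTag = new KeyValuePair<string, object?>("galaxy.client", "galaxy-main");
received.Add(1, clientTag);
```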

The invariant is `received = dispatched + dropped + (in-flight in the
channel)`. Watch the dropped counter — it is the leading indicator of
listener back-pressure. A non-zero dropped rate means a downstream
consumer (DriverNodeManager → UA notification queue → client) is
slower than the gw event stream; investigate that consumer before
raising `EventPump` channel capacity.
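
For an in-process watch on that leading indicator (most hosts would
export via OpenTelemetry instead; the alert action here is a
placeholder):

```csharp
using System.Diagnostics.Metrics;

var listener = new MeterListener
{
    InstrumentPublished = (instrument, l) =>
    {
        // Only subscribe to the driver's dropped-events counter.
        if (instrument.Meter.Name == "ZB.MOM.WW.OtOpcUa.Driver.Galaxy" &&
            instrument.Name == "galaxy.events.dropped")
            l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    if (value > 0)
        Console.WriteLine($"dropped {value} event(s): check downstream consumers");
});
listener.Start();
```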
### Bounded channel design

The pump runs two background tasks (sketched after the list):

1. **Producer** — reads from `IGalaxySubscriber.StreamEventsAsync`,
   increments `events.received`, and `TryWrite`s into a bounded
   `Channel<MxEvent>`. When the channel is full, the producer counts
   the drop and continues reading the gw stream so back-pressure does
   not propagate upstream (which would stall the gw worker and cascade
   to *all* driver instances sharing that worker).
2. **Consumer** — reads from the channel, fans out via
   `SubscriptionRegistry`, increments `events.dispatched`.
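
A sketch of that shape (reusing the counters and `clientTag` from the
metrics sketch above; `MxEvent` is the driver's event type, and
`registry.Dispatch` stands in for the `SubscriptionRegistry` fan-out):

```csharp
using System.Threading.Channels;

var channel = Channel.CreateBounded<MxEvent>(new BoundedChannelOptions(50_000)
{
    SingleReader = true, // one consumer task
    SingleWriter = true, // one producer task
});

// Producer: never blocks on a full channel; it counts the drop instead,
// so back-pressure cannot stall the shared gw stream.
async Task ProduceAsync(IAsyncEnumerable<MxEvent> stream, CancellationToken ct)
{
    await foreach (var evt in stream.WithCancellation(ct))
    {
        received.Add(1, clientTag);
        if (!channel.Writer.TryWrite(evt))
            dropped.Add(1, clientTag); // channel full: newest event discarded
    }
}

// Consumer: drains the channel and fans out to registered listeners.
async Task ConsumeAsync(CancellationToken ct)
{
    await foreach (var evt in channel.Reader.ReadAllAsync(ct))
    {
        registry.Dispatch(evt); // SubscriptionRegistry fan-out (assumed call)
        dispatched.Add(1, clientTag);
    }
}
```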

Default channel capacity is 50_000 (one second of headroom at 50k
tags / 1Hz). Override via the `EventPump` constructor's
`channelCapacity` parameter; the public-facing wiring path in
`GalaxyDriver.EnsureEventPumpStarted` does not yet expose this through
`GalaxyDriverOptions` because no parity scenario has needed it. Add it
when soak data does.
## Buffered update interval (PR 6.3)

`MxAccess.PublishingIntervalMs` (default 1000) flows through both
subscribe paths:

- `GalaxyDriver.SubscribeAsync` — the caller's `publishingInterval`
  wins when non-zero (the server's UA subscription publishingInterval
  drives this in production). When the caller passes `TimeSpan.Zero`,
  the configured option is the fallback (see the sketch after this
  list).
- `PerPlatformProbeWatcher` — the watcher passes the configured value
  through `SubscribeBulkAsync` so probe `ScanState` changes publish at
  the deployment's chosen cadence.
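
The precedence rule, as a one-liner sketch (hypothetical helper name;
the real logic lives inside `GalaxyDriver.SubscribeAsync`):

```csharp
// Caller's interval wins when non-zero; otherwise fall back to the
// configured MxAccess.PublishingIntervalMs.
static TimeSpan EffectiveInterval(TimeSpan callerInterval, GalaxyDriverOptions options) =>
    callerInterval > TimeSpan.Zero
        ? callerInterval
        : TimeSpan.FromMilliseconds(options.MxAccess.PublishingIntervalMs);
```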

A session-level `SetBufferedUpdateInterval` RPC exists in the gw
protocol, but the .NET client doesn't expose a typed helper yet —
adjusting an existing subscription's interval mid-flight is a
follow-up. Today's path subscribes once at the right interval, which
covers the common case.
## Soak scenario (PR 6.4)

`SoakScenarioTests.Soak_HoldsSubscription_AndKeepsEventStreamFlowing`
in `Driver.Galaxy.ParityTests` is the long-running validation. It
subscribes a configurable tag count (default 50_000), holds the
subscription for a configurable duration (default 24h), polls the
three counters every minute, and asserts:

- `events.received` continues to grow (gw stream isn't stuck)
- `events.dropped / events.received` stays under the configured
  ceiling (default 0.5%)
- process working-set doesn't grow more than 1 GB above baseline
  (leak guard; see the sketch after this list)
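
The leak guard, sketched (variable names are illustrative; the real
assertion lives in the test):

```csharp
using System.Diagnostics;

const long OneGiB = 1L * 1024 * 1024 * 1024;

// Baseline before the subscription is established.
long baselineWs = Process.GetCurrentProcess().WorkingSet64;

// ... hold the subscription, polling counters once a minute ...

long growth = Process.GetCurrentProcess().WorkingSet64 - baselineWs;
if (growth > OneGiB)
    throw new InvalidOperationException(
        $"working set grew {growth / (1024 * 1024)} MB above baseline");
```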

The test is always skipped unless the operator opts in:

```bash
# Full 24h × 50k soak (production validation)
OTOPCUA_SOAK_RUN=1 dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/

# Compressed CI-friendly run (10min × 1k tags, 1% drop ceiling)
OTOPCUA_SOAK_RUN=1 OTOPCUA_SOAK_MINUTES=10 OTOPCUA_SOAK_TAGS=1000 \
  OTOPCUA_SOAK_DROP_PCT=1.0 \
  dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.ParityTests/
```

The scenario writes a per-minute CSV-style row to stdout
(`soak,<minutes>,received=…,dispatched=…,dropped=…,ws_mb=…`) so an
operator can grep the test runner output mid-run.
## Tuned defaults (PR 6.5)

| Option | Default | Source | Notes |
|--------|---------|--------|-------|
| `Gateway.ConnectTimeoutSeconds` | 10 | unchanged | Cold-start network paths fit comfortably; soak never observed >2s |
| `Gateway.DefaultCallTimeoutSeconds` | 30 | **bumped from 5** in PR 6.5 | A 50k-tag `SubscribeBulk` can exceed 5s under MxAccess COM apartment lock contention; 30s leaves headroom while still failing fast on a wedged worker |
| `Gateway.StreamTimeoutSeconds` | 0 (unlimited) | unchanged | The stream must run for the lifetime of the driver |
| `MxAccess.PublishingIntervalMs` | 1000 | unchanged | Matches the legacy `LMXProxyServer` cadence; deployments needing tighter health visibility can dial down |
| `Reconnect.InitialBackoffMs` | 500 | unchanged | First retry shouldn't dogpile a recovering gw |
| `Reconnect.MaxBackoffMs` | 30_000 | unchanged | 30s ceiling so a long-down gw doesn't sit in 5+ min backoff |
| `Repository.DiscoverPageSize` | 5000 | unchanged | One Galaxy page round-trip per ~5k objects; soak hasn't surfaced pressure |
| `EventPump` channel capacity | 50_000 | unchanged | One second of headroom at 50k tags / 1Hz |

The unchanged rows are not "definitely correct" — they are "no live
data argues for changing them." Re-run the soak scenario after every
substantive driver change, and revise this table when the data does.
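
For reference, the same defaults as a hypothetical options literal (the
nested option-class names are assumptions; only the property names and
values come from the table):

```csharp
// Hypothetical shape: nested option-class names are assumed, property
// names and defaults mirror the table above.
var options = new GalaxyDriverOptions
{
    Gateway = new()
    {
        ConnectTimeoutSeconds     = 10,
        DefaultCallTimeoutSeconds = 30, // bumped from 5 in PR 6.5
        StreamTimeoutSeconds      = 0,  // unlimited: stream outlives any call
    },
    MxAccess   = new() { PublishingIntervalMs = 1000 },
    Reconnect  = new() { InitialBackoffMs = 500, MaxBackoffMs = 30_000 },
    Repository = new() { DiscoverPageSize = 5000 },
};
```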
## Where to look first when something's slow

1. **Slow `Discover`?** Inspect `galaxy.get_hierarchy` span duration
   and `galaxy.object_count`. The gw walks the Galaxy DB serially;
   slow Discovers usually mean a slow ZB SQL.
2. **Subscribe pile-up?** `galaxy.subscribe_bulk` span duration
   correlates with `galaxy.tag_count`. If duration ÷ tag_count starts
   climbing, the gw worker is probably under apartment-lock pressure.
3. **Events stalled?** Watch `galaxy.events.received`. A flat line
   means the gw stream is wedged — kick the reconnect supervisor by
   forcing a `ReinitializeAsync`.
4. **Dropped events?** Non-zero `galaxy.events.dropped` means a slow
   downstream consumer. Profile `OnDataChange` handlers in
   `DriverNodeManager` before bumping the channel capacity.
5. **Memory growing?** Confirm with the soak scenario's working-set
   leak guard. Likely culprits: lingering subscription handles in
   `SubscriptionRegistry`, or a downstream consumer retaining
   `DataValueSnapshot` references past their useful life.