Resolves the 35 findings from the 2026-06-01 baseline (commit 26ba1c7),
test-first for every behavioral change. +51 tests (331 -> 382 passing, 0 failed).
- Telemetry-001 (HIGH): RedactionEnricher now honours property removal, so a
redactor that drops a key actually scrubs the secret from the event.
- Auth: LDAP validator ValidateOnStart; API-key verify no longer fails on a
best-effort MarkUsed write or a corrupt scopes column (fail-closed); LDAP cert
validation hook; KeyPrefix persistence aligned; README algorithm corrected.
- Health: Akka checks return Degraded (not throw) when the cluster isn't up yet;
GrpcDependencyHealthCheck catch-all; null 'description' rendered; composite
endpoint builder; XML docs shipped.
- Audit: CompositeAuditWriter no longer re-throws OperationCanceledException;
TruncatingAuditRedactor over-redact scrubs Target + safe negative max; options
record; XML docs shipped.
- Configuration: TryAddEnumerable idempotent registration; consistent port
quoting; strict invariant port parsing; XML docs + README packaged.
- Theme: mobile toggle is now CSS-only (no Bootstrap JS); token/CSS hygiene;
XML docs on the public parameter surface.
Shared-contract/spec docs updated where the code was the source of truth
(observability service.instance.id, MapZbMetrics, redactor reach). All changes
additive/back-compatible at v0.1.0. code-reviews bookkeeping follows separately.
Observability — Metric conventions (standardized)
Status: Standardized. These are the naming and unit rules that every sister project's instruments must follow, analogous to ../auth/spec/CANONICAL-ROLES.md for auth and ../ui-theme/spec/DESIGN-TOKENS.md for the UI kit. This document is authoritative alongside SPEC.md.
The per-project instrument tables below (§6) document the existing bespoke surface: the instruments each app currently defines or intends to define. These stay per-project; they are not candidates for the shared library. The rules in §1–§3 govern how those instruments must be named and measured.
1. Meter name
Each app owns exactly one primary Meter, named after its root namespace:
| App | Meter name | Status |
|---|---|---|
| OtOpcUa | ZB.MOM.WW.OtOpcUa | Correct today |
| MxGateway | MxGateway.Server | ⚠ Convergence target: rename to ZB.MOM.WW.MxGateway on adoption |
| ScadaBridge | ZB.MOM.WW.ScadaBridge | Target (no meter exists today) |
MxGateway.Server is the single convergence item for meter naming. It predates the
ZB.MOM.WW.* namespace convention; rename when adopting AddZbTelemetry. Instruments
emitted under the old name will require a recording_rule or relabel in any Prometheus
config that already scrapes the snapshot — coordinate before renaming in production.
If an app has secondary meters (e.g. a library component with its own meter), those follow
the same pattern: ZB.MOM.WW.<App>.<Component>.
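
The meter naming rule can be checked mechanically. The following is a minimal sketch in Python, not part of the shared library: the function name `is_conforming_meter_name` is hypothetical, and the assumption that `<App>` and `<Component>` are PascalCase identifiers is inferred from the names in the table above, not stated by the spec.

```python
import re

# Assumption: <App> and <Component> are PascalCase identifiers.
# The spec itself only fixes the ZB.MOM.WW prefix and the segment order.
METER_NAME = re.compile(r"ZB\.MOM\.WW\.[A-Z][A-Za-z0-9]*(\.[A-Z][A-Za-z0-9]*)?")


def is_conforming_meter_name(name: str) -> bool:
    """Return True when name looks like ZB.MOM.WW.<App> or ZB.MOM.WW.<App>.<Component>."""
    return METER_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("ZB.MOM.WW.OtOpcUa", "MxGateway.Server", "ZB.MOM.WW.ScadaBridge"):
        print(candidate, is_conforming_meter_name(candidate))
```

Run against the table above, a check like this would flag only MxGateway.Server, which matches the single convergence item listed.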
2. Instrument name
Instrument names follow the pattern <app>.<subsystem>.<event>, all lower-case,
dot-separated:
<app> := short app identifier — otopcua | mxgateway | scadabridge
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
<event> := what happened or is measured — applied | count | duration | errors | active | ...
Examples:
| Instrument name | App | Meaning |
|---|---|---|
| otopcua.deploy.applied | OtOpcUa | Galaxy deploy events applied to the address space |
| otopcua.deploy.apply.duration | OtOpcUa | End-to-end deploy apply duration |
| mxgateway.sessions.open | MxGateway | Currently open MxAccess sessions |
| mxgateway.commands.duration | MxGateway | End-to-end MXAccess command latency |
| scadabridge.alarm.received | ScadaBridge | Alarms received by the DCL |
Rules:
- All lower-case. No camelCase, no PascalCase, no hyphens.
- Three segments minimum (<app>.<subsystem>.<event>). Four are permitted when the subsystem warrants a sub-area (e.g. otopcua.deploy.apply.duration).
- Event nouns describe what is counted or measured (applied, errors, active, duration), not implementation details (method_called, loop_iteration).
- Counters: past-tense or noun (received, errors, applied). UpDownCounters / gauges: present-state noun or adjective (active, connected). Histograms: duration or a measured quantity noun (size, lag).
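
The structural rules above lend themselves to a small lint step. The sketch below is illustrative only: `check_instrument_name` and `KNOWN_APPS` are hypothetical names, and the per-segment pattern (lower-case letters, digits, underscores) is an assumption inferred from existing names such as otopcua.redundancy.service_level_change. It does not try to judge the grammatical rules (past tense, present-state nouns), which need a human reviewer.

```python
import re

# App identifiers listed in the <app> definition above.
KNOWN_APPS = {"otopcua", "mxgateway", "scadabridge"}

# Assumption: a segment is lower-case letters, digits, and underscores.
SEGMENT = re.compile(r"[a-z][a-z0-9_]*")


def check_instrument_name(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if name != name.lower():
        problems.append("must be all lower-case")
    segments = name.split(".")
    if len(segments) < 3:
        problems.append("needs at least three segments: <app>.<subsystem>.<event>")
    if segments[0].lower() not in KNOWN_APPS:
        problems.append(f"unknown app identifier '{segments[0]}'")
    for segment in segments:
        # Case is reported once above, so compare the lower-cased segment here.
        if SEGMENT.fullmatch(segment.lower()) is None:
            problems.append(f"segment '{segment}' has invalid characters")
    return problems


if __name__ == "__main__":
    for candidate in ("otopcua.deploy.apply.duration", "mxgateway.faults"):
        print(candidate, check_instrument_name(candidate))
```

A check of this kind would pass otopcua.deploy.apply.duration and report the two-segment mxgateway.faults as short of the three-segment minimum.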
3. Units
Duration — seconds (mandatory)
All duration histograms MUST use seconds ("s"). This is the OpenTelemetry semantic
convention (UCUM: s). Backends and dashboards assume seconds; mixing units breaks
aggregations across apps.
⚠ MxGateway convergence item: GatewayMetrics.cs defines three histograms with unit "ms" (CommandDuration, EventDuration, WorkerCallDuration). These must be migrated to "s" on adoption. Values must also be converted (divide by 1 000 at the call site). Track existing Prometheus recording_rule / dashboard changes: any dashboard panel that reads these histograms in ms will need updating. Until migration is complete, annotate the instruments with // CONVERGENCE: ms→s pending.
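
The arithmetic of the ms→s migration is small but easy to apply on only one side (unit string changed, values not, or the reverse). As a minimal sketch in Python, with the helper name `ms_to_seconds` and the sample values chosen purely for illustration, the conversion at the call site and the matching change to a dashboard threshold look like this:

```python
def ms_to_seconds(value_ms: float) -> float:
    """Convert a duration in milliseconds to seconds (divide by 1 000)."""
    return value_ms / 1000.0


# Hypothetical samples as they would be recorded today, in milliseconds.
samples_ms = [12.0, 250.0, 1500.0]

# After migration the same observations are recorded in seconds.
samples_s = [ms_to_seconds(v) for v in samples_ms]

# Any threshold that was expressed in ms must be converted as well.
alert_threshold_ms = 500.0
alert_threshold_s = ms_to_seconds(alert_threshold_ms)

if __name__ == "__main__":
    print(samples_s)          # [0.012, 0.25, 1.5]
    print(alert_threshold_s)  # 0.5
```

The unit string and the recorded values have to change in the same release; changing only one of them leaves the histogram off by a factor of 1 000.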
Other units
| Quantity | Unit string | Notes |
|---|---|---|
| Duration | "s" | Mandatory (see above) |
| Size / bytes | "By" | UCUM bytes |
| Count (dimensionless) | "1" or omit | For pure event counts; "1" preferred |
| Messages, requests | "{message}", "{request}" | UCUM annotation form for dimensioned counts |
4. Resource attribute set (shared across all three signals)
The OTel Resource is built once by AddZbTelemetry (see SPEC.md §2) and
attached to metrics, traces, and OTel-exported logs. The same SiteId and NodeRole values
populate Serilog enrichers, making a metric, a span, and a log line from the same node
joinable in any OTel-compatible backend.
| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| service.name | string | Yes | Short lower-case app id: otopcua, mxgateway, scadabridge |
| service.namespace | string | Yes | Always "ZB.MOM.WW"; do not override |
| service.version | string | Recommended | Populate from AssemblyInformationalVersion; absent is better than wrong |
| service.instance.id | string | Auto | Always ZbResource.InstanceId = deterministic MachineName:ProcessId. The OTel SDK random-GUID default is disabled so every signal from one process shares one restart-stable instance id (cross-signal correlation); never override |
| site.id | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| node.role | string | Recommended | Node function: "central", "site", "hub", "standalone" |
| host.name | string | Auto | Always Environment.MachineName; never override |
Why site.id and node.role matter: a ScadaBridge fleet runs N site clusters + one
central cluster, each on different hosts. Without site.id and node.role, metrics from a
site node and the central node are indistinguishable even if host.name differs.
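
To make the shape of the attribute set concrete, here is a minimal sketch in Python. It is not the ZbResource implementation: `build_resource_attributes` is a hypothetical helper, `socket.gethostname()` stands in for Environment.MachineName, and `os.getpid()` stands in for the process id. It shows the required and automatic attributes always being present and the recommended ones being omitted, not defaulted, when unknown.

```python
import os
import socket


def build_resource_attributes(service_name, service_version=None,
                              site_id=None, node_role=None):
    """Build the shared resource attribute set described in the table above."""
    host = socket.gethostname()  # stand-in for Environment.MachineName
    attributes = {
        "service.name": service_name.lower(),
        "service.namespace": "ZB.MOM.WW",
        # Deterministic MachineName:ProcessId form, replacing a random GUID.
        "service.instance.id": f"{host}:{os.getpid()}",
        "host.name": host,
    }
    # Recommended attributes: omit when unknown ("absent is better than wrong").
    optional = {
        "service.version": service_version,
        "site.id": site_id,
        "node.role": node_role,
    }
    attributes.update({k: v for k, v in optional.items() if v is not None})
    return attributes


if __name__ == "__main__":
    print(build_resource_attributes("scadabridge", site_id="site-a", node_role="site"))
```

Because the same dictionary would feed metrics, traces, and logs, a query that filters on site.id and node.role selects the same node in all three signals.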
5. Standard instrumentation baseline
Every app enables this baseline via AddZbTelemetry. No opt-out. These are community-
standard instrumentation packages; the overhead is negligible and the benefit (correlated
HTTP / gRPC request traces across the fleet) is high.
| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| HttpClient | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client | Traces | ⛔ not added | ⛔ not added | n/a |
| .NET runtime | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Process | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
All three projects lack standard instrumentation today — it is added automatically when each
project calls AddZbTelemetry (Gap S1 in GAPS.md). No project removes any of these once
wired.
6. Per-app instrument surface (bespoke — stays per project)
These instruments are not part of the shared library. They document the existing bespoke
surface that each project registers through o.Meters / o.ActivitySources in AddZbTelemetry.
6.1 OtOpcUa — ZB.MOM.WW.OtOpcUa meter
Source: src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs
(Code-verified 2026-06-01 — see current-state/otopcua/CURRENT-STATE.md.)
Counters (7):
| Instrument | Kind | Unit | Description |
|---|---|---|---|
| otopcua.deploy.applied | Counter | (none) | Galaxy deploy events applied to the OPC UA address space |
| otopcua.driver.lifecycle | Counter | (none) | Driver lifecycle events (start / stop / restart) |
| otopcua.virtualtag.eval | Counter | (none) | Virtual tag evaluations |
| otopcua.scriptedalarm.transition | Counter | (none) | Scripted alarm state transitions |
| otopcua.opcua.sink.write | Counter | (none) | OPC UA sink write operations |
| otopcua.redundancy.service_level_change | Counter | (none) | Redundancy service-level changes |
Histograms (1):
| Instrument | Kind | Unit | Description |
|---|---|---|---|
| otopcua.deploy.apply.duration | Histogram | s | End-to-end deploy apply duration |
ActivitySources (spans):
| Source name | Spans |
|---|---|
| ZB.MOM.WW.OtOpcUa | otopcua.deploy.apply, otopcua.opcua.address_space_rebuild |
All durations use "s" — no unit convergence item for OtOpcUa.
6.2 MxGateway — MxGateway.Server meter (→ target: ZB.MOM.WW.MxGateway)
Source: src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs
(Code-verified 2026-06-01 — see current-state/mxaccessgw/CURRENT-STATE.md.)
Counters (13):
| Instrument | Unit | Description |
|---|---|---|
| mxgateway.sessions.opened | "1" | New session requests |
| mxgateway.sessions.closed | "1" | Sessions torn down |
| mxgateway.commands.started | "1" | MXAccess command dispatched |
| mxgateway.commands.succeeded | "1" | Command completed OK |
| mxgateway.commands.failed | "1" | Command error |
| mxgateway.events.received | "1" | MXAccess events received from worker |
| mxgateway.queues.overflows | "1" | Queue overflow (backpressure) |
| mxgateway.faults | "1" | Unhandled gateway faults |
| mxgateway.workers.killed | "1" | Worker process forcibly terminated |
| mxgateway.workers.exited | "1" | Worker process exited cleanly |
| mxgateway.heartbeats.failed | "1" | Worker heartbeat timeouts |
| mxgateway.grpc.streams.disconnected | "1" | gRPC event stream disconnects |
| mxgateway.retries.attempted | "1" | Retry attempts (any subsystem) |
Histograms (3) — current unit ms (convergence target s):
| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| mxgateway.workers.startup.duration | "s" | "ms" | ⚠ Convert ms→s on adoption |
| mxgateway.commands.duration | "s" | "ms" | ⚠ Convert ms→s on adoption |
| mxgateway.events.stream_send.duration | "s" | "ms" | ⚠ Convert ms→s on adoption |
Observable gauges (4):
| Instrument | Unit | Description |
|---|---|---|
| mxgateway.sessions.open | "1" | Currently open sessions (live count) |
| mxgateway.workers.running | "1" | Currently running worker processes |
| mxgateway.events.worker_queue.depth | "1" | Per-worker event queue depth |
| mxgateway.events.grpc_stream_queue.depth | "1" | Per-stream gRPC send queue depth |
No ActivitySources today (no tracing). Adding ZB.MOM.WW.MxGateway as an ActivitySource
is left per-project (deferred to GAPS backlog).
6.3 ScadaBridge — ZB.MOM.WW.ScadaBridge meter
No meter or instruments exist today (OpenTelemetry.Api is a dangling ref). The target
meter name ZB.MOM.WW.ScadaBridge is reserved. Instruments are defined as part of the
ScadaBridge adoption tracked in ../GAPS.md.
Consequences and convergence items (accepted)
| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename MxGateway.Server → ZB.MOM.WW.MxGateway | MxGateway adoption | Breaking: requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit ms → s (3 instruments) | MxGateway adoption | Breaking: values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge — define from scratch |
All three items are tracked as backlog entries in ../GAPS.md. The ms→s
migration is the highest-priority convergence item because leaving it unresolved means
MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana
workspace.