544a6ddb77
Resolves the 35 findings from the 2026-06-01 baseline (commit 26ba1c7),
test-first for every behavioral change. +51 tests (331 -> 382 passing, 0 failed).
- Telemetry-001 (HIGH): RedactionEnricher now honours property removal, so a
redactor that drops a key actually scrubs the secret from the event.
- Auth: LDAP validator ValidateOnStart; API-key verify no longer fails on a
best-effort MarkUsed write or a corrupt scopes column (fail-closed); LDAP cert
validation hook; KeyPrefix persistence aligned; README algorithm corrected.
- Health: Akka checks return Degraded (not throw) when the cluster isn't up yet;
GrpcDependencyHealthCheck catch-all; null 'description' rendered; composite
endpoint builder; XML docs shipped.
- Audit: CompositeAuditWriter no longer re-throws OperationCanceledException;
TruncatingAuditRedactor over-redact scrubs Target + safe negative max; options
record; XML docs shipped.
- Configuration: TryAddEnumerable idempotent registration; consistent port
quoting; strict invariant port parsing; XML docs + README packaged.
- Theme: mobile toggle is now CSS-only (no Bootstrap JS); token/CSS hygiene;
XML docs on the public parameter surface.
Shared-contract/spec docs updated where the code was the source of truth
(observability service.instance.id, MapZbMetrics, redactor reach). All changes
additive/back-compatible at v0.1.0. code-reviews bookkeeping follows separately.
# Observability: Metric conventions (standardized)

Status: **Standardized**. The naming and unit rules every sister project's instruments must
follow. Analogous to [`../auth/spec/CANONICAL-ROLES.md`](../../auth/spec/CANONICAL-ROLES.md)
for auth and [`../ui-theme/spec/DESIGN-TOKENS.md`](../../ui-theme/spec/DESIGN-TOKENS.md)
for the UI kit. Authoritative alongside [`SPEC.md`](SPEC.md).

The per-project instrument tables below (§6) document the **existing bespoke surface**: the
instruments each app currently defines or intends to define. These stay per-project; they are
not candidates for the shared library. The rules in §1–§3 govern *how* those instruments must
be named and measured.

---

## 1. Meter name

Each app owns exactly **one primary Meter**, named after its root namespace:

| App | Meter name | Status |
|---|---|---|
| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today |
| MxGateway | `MxGateway.Server` | ⚠ Convergence target: rename to `ZB.MOM.WW.MxGateway` on adoption |
| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) |

`MxGateway.Server` is the single convergence item for meter naming. It predates the
`ZB.MOM.WW.*` namespace convention; rename when adopting `AddZbTelemetry`. Instruments
emitted under the old name will require a `recording_rule` or relabel in any Prometheus
config that already scrapes the snapshot, so coordinate before renaming in production.

If an app has secondary meters (e.g. a library component with its own meter), those follow
the same pattern: `ZB.MOM.WW.<App>.<Component>`.
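
The meter naming pattern lends itself to a mechanical check, for example in a CI lint. The following Python sketch is illustrative only and not part of the shared library. `parse_meter_name` is a hypothetical helper, and the assumption that `<App>` and `<Component>` are single alphanumeric segments is inferred from the table above rather than stated by this spec.

```python
import re

# Assumption: <App> and <Component> are single identifiers made of letters
# and digits, starting with a letter (as in the meter table above).
_SEGMENT = r"[A-Za-z][A-Za-z0-9]*"
_METER_NAME = re.compile(rf"ZB\.MOM\.WW\.({_SEGMENT})(?:\.({_SEGMENT}))?")


def parse_meter_name(name: str):
    """Return (app, component_or_None) for a conforming meter name, else None."""
    match = _METER_NAME.fullmatch(name)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Under these assumptions `ZB.MOM.WW.OtOpcUa` parses as a primary meter, while the legacy `MxGateway.Server` is rejected, which matches its status as a convergence item.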

---

## 2. Instrument name

Instrument names follow the pattern `<app>.<subsystem>.<event>`, all lower-case,
dot-separated:

```
<app> := short app identifier: otopcua | mxgateway | scadabridge
<subsystem> := functional area: deploy | session | tag | alarm | gateway | worker | ...
<event> := what happened or is measured: applied | count | duration | errors | active | ...
```

**Examples:**

| Instrument name | App | Meaning |
|---|---|---|
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
| `otopcua.deploy.apply.duration` | OtOpcUa | End-to-end deploy apply duration |
| `mxgateway.sessions.open` | MxGateway | Currently open MxAccess sessions |
| `mxgateway.commands.duration` | MxGateway | End-to-end MXAccess command latency |
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |

**Rules:**

1. All lower-case. No camelCase, no PascalCase, no hyphens.
2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the
   subsystem warrants a sub-area (e.g. `otopcua.deploy.apply.duration`).
3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`,
   `duration`), not implementation details (`method_called`, `loop_iteration`).
4. Counters: past-tense or noun (`received`, `errors`, `applied`).
   UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`).
   Histograms: `duration` or a measured quantity noun (`size`, `lag`).
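
A minimal sketch of how rules 1 and 2 could be enforced mechanically, for example as a unit test over each app's instrument list. It is written in Python purely for illustration. `instrument_name_problems` and `KNOWN_APPS` are hypothetical names, and the segment grammar (lower-case letters, digits, and underscores) is an assumption inferred from the examples above. Rules 3 and 4 concern word choice and still need human review.

```python
import re

# App identifiers listed in the pattern definition above.
KNOWN_APPS = {"otopcua", "mxgateway", "scadabridge"}

# Assumption: a segment starts with a lower-case letter and may contain
# lower-case letters, digits, and underscores (e.g. "service_level_change").
_SEGMENT = re.compile(r"[a-z][a-z0-9_]*")


def instrument_name_problems(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    segments = name.split(".")
    if name != name.lower():
        problems.append("not lower-case")
    if "-" in name:
        problems.append("contains a hyphen")
    if len(segments) < 3:
        problems.append("fewer than three segments")
    if segments[0] not in KNOWN_APPS:
        problems.append("unknown app prefix")
    if any(_SEGMENT.fullmatch(segment) is None for segment in segments):
        problems.append("malformed segment")
    return problems
```

Returning every violation at once, instead of failing on the first, keeps a lint run useful when a name breaks several rules at the same time.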

---

## 3. Units

### Duration: seconds (mandatory)

**All duration histograms MUST use seconds** (`"s"`). This is the OpenTelemetry semantic
convention (`UCUM`: `s`). Backends and dashboards assume seconds; mixing units breaks
aggregations across apps.

> ⚠ **MxGateway convergence item:** `GatewayMetrics.cs` defines three histograms with unit
> `"ms"` (`CommandDuration`, `EventDuration`, `WorkerCallDuration`). These must be migrated
> to `"s"` on adoption. Values must also be converted (divide by 1 000 at the call site).
> Track existing Prometheus `recording_rule`/dashboard changes; any dashboard panel that
> reads these histograms in `ms` will need updating. Until migration is complete, annotate
> the instruments with `// CONVERGENCE: ms→s pending`.
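
The conversion itself is a plain division by 1 000, but it has to be applied consistently. The Python sketch below is illustrative only, since the real change belongs at the call sites that record into these histograms, and the function names are hypothetical. It also scales explicit histogram bucket boundaries, on the assumption that any boundaries configured in milliseconds need the same factor.

```python
MS_PER_SECOND = 1000.0


def ms_to_seconds(value_ms: float) -> float:
    """Convert a duration measured in milliseconds to seconds."""
    return value_ms / MS_PER_SECOND


def bucket_bounds_to_seconds(bounds_ms):
    """Scale explicit histogram bucket boundaries from ms to s.

    Assumption: boundaries were configured in milliseconds; if the app
    relies on SDK defaults there may be nothing to convert.
    """
    return [ms_to_seconds(bound) for bound in bounds_ms]
```

For example, a 250 ms command latency is recorded as 0.25, and a dashboard threshold of 500 (ms) becomes 0.5 (s).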

### Other units

| Quantity | Unit string | Notes |
|---|---|---|
| Duration | `"s"` | Mandatory (see above) |
| Size / bytes | `"By"` | UCUM bytes |
| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred |
| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts |
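
The unit rules can be checked alongside the naming rules. This Python sketch is a hypothetical lint, not an existing tool: `check_unit` and its messages are invented for illustration, and it assumes that any instrument whose name ends in `.duration` is a duration histogram.

```python
import re

# UCUM annotation form such as "{message}" or "{request}".
_ANNOTATION = re.compile(r"\{[a-z_]+\}")

# Unit strings from the table above; None and "" model an omitted unit.
_PLAIN_UNITS = {None, "", "1", "By", "s"}


def check_unit(instrument_name: str, unit):
    """Return None if the unit is acceptable, else a short reason string."""
    if instrument_name.endswith(".duration"):
        # Assumption: a ".duration" suffix identifies a duration histogram.
        return None if unit == "s" else 'duration instruments must use "s"'
    if unit in _PLAIN_UNITS:
        return None
    if _ANNOTATION.fullmatch(unit):
        return None
    return f"unexpected unit string {unit!r}"
```

Run against the MxGateway histograms as they stand today, a check like this would report the three `"ms"` instruments until the migration to `"s"` is complete.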

---

## 4. Resource attribute set (shared across all three signals)

The OTel `Resource` is built once by `AddZbTelemetry` (see [`SPEC.md`](SPEC.md) §2) and
attached to metrics, traces, and OTel-exported logs. The same `SiteId` and `NodeRole` values
populate Serilog enrichers, making a metric, a span, and a log line from the same node
joinable in any OTel-compatible backend.

| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` |
| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"`; do not override |
| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong |
| `service.instance.id` | string | Auto | Always `ZbResource.InstanceId` = deterministic `MachineName:ProcessId`. The OTel SDK random-GUID default is disabled so every signal from one process shares one restart-stable instance id (cross-signal correlation); never override |
| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` |
| `host.name` | string | Auto | Always `Environment.MachineName`; never override |

**Why `site.id` and `node.role` matter:** a ScadaBridge fleet runs N site clusters + one
central cluster, each on different hosts. Without `site.id` and `node.role`, nothing marks
which metrics come from a site node and which from the central node: `host.name` differs per
machine but says nothing about site or role.
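
To make the attribute table concrete, here is a rough Python sketch of the attribute set it describes. It is not the `AddZbTelemetry` implementation: `build_resource_attributes` is a hypothetical helper, `socket.gethostname()` and `os.getpid()` stand in for `Environment.MachineName` and the process id, and treating the four listed `node.role` values as a closed set is an assumption.

```python
import os
import socket

# Assumption: the roles listed in the table are the complete set.
VALID_NODE_ROLES = {"central", "site", "hub", "standalone"}


def build_resource_attributes(service_name, service_version=None,
                              site_id=None, node_role=None):
    """Build the shared resource attribute set as a plain dict."""
    if node_role is not None and node_role not in VALID_NODE_ROLES:
        raise ValueError(f"unknown node.role: {node_role!r}")
    host = socket.gethostname()  # stand-in for Environment.MachineName
    attributes = {
        "service.name": service_name,
        "service.namespace": "ZB.MOM.WW",  # fixed, never overridden
        "service.instance.id": f"{host}:{os.getpid()}",
        "host.name": host,
    }
    # Recommended attributes are omitted when unknown: absent beats wrong.
    optional = {
        "service.version": service_version,
        "site.id": site_id,
        "node.role": node_role,
    }
    attributes.update({k: v for k, v in optional.items() if v is not None})
    return attributes
```

Calling `build_resource_attributes("scadabridge", site_id="plant-7", node_role="site")` would yield the four always-present attributes plus `site.id` and `node.role`; `plant-7` is a made-up site identifier.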

---

## 5. Standard instrumentation baseline

Every app enables this baseline via `AddZbTelemetry`. No opt-out. These are
community-standard instrumentation packages; the overhead is negligible and the benefit
(correlated HTTP / gRPC request traces across the fleet) is high.

| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| HttpClient | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client | Traces | ⛔ not added | ⛔ not added | n/a |
| .NET runtime | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Process | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |

All three projects lack standard instrumentation today; it is added automatically when each
project calls `AddZbTelemetry` (Gap S1 in `GAPS.md`). No project removes any of these once
wired.

---

## 6. Per-app instrument surface (bespoke, stays per project)

These instruments are **not part of the shared library**. They document the existing bespoke
surface that each project registers through `o.Meters` / `o.ActivitySources` in `AddZbTelemetry`.

### 6.1 OtOpcUa: `ZB.MOM.WW.OtOpcUa` meter

Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`
(Code-verified 2026-06-01; see `current-state/otopcua/CURRENT-STATE.md`.)

**Counters (7):**

| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.applied` | Counter | - | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.driver.lifecycle` | Counter | - | Driver lifecycle events (start / stop / restart) |
| `otopcua.virtualtag.eval` | Counter | - | Virtual tag evaluations |
| `otopcua.scriptedalarm.transition` | Counter | - | Scripted alarm state transitions |
| `otopcua.opcua.sink.write` | Counter | - | OPC UA sink write operations |
| `otopcua.redundancy.service_level_change` | Counter | - | Redundancy service-level changes |

**Histograms (1):**

| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.apply.duration` | Histogram | `s` | End-to-end deploy apply duration |

**ActivitySources (spans):**

| Source name | Spans |
|---|---|
| `ZB.MOM.WW.OtOpcUa` | `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild` |

All durations use `"s"`, so there is no unit convergence item for OtOpcUa.

### 6.2 MxGateway: `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)

Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`
(Code-verified 2026-06-01; see `current-state/mxaccessgw/CURRENT-STATE.md`.)

**Counters (13):**

| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.sessions.opened` | `"1"` | New session requests |
| `mxgateway.sessions.closed` | `"1"` | Sessions torn down |
| `mxgateway.commands.started` | `"1"` | MXAccess command dispatched |
| `mxgateway.commands.succeeded` | `"1"` | Command completed OK |
| `mxgateway.commands.failed` | `"1"` | Command error |
| `mxgateway.events.received` | `"1"` | MXAccess events received from worker |
| `mxgateway.queues.overflows` | `"1"` | Queue overflow (backpressure) |
| `mxgateway.faults` | `"1"` | Unhandled gateway faults |
| `mxgateway.workers.killed` | `"1"` | Worker process forcibly terminated |
| `mxgateway.workers.exited` | `"1"` | Worker process exited cleanly |
| `mxgateway.heartbeats.failed` | `"1"` | Worker heartbeat timeouts |
| `mxgateway.grpc.streams.disconnected` | `"1"` | gRPC event stream disconnects |
| `mxgateway.retries.attempted` | `"1"` | Retry attempts (any subsystem) |

**Histograms (3), current unit `ms` (convergence target `s`):**

| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| `mxgateway.workers.startup.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.commands.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.events.stream_send.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |

**Observable gauges (4):**

| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.sessions.open` | `"1"` | Currently open sessions (live count) |
| `mxgateway.workers.running` | `"1"` | Currently running worker processes |
| `mxgateway.events.worker_queue.depth` | `"1"` | Per-worker event queue depth |
| `mxgateway.events.grpc_stream_queue.depth` | `"1"` | Per-stream gRPC send queue depth |

No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource
is left per-project (deferred to GAPS backlog).

### 6.3 ScadaBridge: `ZB.MOM.WW.ScadaBridge` meter

No meter or instruments exist today (`OpenTelemetry.Api` is a dangling ref). The target
meter name `ZB.MOM.WW.ScadaBridge` is reserved. Instruments are defined as part of the
ScadaBridge adoption tracked in [`../GAPS.md`](../GAPS.md).

---

## Consequences and convergence items (accepted)

| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking: requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit `ms` → `s` (3 instruments) | MxGateway adoption | Breaking: values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge; define from scratch |

All three items are tracked as backlog entries in [`../GAPS.md`](../GAPS.md). The ms→s
migration is the highest-priority convergence item because leaving it unresolved means
MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana
workspace.