diff --git a/components/observability/GAPS.md b/components/observability/GAPS.md
index 8150ccc..b679c74 100644
--- a/components/observability/GAPS.md
+++ b/components/observability/GAPS.md
@@ -168,10 +168,16 @@ app is opt-in and tracked here, not forced.
 
 ## Decisions still open
 
-- Whether `AddZbTelemetry` enables OTLP by default (simplest for new setups) or Prometheus by
-  default (matches OtOpcUa's current posture). Design doc says Prometheus default; OTLP opt-in.
-- Whether the `ms` → `s` conversion and Meter rename are bundled with the initial MxGateway
-  adoption (Task #9) or deferred. Deferring avoids dashboard breaks during the migration window.
 - Canonical `SiteId` and `NodeRole` config binding path — ScadaBridge reads from its own config
   hierarchy; `AddZbSerilog` must accept the value directly (caller-supplied) rather than reading
   from a fixed config section, to remain project-agnostic.
+
+## Decisions settled (no longer open)
+
+- **Prometheus vs OTLP default (SETTLED):** `AddZbTelemetry` defaults to Prometheus (matching
+  OtOpcUa's existing `/metrics` posture). OTLP is opt-in via `ZbTelemetryOptions.Exporter =
+  ZbExporter.Otlp`. See `spec/SPEC.md` §4 and shared-contract `ZbTelemetryOptions.Exporter`.
+- **`ms`→`s` conversion and Meter rename bundling (SETTLED — deferred):** Both the histogram
+  unit migration (Gap U1) and the Meter rename (Gap N1) are deferred from the initial MxGateway
+  adoption (Task #9). They are breaking dashboard/alert changes requiring ops coordination and
+  are tracked as separate backlog items #6 and #7 in the adoption backlog above.
diff --git a/components/observability/current-state/mxaccessgw/CURRENT-STATE.md b/components/observability/current-state/mxaccessgw/CURRENT-STATE.md
index f1bf81d..5619239 100644
--- a/components/observability/current-state/mxaccessgw/CURRENT-STATE.md
+++ b/components/observability/current-state/mxaccessgw/CURRENT-STATE.md
@@ -166,7 +166,7 @@ the MxAccessGateway repo as part of Task #9 (blocked by Task #8 — library buil
 
 **Wire metrics export via `AddZbTelemetry`:**
 
-- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; })`.
+- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; /* temporary — update to "ZB.MOM.WW.MxGateway" when the Meter-rename gap (Gap N1) is closed */ })`.
   This registers the OTel SDK and connects `GatewayMetrics`'s existing `Meter` to the Prometheus
   exporter. The 13 counters, 3 histograms, and 4 gauges **begin exporting** for the first time.
   `GatewayMetrics.cs` itself is unchanged — only the SDK layer is added around it.
diff --git a/components/observability/current-state/scadabridge/CURRENT-STATE.md b/components/observability/current-state/scadabridge/CURRENT-STATE.md
index 843bf76..4da379b 100644
--- a/components/observability/current-state/scadabridge/CURRENT-STATE.md
+++ b/components/observability/current-state/scadabridge/CURRENT-STATE.md
@@ -127,9 +127,10 @@ renaming needed on adoption.
 **Adopt `AddZbSerilog`:**
 
 - Replace the `LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname)` call in
-  `Program.cs:27–54` with `builder.AddZbSerilog(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; o.NodeHostname = cfg.NodeHostname; })`.
+  `Program.cs:27–54` with `builder.AddZbSerilog(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; })`.
   The three enrichers (`SiteId`, `NodeRole`, `NodeHostname`) are now provided by the shared
-  `AddZbSerilog` path; `LoggerConfigurationFactory` can be deleted.
+  `AddZbSerilog` path (`SiteId`/`NodeRole` from options; `NodeHostname` auto from
+  `Environment.MachineName`); `LoggerConfigurationFactory` can be deleted.
 - `ReadFrom.Configuration` for sinks and `MinimumLevel.Is` override from config are preserved
   inside `AddZbSerilog` — behavior is unchanged.
 - The `TraceContextEnricher` is wired automatically by `AddZbSerilog`; once application instruments
diff --git a/components/observability/shared-contract/ZB.MOM.WW.Telemetry.md b/components/observability/shared-contract/ZB.MOM.WW.Telemetry.md
index 6748297..152d39c 100644
--- a/components/observability/shared-contract/ZB.MOM.WW.Telemetry.md
+++ b/components/observability/shared-contract/ZB.MOM.WW.Telemetry.md
@@ -147,10 +147,16 @@ public static class ZbSerilogExtensions
     /// Two-stage Serilog bootstrap:
     /// Stage 1 — minimal console-only bootstrap logger (for startup errors before IConfiguration).
     /// Stage 2 — application logger wired from IConfiguration (ReadFrom.Configuration reads
-    /// Serilog:WriteTo sinks + Serilog:MinimumLevel overrides) with fixed enrichers:
-    /// SiteId, NodeRole, NodeHostname (from ZbTelemetryOptions), TraceContextEnricher,
+    /// Serilog:WriteTo sinks + Serilog:MinimumLevel from "Serilog:MinimumLevel") with
+    /// fixed enrichers: SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from
+    /// Environment.MachineName (auto — not a caller-supplied option); TraceContextEnricher;
     /// and RedactionEnricher (applied only when ILogRedactor is registered).
     ///
+    /// MinimumLevel: AddZbSerilog reads "Serilog:MinimumLevel" from IConfiguration. Callers that
+    /// bind MinimumLevel from a different config key (e.g. ScadaBridge's
+    /// "ScadaBridge:Logging:MinimumLevel") apply that override themselves before or after
+    /// calling AddZbSerilog — this remains per-project and AddZbSerilog does not read it.
+    ///
     /// OTel log export is wired automatically: logs flow through the OTel pipeline with the same
     /// Resource as the metrics and traces (all three signals correlated in a backend).
     ///
diff --git a/components/observability/spec/METRIC-CONVENTIONS.md b/components/observability/spec/METRIC-CONVENTIONS.md
index 07cfd2a..077aa86 100644
--- a/components/observability/spec/METRIC-CONVENTIONS.md
+++ b/components/observability/spec/METRIC-CONVENTIONS.md
@@ -48,16 +48,16 @@ dot-separated:
 | Instrument name | App | Meaning |
 |---|---|---|
 | `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
-| `otopcua.tag.subscriptions` | OtOpcUa | Active OPC UA tag subscriptions |
-| `mxgateway.session.active` | MxGateway | Active MxAccess sessions |
-| `mxgateway.worker.call.duration` | MxGateway | gRPC call duration to the x86 worker |
+| `otopcua.deploy.apply.duration` | OtOpcUa | End-to-end deploy apply duration |
+| `mxgateway.sessions.open` | MxGateway | Currently open MxAccess sessions |
+| `mxgateway.commands.duration` | MxGateway | End-to-end MXAccess command latency |
 | `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |
 
 **Rules:**
 
 1. All lower-case. No camelCase, no PascalCase, no hyphens.
 2. Three segments minimum (`..`). Four are permitted when the
-   subsystem warrants a sub-area (e.g. `mxgateway.worker.call.duration`).
+   subsystem warrants a sub-area (e.g. `mxgateway.commands.duration`).
 3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`,
    `duration`), not implementation details (`method_called`, `loop_iteration`).
 4. Counters: past-tense or noun (`received`, `errors`, `applied`).
@@ -122,14 +122,15 @@ HTTP / gRPC request traces across the fleet) is high.
 
 | Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
 |---|---|---|---|---|
-| ASP.NET Core | Traces + Metrics | ✅ | ✅ | — |
-| HttpClient | Traces + Metrics | ✅ | ✅ | — |
-| gRPC client | Traces | ✅ | — | — |
-| .NET runtime | Metrics | ✅ | — | — |
-| Process | Metrics | ✅ | — | — |
+| ASP.NET Core | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
+| HttpClient | Traces + Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
+| gRPC client | Traces | ⛔ not added | ⛔ not added | n/a |
+| .NET runtime | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
+| Process | Metrics | ⛔ not added | ⛔ not added | ⛔ not added |
 
-OtOpcUa already enables all five. MxGateway and ScadaBridge add the missing ones through
-`AddZbTelemetry`. No project removes any of these.
+All three projects lack standard instrumentation today — it is added automatically when each
+project calls `AddZbTelemetry` (Gap S1 in `GAPS.md`). No project removes any of these once
+wired.
 
 ---
 
@@ -141,63 +142,72 @@ surface that each project registers through `o.Meters` / `o.ActivitySources` in
 ### 6.1 OtOpcUa — `ZB.MOM.WW.OtOpcUa` meter
 
 Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`
+(Code-verified 2026-06-01 — see `current-state/otopcua/CURRENT-STATE.md`.)
+
+**Counters (7):**
 
 | Instrument | Kind | Unit | Description |
 |---|---|---|---|
-| `otopcua.deploy.applied` | Counter | `"1"` | Galaxy deploy events applied to the OPC UA address space |
-| `otopcua.deploy.failed` | Counter | `"1"` | Deploy events that failed processing |
-| `otopcua.tag.subscriptions` | UpDownCounter | `"1"` | Active OPC UA tag subscriptions |
-| `otopcua.tag.reads` | Counter | `"1"` | Tag read operations |
-| `otopcua.tag.writes` | Counter | `"1"` | Tag write operations |
-| `otopcua.session.active` | UpDownCounter | `"1"` | Active OPC UA sessions |
-| `otopcua.connection.gateway` | UpDownCounter | `"1"` | Active gRPC channels to MxAccessGateway |
+| `otopcua.deploy.applied` | Counter | — | Galaxy deploy events applied to the OPC UA address space |
+| `otopcua.driver.lifecycle` | Counter | — | Driver lifecycle events (start / stop / restart) |
+| `otopcua.virtualtag.eval` | Counter | — | Virtual tag evaluations |
+| `otopcua.scriptedalarm.transition` | Counter | — | Scripted alarm state transitions |
+| `otopcua.opcua.sink.write` | Counter | — | OPC UA sink write operations |
+| `otopcua.redundancy.service_level_change` | Counter | — | Redundancy service-level changes |
+
+**Histograms (1):**
+
+| Instrument | Kind | Unit | Description |
+|---|---|---|---|
+| `otopcua.deploy.apply.duration` | Histogram | `s` | End-to-end deploy apply duration |
 
 **ActivitySources (spans):**
 
-| Source name | Span(s) |
+| Source name | Spans |
 |---|---|
-| `ZB.MOM.WW.OtOpcUa` | `DeployWatcher.Apply`, `GalaxyDriver.BrowseHierarchy` |
+| `ZB.MOM.WW.OtOpcUa` | `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild` |
 
-All durations already use `"s"` — no convergence item for OtOpcUa.
+All durations use `"s"` — no unit convergence item for OtOpcUa.
 
 ### 6.2 MxGateway — `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)
 
 Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`
+(Code-verified 2026-06-01 — see `current-state/mxaccessgw/CURRENT-STATE.md`.)
 
 **Counters (13):**
 
 | Instrument | Unit | Description |
 |---|---|---|
-| `mxgateway.session.created` | `"1"` | MxAccess sessions opened |
-| `mxgateway.session.closed` | `"1"` | MxAccess sessions closed |
-| `mxgateway.session.errors` | `"1"` | Session creation/teardown errors |
-| `mxgateway.command.invoked` | `"1"` | MxAccess command invocations |
-| `mxgateway.command.errors` | `"1"` | Command invocation errors |
-| `mxgateway.event.received` | `"1"` | MxAccess events received from worker |
-| `mxgateway.event.errors` | `"1"` | Event processing errors |
-| `mxgateway.worker.started` | `"1"` | x86 worker processes started |
-| `mxgateway.worker.stopped` | `"1"` | x86 worker processes stopped |
-| `mxgateway.worker.errors` | `"1"` | Worker communication errors |
-| `mxgateway.galaxy.browse.requests` | `"1"` | Galaxy Repository browse RPCs |
-| `mxgateway.galaxy.browse.errors` | `"1"` | Galaxy browse errors |
-| `mxgateway.auth.failures` | `"1"` | Authentication failures |
+| `mxgateway.sessions.opened` | `"1"` | New session requests |
+| `mxgateway.sessions.closed` | `"1"` | Sessions torn down |
+| `mxgateway.commands.started` | `"1"` | MXAccess command dispatched |
+| `mxgateway.commands.succeeded` | `"1"` | Command completed OK |
+| `mxgateway.commands.failed` | `"1"` | Command error |
+| `mxgateway.events.received` | `"1"` | MXAccess events received from worker |
+| `mxgateway.queues.overflows` | `"1"` | Queue overflow (backpressure) |
+| `mxgateway.faults` | `"1"` | Unhandled gateway faults |
+| `mxgateway.workers.killed` | `"1"` | Worker process forcibly terminated |
+| `mxgateway.workers.exited` | `"1"` | Worker process exited cleanly |
+| `mxgateway.heartbeats.failed` | `"1"` | Worker heartbeat timeouts |
+| `mxgateway.grpc.streams.disconnected` | `"1"` | gRPC event stream disconnects |
+| `mxgateway.retries.attempted` | `"1"` | Retry attempts (any subsystem) |
 
-**Histograms (3):**
+**Histograms (3) — current unit `ms` (convergence target `s`):**
 
-| Instrument | Unit | Current unit | Convergence |
+| Instrument | Target unit | Current unit | Convergence |
 |---|---|---|---|
-| `mxgateway.command.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
-| `mxgateway.event.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
-| `mxgateway.worker.call.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
+| `mxgateway.workers.startup.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
+| `mxgateway.commands.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
+| `mxgateway.events.stream_send.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
 
-**Gauges (4):**
+**Observable gauges (4):**
 
 | Instrument | Unit | Description |
 |---|---|---|
-| `mxgateway.session.active` | `"1"` | Current active MxAccess sessions |
-| `mxgateway.worker.active` | `"1"` | Current running x86 worker processes |
-| `mxgateway.worker.memory` | `"By"` | Worker process RSS |
-| `mxgateway.galaxy.nodes.cached` | `"1"` | Galaxy Repository nodes in browse cache |
+| `mxgateway.sessions.open` | `"1"` | Currently open sessions (live count) |
+| `mxgateway.workers.running` | `"1"` | Currently running worker processes |
+| `mxgateway.events.worker_queue.depth` | `"1"` | Per-worker event queue depth |
+| `mxgateway.events.grpc_stream_queue.depth` | `"1"` | Per-stream gRPC send queue depth |
 
 No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource is
 left per-project (deferred to GAPS backlog).
diff --git a/components/observability/spec/SPEC.md b/components/observability/spec/SPEC.md
index e5f460a..f86bfab 100644
--- a/components/observability/spec/SPEC.md
+++ b/components/observability/spec/SPEC.md
@@ -13,8 +13,9 @@ logs) via a single `AddZbTelemetry` extension; the shared `Resource` attribute s
 `host.name`) that makes every node distinguishable in a collector; standard instrumentation
 everyone enables (ASP.NET Core, HttpClient, gRPC client, runtime, process meters); exporter
 conventions (Prometheus scrape endpoint default, OTLP opt-in); a shared Serilog bootstrap
-with identity enrichers (`SiteId`, `NodeRole`, `NodeHostname`) bound from the same options
-object as the OTel Resource (metrics and logs therefore carry identical dimensions); a
+with identity enrichers (`SiteId`, `NodeRole` from `ZbTelemetryOptions`; `NodeHostname` auto
+from `Environment.MachineName`) matching the OTel Resource dimensions (metrics and logs
+therefore carry identical dimensions); a
 `TraceContextEnricher` that stamps `trace_id`/`span_id` from `Activity.Current` onto every
 Serilog event, enabling log↔trace correlation; an `ILogRedactor` redaction seam.
 
@@ -53,6 +54,11 @@ This is the headline fix: nobody in the fleet sets a `Resource` or `service.name
 making every node indistinguishable in a collector. Every project must call `AddZbTelemetry`
 to be observable.
 
+> **`IServiceCollection` overload:** `AddZbTelemetry` also has an `IServiceCollection`-based
+> overload for host configurations where `IHostApplicationBuilder` is not available (detailed in
+> the shared-contract). The `IHostApplicationBuilder` overload is the primary path for all three
+> apps on .NET 10.
+
 ## 2. Shared Resource
 
 The OTel `Resource` attached to all three signals is built from `ZbTelemetryOptions`:
@@ -119,15 +125,22 @@ project's bespoke logging bootstrap with a shared two-stage pattern:
 | `TraceContextEnricher` | `trace_id`, `span_id` | `Activity.Current` |
 | `RedactionEnricher` | _(project-defined fields)_ | `ILogRedactor` implementation |
 
-The three identity properties (`SiteId`, `NodeRole`, `NodeHostname`) are bound from the
-same `ZbTelemetryOptions` object as the OTel `Resource`, so logs and metrics/traces carry
-identical dimensions. When no `Activity.Current` is present (e.g. background services,
+`SiteId` and `NodeRole` are bound from the same `ZbTelemetryOptions` object as the OTel
+`Resource`; `NodeHostname` is populated automatically from `Environment.MachineName` (not a
+caller-supplied option). All three identity properties appear on logs and metrics/traces alike,
+so signals from the same node carry identical dimensions. When no `Activity.Current` is present (e.g. background services,
 startup), `TraceContextEnricher` emits nothing — it does not inject empty or zero values.
 
 `MinimumLevel` is set explicitly in code (default `Information`) and can be overridden via
 `IConfiguration` (`Serilog:MinimumLevel`). Sinks are fully config-driven:
 `ReadFrom.Configuration` reads `Serilog:WriteTo` from `appsettings.json` / environment.
 
+> **Per-project config paths:** `AddZbSerilog` reads `Serilog:MinimumLevel` from `IConfiguration`.
+> Callers that bind MinimumLevel from a different key (e.g. ScadaBridge's
+> `ScadaBridge:Logging:MinimumLevel`) apply that override themselves before or after calling
+> `AddZbSerilog`. The config key for MinimumLevel remains per-project; `AddZbSerilog` is not
+> parameterized on it.
+
 OTel log export is wired in the same call: logs flow through the OTel pipeline with the same
 `Resource` attached, making all three signals (metrics / traces / logs) available in a single
 backend.
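
Review note: the SPEC.md hunks name three config keys (`Serilog:MinimumLevel`, `Serilog:WriteTo`, and ScadaBridge's per-project `ScadaBridge:Logging:MinimumLevel`). A minimal `appsettings.json` sketch of how those keys might sit side by side is below. It is not taken from any of the three repos: the `Console` sink and both level values are assumptions chosen for illustration, and the `Console` sink would additionally require the Serilog console sink package to be referenced.

```json
{
  "Serilog": {
    "MinimumLevel": "Information",
    "WriteTo": [
      { "Name": "Console" }
    ]
  },
  "ScadaBridge": {
    "Logging": {
      "MinimumLevel": "Debug"
    }
  }
}
```

Going by the per-project note in the diff, `AddZbSerilog` would read only the `Serilog` section here; the `ScadaBridge:Logging:MinimumLevel` value would take effect only if ScadaBridge applies that override itself.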