diff --git a/components/observability/current-state/mxaccessgw/CURRENT-STATE.md b/components/observability/current-state/mxaccessgw/CURRENT-STATE.md new file mode 100644 index 0000000..f1bf81d --- /dev/null +++ b/components/observability/current-state/mxaccessgw/CURRENT-STATE.md @@ -0,0 +1,191 @@ +# Observability — current state: MxAccessGateway + +Repo: `~/Desktop/MxAccessGateway`. Stack: .NET 10 gateway (x64) + .NET 4.8 worker (**x86**); +solution `src/MxGateway.sln`. Telemetry code is concentrated in +`src/ZB.MOM.WW.MxGateway.Server/Metrics/` (instruments) and +`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/` (logging correlation + redaction). +All paths relative to repo root. Verified 2026-06-01. + +The most unusual observability posture in the family: **13 counters, 3 histograms, and 4 observable +gauges** all fully hand-rolled using `System.Diagnostics.Metrics` directly — but **never exported** +(no OpenTelemetry SDK, no Prometheus exporter, no OTLP). All metric data dies in an in-memory +`GetSnapshot()`. Logging is `Microsoft.Extensions.Logging` exclusively (no Serilog), with a bespoke +correlation scope and a log-redaction pipeline. The net48 x86 worker is out of process and out of +scope — its `IWorkerLogger` (stderr key=value) is not addressed here. + +## 1. Metrics (hand-rolled, unexported) + +### `GatewayMetrics.cs` + +`src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`: + +Meter name: `"MxGateway.Server"` (does not follow the project namespace `ZB.MOM.WW.MxGateway`). + +All instruments are instance members of `GatewayMetrics`. The class is registered as a **singleton** +at `GatewayApplication.cs:62`. There is **no `OpenTelemetry.Extensions.Hosting`**, +**no `AddOpenTelemetry()` call**, and **no exporter** — the `Meter` is created with +`new Meter("MxGateway.Server")` and `GetSnapshot()` is the only read path. 
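To make "unexported" concrete, here is a minimal sketch (not the gateway's actual code) of a `Meter` created directly through `System.Diagnostics.Metrics`. With no OTel SDK and no exporter, measurements are visible only to an in-process `MeterListener`. The meter name and the counter name come from this document; the single-counter layout and the listener wiring are assumptions for illustration.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical stand-in for GatewayMetrics: one counter on the documented meter name.
using var meter = new Meter("MxGateway.Server");
Counter<long> sessionsOpened = meter.CreateCounter<long>("mxgateway.sessions.opened");

long observed = 0;

// An in-process listener is the only consumer when no exporter is registered.
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to instruments that belong to the gateway meter.
    if (instrument.Meter.Name == "MxGateway.Server")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>(
    (instrument, measurement, tags, state) => observed += measurement);
listener.Start();

sessionsOpened.Add(1);
sessionsOpened.Add(1);

// The total exists only in this process; nothing is scraped or pushed anywhere.
Console.WriteLine($"observed in-process: {observed}");
```

Nothing in this sketch leaves the process, which is the situation described above: the instruments work, but no backend can see them.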
+ +**Counters (13):** + +| Instrument name | Tracks | +|---|---| +| `mxgateway.sessions.opened` | New session requests | +| `mxgateway.sessions.closed` | Sessions torn down | +| `mxgateway.commands.started` | MXAccess command dispatched | +| `mxgateway.commands.succeeded` | Command completed OK | +| `mxgateway.commands.failed` | Command error | +| `mxgateway.events.received` | MXAccess events from worker | +| `mxgateway.queues.overflows` | Queue overflow (backpressure) | +| `mxgateway.faults` | Unhandled gateway faults | +| `mxgateway.workers.killed` | Worker process forcibly terminated | +| `mxgateway.workers.exited` | Worker process exited cleanly | +| `mxgateway.heartbeats.failed` | Worker heartbeat timeouts | +| `mxgateway.grpc.streams.disconnected` | gRPC event stream disconnects | +| `mxgateway.retries.attempted` | Retry attempts (any subsystem) | + +**Histograms (3) — unit `ms` (diverges from OTel semconv `s`):** + +| Instrument name | Tracks | +|---|---| +| `mxgateway.workers.startup.duration` | Time from worker spawn to ready | +| `mxgateway.commands.duration` | End-to-end MXAccess command latency | +| `mxgateway.events.stream_send.duration` | gRPC event stream send latency | + +**Observable gauges (4):** + +| Instrument name | Tracks | +|---|---| +| `mxgateway.sessions.open` | Currently open sessions (live count) | +| `mxgateway.workers.running` | Currently running worker processes | +| `mxgateway.events.worker_queue.depth` | Per-worker event queue depth | +| `mxgateway.events.grpc_stream_queue.depth` | Per-stream gRPC send queue depth | + +All 20 instruments share the `mxgateway.*` prefix and `.` naming — consistent +with the family convention. Duration histograms record in **milliseconds** (`ms`); OTel semantic +conventions require seconds (`s`). This is the only project with `ms` histograms. + +### Singleton wiring + +`src/ZB.MOM.WW.MxGateway.Server/GatewayApplication.cs`: +- `:62` — `services.AddSingleton()` registers the metrics singleton. 
+ +There is no `AddOpenTelemetry()` call anywhere in the gateway. The `GatewayMetrics` `Meter` is +created independently of any OTel SDK — it participates in `MeterListener` / `GetSnapshot()` only. +Without the OTel SDK, this data is **invisible to Prometheus, OTLP, or any backend**. + +### No tracing + +No `ActivitySource` is defined. No spans are created. Tracing is entirely absent. + +## 2. Logging (Microsoft.Extensions.Logging) + +All logging in the gateway server uses `Microsoft.Extensions.Logging` (MEL) exclusively. There is +no Serilog dependency. Sink configuration lives in `appsettings.json` (Console, with structured +logging via the default host builder). + +### Correlation scope + +`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogScope.cs`: + +Defines the per-request/per-session correlation property bag. + +`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayRequestLoggingMiddlewareExtensions.cs`: +- `:22–41` — `UseGatewayRequestLogging()` middleware reads the following HTTP headers from each + incoming request: `x-session-id`, `x-worker-process-id`, `x-correlation-id`, `x-command-method`, + `authorization` (for redaction, not logging). +- Registered at `GatewayApplication.cs:34`. + +`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLoggerExtensions.cs`: +- `:11–18` — `BeginGatewayScope(ILogger, GatewayLogScope)` calls `logger.BeginScope(scope)` — + MEL's `ILogger.BeginScope` mechanism, which pushes properties as a scoped dictionary. + +The correlation tuple (`SessionId` / `WorkerProcessId` / `CorrelationId` / `CommandMethod`) is +injected into log lines produced within the scope. No `trace_id` / `span_id` enrichment — there +is no ActivitySource, so this is consistent but leaves no path to trace correlation. 
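The scope mechanism described above can be sketched as follows. This is an illustration only: the real `GatewayLogScope` type and its members are not shown in this document, so a plain dictionary stands in for it and the sample values are placeholders.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.Logging;

// Console provider with scopes enabled; without IncludeScopes the simple
// console formatter does not print scope properties.
using ILoggerFactory factory = LoggerFactory.Create(b =>
    b.AddSimpleConsole(o => o.IncludeScopes = true));
ILogger logger = factory.CreateLogger("GatewayDemo");

// Hypothetical stand-in for GatewayLogScope: the four correlation properties.
var scope = new Dictionary<string, object>
{
    ["SessionId"] = "session-123",
    ["WorkerProcessId"] = 4242,
    ["CorrelationId"] = "corr-abc",
    ["CommandMethod"] = "Read",
};

using (logger.BeginScope(scope))
{
    // Log lines written inside the using block carry the scope properties,
    // provided the configured provider supports scopes.
    logger.LogInformation("Command dispatched");
}
```

Whether the properties actually show up depends on the provider configuration, which is why the sketch enables `IncludeScopes` explicitly.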
+ +### Log redaction — `GatewayLogRedactor.cs` + +`src/ZB.MOM.WW.MxGateway.Server/Diagnostics/GatewayLogRedactor.cs`: + +- Masks sensitive data in log lines for three categories: +  - **`AuthenticateUser`** commands: the password argument is replaced. +  - **`WriteSecured`** commands: the value argument is replaced. +  - **`mxgw_` bearer tokens**: the token body is masked, keeping only the key-id prefix. +- Redaction is applied before the log event is emitted — no sensitive data reaches the sink. + +This is the only project in the family with an explicit log-redaction pipeline. OtOpcUa and +ScadaBridge have no equivalent. + +## 3. Signal summary + +| Signal | Provider | Export | Resource / service.name | +|---|---|---|---| +| Metrics | `System.Diagnostics.Metrics` (`Meter` direct) | ⛔ none (`GetSnapshot()` only) | ⛔ none | +| Traces | — | ⛔ none | ⛔ none | +| Logs | MEL (`Microsoft.Extensions.Logging`) | Console via `appsettings.json` | ⛔ none | +| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource exists) | + +## 4. Notable design choices + +- **`GatewayMetrics` singleton** — all counter/gauge increments are lock-free atomic operations on +  the underlying `Meter` instruments; the singleton is intentional. +- **`ms` histogram unit** — `workers.startup.duration`, `commands.duration`, and +  `events.stream_send.duration` all record in milliseconds. This is non-standard (OTel semconv +  requires `s`) and means raw values differ from OtOpcUa's `s` histograms by a factor of 1000. +- **MEL correlation via `BeginScope`** — MEL scopes are supported by structured logging providers +  (e.g. Serilog.Extensions.Hosting, Seq, Application Insights) but are provider-dependent. The +  scope properties may not appear in all sink configurations, unlike Serilog's `LogContext` which +  is sink-agnostic. +- **Redaction placement** — `GatewayLogRedactor` sits between the caller and the log emission point, +  not inside a sink. 
This is the correct placement; the shared `ILogRedactor` seam preserves this. + +--- + +## Adoption plan → `ZB.MOM.WW.Telemetry` + +**This is the one in-pass adoption.** The MxGateway MEL → Serilog migration is executed as part of +the `ZB.MOM.WW.Telemetry` library build, not deferred as a follow-on. The changes below land in +the MxAccessGateway repo as part of Task #9 (blocked by Task #8 — library build). + +**Migrate logging MEL → `AddZbSerilog`:** + +- Replace `WebApplicationBuilder` default logging with `builder.AddZbSerilog(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; })`. + Gains structured `SiteId` / `NodeRole` / `NodeHostname` enrichers on every log event, plus + `TraceContextEnricher` (currently moot — no spans — but ready for when tracing is added). +- Re-express the `GatewayLogScope` / `BeginGatewayScope` / `UseGatewayRequestLogging` correlation + mechanism as a Serilog `LogContext.PushProperty` scope. The middleware at + `GatewayRequestLoggingMiddlewareExtensions.cs:22–41` is refactored to push the same four + properties (`SessionId`, `WorkerProcessId`, `CorrelationId`, `CommandMethod`) via Serilog's + `LogContext` rather than MEL `BeginScope`. Behavior is identical; portability improves. +- Move `GatewayLogRedactor` behind the shared `ILogRedactor` seam. The redaction policy (which + commands/tokens to scrub and how) stays per-project in a `MxGatewayLogRedactor : ILogRedactor` + implementation; the seam is shared. +- Console + file sinks configured via `ReadFrom.Configuration` in `appsettings.json` — consistent + with OtOpcUa and ScadaBridge's Serilog approach. + +**Wire metrics export via `AddZbTelemetry`:** + +- Add `builder.AddZbTelemetry(o => { o.ServiceName = "mxgateway"; o.SiteId = ...; o.NodeRole = ...; o.Meters = ["MxGateway.Server"]; })`. + This registers the OTel SDK and connects `GatewayMetrics`'s existing `Meter` to the Prometheus + exporter. 
The 13 counters, 3 histograms, and 4 gauges **begin exporting** for the first time. +  `GatewayMetrics.cs` itself is unchanged — only the SDK layer is added around it. +- Add `app.MapZbMetrics()` to expose `/metrics`. + +**Convert histogram unit `ms` → `s`:** + +- Rescale the three histograms: re-create the instruments with unit `s` and multiply recorded +  values by `0.001` at the call site. This is a breaking change to existing dashboards/alerts +  but required for OTel semconv compliance. Tagged as a convergence item in `GAPS.md`. + +**Keep bespoke:** + +- `GatewayMetrics.cs` — all 20 instruments (`mxgateway.*` counters, histograms, gauges) stay +  per-project. `AddZbTelemetry` registers the Meter name; it does not own or replace the instruments. +- Meter name `"MxGateway.Server"` — a follow-on rename to `"ZB.MOM.WW.MxGateway"` is tracked in +  `GAPS.md` but is not required for the initial adoption (it is a Prometheus label change that +  breaks existing dashboards). +- `GatewayApplication.cs:62` singleton registration — unchanged; `GatewayMetrics` remains a +  singleton; `AddZbTelemetry` simply hooks the OTel SDK to it. +- The net48 x86 worker's `IWorkerLogger` (stderr key=value) — out of process and out of scope. +  No changes. diff --git a/components/observability/current-state/otopcua/CURRENT-STATE.md b/components/observability/current-state/otopcua/CURRENT-STATE.md new file mode 100644 index 0000000..df33fcc --- /dev/null +++ b/components/observability/current-state/otopcua/CURRENT-STATE.md @@ -0,0 +1,158 @@ +# Observability — current state: OtOpcUa + +Repo: `~/Desktop/OtOpcUa`. Stack: .NET 10, Akka.NET, OPC UA; solution `ZB.MOM.WW.OtOpcUa.slnx`. +Telemetry code lives in two places: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/` (host-side +bootstrap) and `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/` (instruments + enricher). +All paths relative to repo root. Verified 2026-06-01. 
+ +The most complete observability implementation in the family: OpenTelemetry SDK with both metrics and +tracing signals, Prometheus export, Serilog structured logging with a per-session correlation enricher, +and a dedicated instrument vocabulary. The one significant gap: **no OTel Resource / `service.name`**, +so all signals are indistinguishable from one another and from other fleet members in a backend. + +## 1. Metrics (OpenTelemetry SDK) + +### Bootstrap — `ObservabilityExtensions.cs` + +`src/Server/ZB.MOM.WW.OtOpcUa.Host/Observability/ObservabilityExtensions.cs`: + +- `:18` — `AddOtOpcUaObservability(IServiceCollection)` is the service-registration entry point. +- `:20` — `AddOpenTelemetry()` wires the OTel SDK. +- `:21–23` — `.WithMetrics(b => b.AddMeter(OtOpcUaTelemetry.MeterName).AddPrometheusExporter())`: + registers the application meter and attaches the Prometheus scrape exporter. +- `:24–25` — `.WithTracing(b => b.AddSource(OtOpcUaTelemetry.ActivitySourceName))`: + registers the application activity source for trace data. +- **No `ResourceBuilder` call anywhere** — `service.name`, `service.namespace`, `service.version`, + `site.id`, and `node.role` are not set. The OTel SDK defaults to an empty/SDK-default Resource. +- `:36` — `MapOtOpcUaMetrics(IEndpointRouteBuilder)` maps the Prometheus endpoint. +- `:38` — endpoint path is `/metrics`. + +`Program.cs`: +- `:138` — `builder.Services.AddOtOpcUaObservability()` +- `:160` — `app.MapOtOpcUaMetrics()` + +Package refs in csproj: `OpenTelemetry.Extensions.Hosting`, `OpenTelemetry.Exporter.Prometheus.AspNetCore`. +**No `OpenTelemetry.Exporter.OpenTelemetryProtocol`** — OTLP is not available; Prometheus is the +only export path. + +### Instruments — `OtOpcUaTelemetry.cs` + +`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`: + +- `:19` — `MeterName = "ZB.MOM.WW.OtOpcUa"` (the `Meter` the SDK will collect). 
+- `:20` — `ActivitySourceName = "ZB.MOM.WW.OtOpcUa"` (the `ActivitySource` for spans). + +Instruments defined (all `static readonly` on `OtOpcUaTelemetry`): + +| Instrument | Kind | Unit | Subsystem | +|---|---|---|---| +| `otopcua.deploy.applied` | `Counter` | — | deploy | +| `otopcua.deploy.apply.duration` | `Histogram` | `s` | deploy | +| `otopcua.driver.lifecycle` | `Counter` | — | driver | +| `otopcua.virtualtag.eval` | `Counter` | — | virtual-tag | +| `otopcua.scriptedalarm.transition` | `Counter` | — | scripted-alarm | +| `otopcua.opcua.sink.write` | `Counter` | — | opc-ua sink | +| `otopcua.redundancy.service_level_change` | `Counter` | — | redundancy | + +Two activity spans: `otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`. + +Naming convention: `otopcua..`. Duration histogram correctly uses unit `s` +(OTel semantic conventions). **No standard instrumentation** (ASP.NET Core, HttpClient, runtime, +gRPC client meters) is wired — only the bespoke application instruments. + +## 2. Logging (Serilog) + +### Bootstrap + +`Program.cs`: +- `:49–52` — two-stage Serilog bootstrap: initial logger for startup, then full + `UseSerilog(ReadFrom.Configuration)`. Sinks: Console + rolling file `logs/otopcua-.log`. +- `:141` — `UseSerilogRequestLogging()` on the `WebApplication`. + +### Correlation enricher — `LogContextEnricher.cs` + +`src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/LogContextEnricher.cs`: + +- `:18–36` — `Push(driverInstanceId, driverType, capability, correlationId)` calls + `LogContext.PushProperty` for four properties: + - `DriverInstanceId` — Galaxy driver instance GUID. + - `DriverType` — driver type discriminator. + - `CapabilityName` — OPC UA capability being exercised. + - `CorrelationId` — caller-supplied correlation token. + +This enricher is driver-lifecycle-scoped, not request-scoped — it pushes when a driver operation +begins and is disposable to pop on completion. 
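A driver-scoped push of this kind can be sketched with Serilog's `LogContext`. This is an illustration under assumptions, not the contents of `LogContextEnricher.cs`: the `DriverLogScope` helper, its inner `Popper` class, and the sample values are hypothetical. Only `LogContext.PushProperty` and `Enrich.FromLogContext()` are Serilog APIs, and the console sink is assumed to be referenced.

```csharp
using System;
using Serilog;
using Serilog.Context;

Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext() // required: without it, pushed properties are ignored
    .WriteTo.Console(outputTemplate: "{Message:lj} {Properties:j}{NewLine}")
    .CreateLogger();

// Sample values are placeholders, not real driver identifiers.
using (DriverLogScope.Push(
    "00000000-0000-0000-0000-000000000001", "Galaxy", "Subscribe", "corr-abc"))
{
    // Emitted with all four properties attached.
    Log.Information("Driver operation started");
}

Log.CloseAndFlush();

// Hypothetical helper mirroring the Push(...) shape described above.
static class DriverLogScope
{
    public static IDisposable Push(
        string driverInstanceId, string driverType, string capability, string correlationId)
    {
        var pushed = new[]
        {
            LogContext.PushProperty("DriverInstanceId", driverInstanceId),
            LogContext.PushProperty("DriverType", driverType),
            LogContext.PushProperty("CapabilityName", capability),
            LogContext.PushProperty("CorrelationId", correlationId),
        };
        return new Popper(pushed);
    }

    // Pops the properties in reverse order when the scope is disposed.
    private sealed class Popper : IDisposable
    {
        private readonly IDisposable[] _pushed;

        public Popper(IDisposable[] pushed) => _pushed = pushed;

        public void Dispose()
        {
            for (var i = _pushed.Length - 1; i >= 0; i--)
            {
                _pushed[i].Dispose();
            }
        }
    }
}
```

Returning a single disposable keeps the call site to one `using`, which matches the "push on begin, pop on completion" lifecycle.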
+ +**No `trace_id` / `span_id` enricher.** Although OtOpcUa creates `ActivitySource` spans, the +active `Activity.Current` trace context is never pushed onto Serilog's `LogContext`. A log line +emitted during a span cannot be correlated to the span in a backend. + +**No structural enrichers for `service.name` / `site.id` / `node.role`** — these dimensions are +absent from every log line. ScadaBridge has these; OtOpcUa does not. + +## 3. Signal summary + +| Signal | Provider | Export | Resource / service.name | +|---|---|---|---| +| Metrics | OTel SDK (`Meter` + `WithMetrics`) | Prometheus `/metrics` | ⛔ none | +| Traces | OTel SDK (`ActivitySource` + `WithTracing`) | ⛔ none (no exporter configured) | ⛔ none | +| Logs | Serilog | Console + rolling file | ⛔ none (no `service.name` property) | +| Trace↔log correlation | — | — | ⛔ absent (`trace_id`/`span_id` not pushed) | + +Note: `WithTracing` registers the `ActivitySource` for collection, but no exporter (OTLP or +otherwise) is attached to the tracing pipeline. Spans are created and recorded by the SDK but never +shipped anywhere — effectively a no-op in production. + +## 4. Notable design choices + +- **Instrument naming** follows `..` cleanly and consistently — this is the + pattern the shared spec codifies as the fleet convention. +- **Duration unit** correctly uses `s` on `otopcua.deploy.apply.duration` — no conversion needed on + adoption; this contrasts with MxAccessGateway's `ms` histograms. +- **LogContextEnricher is bespoke but valuable** — the `DriverInstanceId`/`DriverType`/`CapabilityName` + correlation is OtOpcUa-specific domain context; it should survive adoption behind the shared + enricher layer. +- **No OTLP path** — with no OTLP exporter, OtOpcUa cannot send metrics or traces to a collector + (Prometheus is scrape-pull only). This limits operational flexibility. 
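For orientation, the sketch below shows roughly what attaching an OTLP exporter to both pipelines looks like with the raw OTel SDK. It is not the existing `ObservabilityExtensions.cs` and not the shared-library API. It assumes the `OpenTelemetry.Exporter.OpenTelemetryProtocol` package has been added, and the collector endpoint is a placeholder.

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

// Placeholder endpoint; a real deployment would read this from configuration.
var otlpEndpoint = new Uri("http://collector:4317");

builder.Services.AddOpenTelemetry()
    .WithMetrics(b => b
        .AddMeter("ZB.MOM.WW.OtOpcUa")
        .AddPrometheusExporter()                            // existing scrape path
        .AddOtlpExporter(o => o.Endpoint = otlpEndpoint))   // added push path
    .WithTracing(b => b
        .AddSource("ZB.MOM.WW.OtOpcUa")
        .AddOtlpExporter(o => o.Endpoint = otlpEndpoint));  // spans now leave the process

var app = builder.Build();
app.MapPrometheusScrapingEndpoint(); // serves /metrics by default
app.Run();
```

With an exporter on the tracing pipeline, the two existing spans would be shipped to a collector instead of being recorded and discarded.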
+ +--- + +## Adoption plan → `ZB.MOM.WW.Telemetry` + +**Replace with shared bootstrap:** + +- `AddOtOpcUaObservability()` → `builder.AddZbTelemetry(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; o.Meters = [OtOpcUaTelemetry.MeterName]; o.ActivitySources = [OtOpcUaTelemetry.ActivitySourceName]; })`. + This adds the missing `Resource` (gains `service.name` / `service.namespace` / `service.version` / + `site.id` / `node.role` / `host.name` on every metric and span). Prometheus `/metrics` stays the + default exporter; OTLP becomes opt-in via options. +- Add standard instrumentation through `AddZbTelemetry` options: ASP.NET Core meters, HttpClient, + runtime + process meters — none wired today. +- Fix the tracing no-op: wire an OTLP exporter (or at minimum note that tracing is recorded but not + exported); `AddZbTelemetry` provides OTLP as the opt-in path. +- `MapOtOpcUaMetrics` → `app.MapZbMetrics()` (same `/metrics` path; shared convention). + +**Replace with shared Serilog bootstrap:** + +- Serilog bootstrap in `Program.cs:49–52` → `builder.AddZbSerilog(o => { o.ServiceName = "otopcua"; o.SiteId = ...; o.NodeRole = ...; })`. + This adds structural `SiteId` / `NodeRole` / `NodeHostname` properties to every log line + (currently absent) and wires the `TraceContextEnricher` so `trace_id`/`span_id` appear on log + lines emitted during active spans. +- Console + file sinks continue via `ReadFrom.Configuration` in `appsettings.json` — no sink changes + needed. +- `UseSerilogRequestLogging()` stays. + +**Keep bespoke:** + +- `OtOpcUaTelemetry.cs` — the application `Meter`, `ActivitySource`, and all instrument definitions + (`otopcua.*` counters, histograms, spans). These are domain instruments; `AddZbTelemetry` registers + them by name but does not own them. +- `LogContextEnricher.cs` — driver-lifecycle correlation properties (`DriverInstanceId`, + `DriverType`, `CapabilityName`, `CorrelationId`) are OtOpcUa-specific. 
The enricher continues to + push via `LogContext.PushProperty` alongside the shared enrichers. +- `ObservabilityExtensions.cs` itself can be simplified or removed — it becomes a thin wrapper that + calls `AddZbTelemetry` with OtOpcUa-specific options. The per-project entry point remains; only + the implementation body is delegated to the shared library. + +**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Telemetry` +library build. The library build delivers the shared bootstrap and enrichers; adoption lands in the +OtOpcUa repo as a separate commit once the nupkg is available. diff --git a/components/observability/current-state/scadabridge/CURRENT-STATE.md b/components/observability/current-state/scadabridge/CURRENT-STATE.md new file mode 100644 index 0000000..843bf76 --- /dev/null +++ b/components/observability/current-state/scadabridge/CURRENT-STATE.md @@ -0,0 +1,151 @@ +# Observability — current state: ScadaBridge + +Repo: `~/Desktop/ScadaBridge`. Stack: .NET 10, Akka.NET, Docker; solution +`ZB.MOM.WW.ScadaBridge.slnx`. The telemetry posture is split across a dangling OTel package ref +(metrics/traces) and a substantive Serilog setup (logs). All paths relative to repo root. +Verified 2026-06-01. + +Structurally the cleanest logging enricher set in the family — `SiteId` / `NodeRole` / +`NodeHostname` are already first-class Serilog enricher properties — but the weakest on +metrics/tracing: zero instrumentation. The `OpenTelemetry.Api` package reference is a CVE-patch +artefact, not instrumentation. + +## 1. Metrics and traces (absent) + +### `OpenTelemetry.Api` — CVE-patch ref, not instrumentation + +`src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj`: +- `:31` — `` — a **direct version override** added + to satisfy GHSA-g94r-2vxg-569j / GHSA-8785-wc3w-h8q6 (OpenTelemetry 1.9.0 CVEs introduced via + `Akka.Hosting`'s pinned transitive dependency). + +There is **no `AddOpenTelemetry()` call** in the solution. 
No `Meter` is created. No +`ActivitySource` is declared. No exporter is configured. The package reference solely overrides the +transitive version — it has no runtime effect on observability. + +### Instrument coverage + +Zero application instruments. There is no custom `Meter`, no counter, no histogram, no gauge, and +no span in the ScadaBridge codebase. This is the largest gap in the family. + +## 2. Logging (Serilog — strongest enricher set) + +### Two-stage bootstrap + +`src/ZB.MOM.WW.ScadaBridge.Host/Program.cs`: +- `:27–54` — two-stage Serilog bootstrap: an initial logger is created for startup messages before + the host is built; the full logger replaces it during `UseSerilog`. + +### `LoggerConfigurationFactory.cs` + +`src/ZB.MOM.WW.ScadaBridge.Host/LoggerConfigurationFactory.cs`: + +Full factory method signature: `Build(IConfiguration config, string nodeRole, string siteId, string nodeHostname)`. + +- `:62` — reads `ScadaBridge:Logging:MinimumLevel` from configuration. +- `:84` — `ReadFrom.Configuration(config)` pulls sink configuration from `appsettings.json`. +- `:85` — explicit `MinimumLevel.Is(...)` override from the typed option. +- `:86–88` — three structural enrichers: + - `.Enrich.WithProperty("SiteId", siteId)` — site identifier (e.g. `"site-a"`). + - `.Enrich.WithProperty("NodeHostname", nodeHostname)` — node hostname. + - `.Enrich.WithProperty("NodeRole", nodeRole)` — Akka cluster role (e.g. `"central"`, `"site"`). + +These three properties are the cleanest and most complete set in the family. ScadaBridge's property +names (`SiteId` / `NodeRole` / `NodeHostname`) are also the ones the shared `AddZbTelemetry` +options object maps onto `site.id` / `node.role` / `host.name` OTel Resource attributes — no +renaming needed on adoption. + +### Sink configuration + +`appsettings.json:3–23` — Serilog sinks configured via `ReadFrom.Configuration`: +- Console sink with output template that includes `[{NodeRole}/{NodeHostname}]`. 
+- File sink (path in config; rolling interval). + +### `LoggingOptions.cs` + +`src/ZB.MOM.WW.ScadaBridge.Host/LoggingOptions.cs`: +- `MinimumLevel` — config-bound minimum level; default `Information`. + +### Missing elements + +- **No custom enrichers** beyond the three structural properties. `LogContextEnricher` (OtOpcUa's + driver-correlation enricher) has no equivalent; MxGateway's per-session correlation scope has no + equivalent. Per-request/per-operation correlation is not present. +- **No `trace_id` / `span_id` enricher.** As with the other two projects, log lines do not carry + trace context. Because ScadaBridge has zero `ActivitySource` instrumentation, this is consistent — + but it means no trace↔log correlation path exists even hypothetically. + +## 3. Signal summary + +| Signal | Provider | Export | Resource / service.name | +|---|---|---|---| +| Metrics | ⛔ none | ⛔ none | ⛔ none | +| Traces | ⛔ none | ⛔ none | ⛔ none | +| Logs | Serilog | Console + file (`appsettings.json`) | ⛔ none (no `service.name` property) | +| Trace↔log correlation | — | — | ⛔ absent (no ActivitySource; no enricher) | + +## 4. Notable design choices + +- **`SiteId` / `NodeRole` / `NodeHostname` as first-class enrichers** — unlike OtOpcUa's driver- + scoped `LogContextEnricher`, ScadaBridge's structural enrichers are attached at logger creation and + appear on every log line from the process. This is the target pattern for the shared bootstrap. +- **`nodeRole` + `siteId` passed into the factory** — ScadaBridge's `LoggerConfigurationFactory.Build` + takes these as constructor arguments rather than reading them from a registered options object. + The shared `AddZbSerilog` approach binds them from the same `ZbTelemetryOptions` used for the OTel + Resource, unifying the source. +- **Config-driven `MinimumLevel`** — `ScadaBridge:Logging:MinimumLevel` is a typed config path; + `ReadFrom.Configuration` for sinks. The shared bootstrap's `AddZbSerilog` must support the same + pattern. 
+- **No custom enrichers** — ScadaBridge's logging is intentionally minimal on operation-scoped + context. Correlation in the distributed model is provided by structured log fields from Akka + actor context, not a log enricher pipeline. +- **CVE-patch ref discipline** — the `OpenTelemetry.Api` pin is a responsible CVE response but + leaves the telemetry story incomplete. On adoption, the CVE pin is superseded by the full OTel SDK + pulled in by `AddZbTelemetry`; the explicit `` override can be removed. + +--- + +## Adoption plan → `ZB.MOM.WW.Telemetry` + +**Replace CVE-patch ref with full OTel SDK via `AddZbTelemetry`:** + +- Remove the lone `OpenTelemetry.Api` override from + `src/ZB.MOM.WW.ScadaBridge.Host/ZB.MOM.WW.ScadaBridge.Host.csproj:31`. +- Add `builder.AddZbTelemetry(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; o.Meters = ["ZB.MOM.WW.ScadaBridge"]; })`. + The full OTel SDK supersedes the transitive version override; the CVE is resolved transitively + via the SDK's current dependency. + +**Add first application instruments:** + +- Define a `ScadaBridgeTelemetry` class (mirror `OtOpcUaTelemetry`) with a `Meter` named + `"ZB.MOM.WW.ScadaBridge"` and an initial set of instruments covering the most observable + operations: site connection lifecycle, alarm received, data-change received, actor supervision + events. Naming convention: `scadabridge..`. +- Register the meter name in `AddZbTelemetry` options. Expose `/metrics` via `app.MapZbMetrics()`. + ScadaBridge goes from zero instrumentation to a baseline exportable set. + +**Adopt `AddZbSerilog`:** + +- Replace the `LoggerConfigurationFactory.Build(config, nodeRole, siteId, nodeHostname)` call in + `Program.cs:27–54` with `builder.AddZbSerilog(o => { o.ServiceName = "scadabridge"; o.SiteId = cfg.SiteId; o.NodeRole = cfg.NodeRole; o.NodeHostname = cfg.NodeHostname; })`. 
+ The three enrichers (`SiteId`, `NodeRole`, `NodeHostname`) are now provided by the shared + `AddZbSerilog` path; `LoggerConfigurationFactory` can be deleted. +- `ReadFrom.Configuration` for sinks and `MinimumLevel.Is` override from config are preserved + inside `AddZbSerilog` — behavior is unchanged. +- The `TraceContextEnricher` is wired automatically by `AddZbSerilog`; once application instruments + are added (above), `trace_id` / `span_id` will appear on log lines emitted during spans. + +**Keep bespoke:** + +- `LoggingOptions.cs` — the `MinimumLevel` typed option and its config path + (`ScadaBridge:Logging:MinimumLevel`) remain; `AddZbSerilog` must accept the minimum-level + override from configuration. The config path stays ScadaBridge's own. +- Console output template including `[{NodeRole}/{NodeHostname}]` — driven by `appsettings.json`; + no change. +- Akka actor-context log fields — per-operation context emitted by Akka infrastructure; not an + enricher concern. +- `ZB.MOM.WW.ScadaBridge.Host.csproj` package set otherwise — no other changes to the project file. + +**Adoption is a follow-on task** (tracked in `GAPS.md`), not part of the `ZB.MOM.WW.Telemetry` +library build. Adding instruments and adopting `AddZbSerilog`/`AddZbTelemetry` lands in the +ScadaBridge repo as a separate commit once the nupkg is available. diff --git a/components/observability/shared-contract/ZB.MOM.WW.Telemetry.md b/components/observability/shared-contract/ZB.MOM.WW.Telemetry.md new file mode 100644 index 0000000..6748297 --- /dev/null +++ b/components/observability/shared-contract/ZB.MOM.WW.Telemetry.md @@ -0,0 +1,248 @@ +# Proposed shared library: `ZB.MOM.WW.Telemetry` + +A contract on paper — the public surface to extract so the three projects stop implementing +observability separately. Realizes [`../spec/SPEC.md`](../spec/SPEC.md) and +[`../spec/METRIC-CONVENTIONS.md`](../spec/METRIC-CONVENTIONS.md). 
**Not yet created.** +Reference implementations already exist: OtOpcUa `ObservabilityExtensions.cs` (OTel + Serilog), +ScadaBridge `LoggerConfigurationFactory.cs` (Serilog enrichers), MxGateway +`GatewayMetrics.cs` + `GatewayLogRedactor.cs`. + +## Packages (.NET 10) + +``` +ZB.MOM.WW.Telemetry # OTel bootstrap: Resource, metrics, traces, exporters +ZB.MOM.WW.Telemetry.Serilog # Serilog bootstrap: enrichers, TraceContextEnricher, ILogRedactor +``` + +Both packages are .NET 10 — all three logging-bearing processes are .NET 10 (OtOpcUa server, +mxaccessgw gateway, ScadaBridge central). The x86 net48 mxaccessgw worker uses a bespoke +`IWorkerLogger` (stderr key=value); net48 multi-targeting is **not** required. Published to +the Gitea NuGet feed; SemVer; lockstep to start. + +## Packaging & distribution + +**Two NuGet packages, one DLL each**, on the Gitea NuGet feed. Libraries linked into each +app — there is no central telemetry service. Both packages are consumed by all three apps +after adoption: + +| Package (→ DLL) | Transitive deps | OtOpcUa | MxGateway | ScadaBridge | +|---|---|---|---|---| +| `…Telemetry` | OpenTelemetry SDK, `OpenTelemetry.Exporter.Prometheus.AspNetCore`, `OpenTelemetry.Exporter.OpenTelemetryProtocol`, standard instrumentation packages | ✅ | ✅ | ✅ | +| `…Telemetry.Serilog` | Serilog, `Serilog.Extensions.Hosting`, `Serilog.AspNetCore` (version note below) | ✅ | ✅ | ✅ | + +> **`Serilog.AspNetCore` version split (open convergence note):** OtOpcUa and ScadaBridge +> target .NET 10 and may use `Serilog.AspNetCore` 9.x; MxGateway's adoption starts from +> `Serilog.AspNetCore` 9.x as well. If a project remains on .NET 8 ASP.NET Core for any +> reason, the compatible version is `Serilog.AspNetCore` 8.x. Coordinate the version floor +> when the first app takes a dependency and pin it in `Directory.Packages.props`. + +--- + +## `ZB.MOM.WW.Telemetry` + +```csharp +namespace ZB.MOM.WW.Telemetry; + +/// Selects how instrumentation data is exported. 
+public enum ZbExporter +{ + /// Prometheus scrape endpoint (default). Call app.MapZbMetrics() to mount /metrics. + Prometheus, + + /// OTLP gRPC export. Set OtlpEndpoint (e.g. "http://collector:4317"). + /// Coexists with Prometheus when both endpoints are desired. + Otlp, +} + +/// Options for AddZbTelemetry. All properties feed the shared OTel Resource and +/// Serilog enrichers (via AddZbSerilog in the .Serilog package). +public sealed class ZbTelemetryOptions +{ + /// Required. Short lower-case app identifier — e.g. "otopcua", "mxgateway", "scadabridge". + /// Populates OTel Resource service.name. + public string ServiceName { get; set; } = ""; + + /// Fleet-wide namespace. Default "ZB.MOM.WW". Do not override per-app. + /// Populates OTel Resource service.namespace. + public string ServiceNamespace { get; set; } = "ZB.MOM.WW"; + + /// Optional. Populate from AssemblyInformationalVersion. + /// Populates OTel Resource service.version. + public string? ServiceVersion { get; set; } + + /// Optional. Physical or logical site identifier. + /// Populates OTel Resource site.id and Serilog property SiteId. + public string? SiteId { get; set; } + + /// Optional. Node function: "central", "site", "hub", "standalone". + /// Populates OTel Resource node.role and Serilog property NodeRole. + public string? NodeRole { get; set; } + + /// App-specific Meter names to register with the OTel MeterProvider. + /// Always register the app's primary Meter here. Standard instrumentation meters are + /// added automatically (ASP.NET Core, HttpClient, runtime, process). + public string[] Meters { get; set; } = []; + + /// App-specific ActivitySource names to register with the OTel TracerProvider. + public string[] ActivitySources { get; set; } = []; + + /// Export path. Default Prometheus; use Otlp for a real collector. + public ZbExporter Exporter { get; set; } = ZbExporter.Prometheus; + + /// Required when Exporter = ZbExporter.Otlp. + /// OTLP gRPC endpoint, e.g. 
"http://collector:4317". + public string? OtlpEndpoint { get; set; } +} + +/// Extension point for configuring the OTel bootstrap on an IHostApplicationBuilder. +public static class ZbTelemetryExtensions +{ + /// Configures the OpenTelemetry MeterProvider and TracerProvider with the shared Resource, + /// standard instrumentation (ASP.NET Core, HttpClient, gRPC client, runtime, process), + /// the app's own Meters and ActivitySources, and the selected exporter. + /// Does NOT configure Serilog — call AddZbSerilog() in the .Serilog package for that. + public static IHostApplicationBuilder AddZbTelemetry( + this IHostApplicationBuilder builder, + Action configure); + + /// IServiceCollection overload for contexts where IHostApplicationBuilder is not available. + /// Requires the caller to supply a pre-built ZbTelemetryOptions (Resource attributes must + /// be populated before DI composition, so the options-object overload is preferred). + public static IServiceCollection AddZbTelemetry( + this IServiceCollection services, + ZbTelemetryOptions options); +} + +/// Builds the shared OTel ResourceBuilder from ZbTelemetryOptions. +/// Used internally by AddZbTelemetry. Exposed for tests and custom pipelines. +public static class ZbResource +{ + /// Returns a ResourceBuilder pre-populated with service.name, service.namespace, + /// service.version, site.id, node.role, and host.name (always Environment.MachineName). + /// Attributes with null values are omitted from the Resource. + public static ResourceBuilder Build(ZbTelemetryOptions options); +} + +/// Endpoint extension for mounting the Prometheus /metrics scrape endpoint. +public static class ZbMetricsEndpointExtensions +{ + /// Mounts the Prometheus /metrics endpoint. + /// Only valid when ZbTelemetryOptions.Exporter = ZbExporter.Prometheus (or both). + /// Call after app.UseRouting(). 
+ public static IEndpointConventionBuilder MapZbMetrics( + this IEndpointRouteBuilder endpoints); +} +``` + +--- + +## `ZB.MOM.WW.Telemetry.Serilog` + +```csharp +namespace ZB.MOM.WW.Telemetry.Serilog; + +/// Extension point for configuring the Serilog two-stage bootstrap on an IHostApplicationBuilder. +public static class ZbSerilogExtensions +{ + /// Two-stage Serilog bootstrap: + /// Stage 1 — minimal console-only bootstrap logger (for startup errors before IConfiguration). + /// Stage 2 — application logger wired from IConfiguration (ReadFrom.Configuration reads + /// Serilog:WriteTo sinks + Serilog:MinimumLevel overrides) with fixed enrichers: + /// SiteId, NodeRole, NodeHostname (from ZbTelemetryOptions), TraceContextEnricher, + /// and RedactionEnricher (applied only when ILogRedactor is registered). + /// + /// OTel log export is wired automatically: logs flow through the OTel pipeline with the same + /// Resource as the metrics and traces (all three signals correlated in a backend). + /// + /// The configure delegate receives the same ZbTelemetryOptions used by AddZbTelemetry. + /// Typically share a single options-population lambda across both calls. + public static IHostApplicationBuilder AddZbSerilog( + this IHostApplicationBuilder builder, + Action<ZbTelemetryOptions> configure); +} + +/// Canonical Serilog property name constants for the identity enrichers. +/// Use these constants — not literal strings — when querying properties in sinks or tests. +public static class ZbLogEnricherNames +{ + /// Serilog property: physical or logical site identifier. Matches OTel Resource site.id. + public const string SiteId = "SiteId"; + + /// Serilog property: node function (central, site, hub, standalone). Matches OTel node.role. + public const string NodeRole = "NodeRole"; + + /// Serilog property: machine name (Environment.MachineName). Matches OTel host.name. 
+ public const string NodeHostname = "NodeHostname"; +} + +/// Stamps trace_id and span_id from Activity.Current onto every Serilog log event. +/// When Activity.Current is null (no active span — background services, startup, non-traced paths) +/// the enricher emits nothing; it does NOT inject empty strings or zero values. +/// This enables a log line to be clicked through to its originating trace in a backend. +public sealed class TraceContextEnricher : ILogEventEnricher +{ + public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory); +} + +/// Seam for project-specific log-event redaction. +/// The shared library applies this via RedactionEnricher; each project provides its own +/// implementation that knows which fields (by property name) or which command payloads +/// must not leave the process in log events. +/// If no ILogRedactor is registered in DI, RedactionEnricher is a no-op. +public interface ILogRedactor +{ + /// Inspect and mutate properties in-place. Remove or replace any sensitive values. + /// Called on every log event before it reaches any sink. + void Redact(IDictionary properties); +} + +/// Applies a registered ILogRedactor to every Serilog log event. +/// Registered automatically by AddZbSerilog. The enricher resolves ILogRedactor from DI +/// on first use; if none is registered it is permanently inert (no DI call per event). +public sealed class RedactionEnricher : ILogEventEnricher +{ + public RedactionEnricher(IServiceProvider serviceProvider); + public void Enrich(LogEvent logEvent, ILogEventPropertyFactory propertyFactory); +} +``` + +--- + +## Consumer matrix + +| Consumer | Packages | Notes | +|---|---|---| +| **MxGateway** | Both | MEL → Serilog migration: `GatewayLogScope`/`BeginScope` → `LogContext.PushProperty`; `GatewayLogRedactor` → `ILogRedactor` impl; `GatewayMetrics` stays, wired through `o.Meters`. 
**Done in this release.** | +| **OtOpcUa** | Both | Consolidate existing Serilog bootstrap; add `TraceContextEnricher` + `SiteId`/`NodeRole` enrichers; add Resource to existing OTel pipeline. Deferred to GAPS backlog. | +| **ScadaBridge** | Both | Add full OTel SDK (metrics + traces + export); consolidate `LoggerConfigurationFactory`; add `TraceContextEnricher`. Deferred to GAPS backlog. | + +The net48 x86 mxaccessgw worker is excluded from both packages. Its `IWorkerLogger` +(stderr key=value format) is an out-of-process concern and remains bespoke. + +--- + +## Open contract questions + +1. **`IServiceCollection` overload completeness:** the `IHostApplicationBuilder`-based + overload is the primary path (available in all three apps on .NET 10). The + `IServiceCollection` overload is a fallback for unusual host configurations. Validate + that both overloads wire OTel log export identically (same Resource, same enrichers). + +2. **OTel log export channel:** `AddZbSerilog` uses `Serilog.Sinks.OpenTelemetry` to push + logs into the OTel pipeline (sharing the Resource). Confirm the sink version is + compatible with the OpenTelemetry SDK version pinned in `ZB.MOM.WW.Telemetry` + (`Directory.Packages.props`). + +3. **`RedactionEnricher` DI timing:** `RedactionEnricher` resolves `ILogRedactor` from + `IServiceProvider` on first use (lazy, to avoid a circular-DI problem during Serilog's + two-stage bootstrap). Validate that the service provider is fully built by the time the + first post-startup log event fires. If MxGateway's `GatewayLogRedactor` has dependencies + that are not available at stage-1 bootstrap time, the lazy-resolve pattern protects it. + +4. **`SiteId` / `NodeRole` null handling:** `AddZbTelemetry` and `AddZbSerilog` silently + omit null `SiteId`/`NodeRole` from the Resource and enricher set. 
Confirm this is the + correct behavior for OtOpcUa, which may run in a single-site configuration where neither + field is meaningful, versus ScadaBridge, where `SiteId` is essential for multi-cluster + fleet visibility. + +See [`../GAPS.md`](../GAPS.md) for the adoption order and effort/risk. diff --git a/components/observability/spec/METRIC-CONVENTIONS.md b/components/observability/spec/METRIC-CONVENTIONS.md new file mode 100644 index 0000000..07cfd2a --- /dev/null +++ b/components/observability/spec/METRIC-CONVENTIONS.md @@ -0,0 +1,224 @@ +# Observability — Metric conventions (standardized) + +Status: **Standardized**. The naming and unit rules every sister project's instruments must +follow. Analogous to [`../auth/spec/CANONICAL-ROLES.md`](../../auth/spec/CANONICAL-ROLES.md) +for auth and [`../ui-theme/spec/DESIGN-TOKENS.md`](../../ui-theme/spec/DESIGN-TOKENS.md) +for the UI kit. Authoritative alongside [`SPEC.md`](SPEC.md). + +The per-project instrument tables below (§6) document the **existing bespoke surface** — the +instruments each app currently defines or intends to define. These stay per-project; they are +not candidates for the shared library. The rules in §1–§3 govern *how* those instruments must +be named and measured. + +--- + +## 1. Meter name + +Each app owns exactly **one primary Meter**, named after its root namespace: + +| App | Meter name | Status | +|---|---|---| +| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today | +| MxGateway | `MxGateway.Server` | ⚠ Convergence target — rename to `ZB.MOM.WW.MxGateway` on adoption | +| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) | + +`MxGateway.Server` is the single convergence item for meter naming. It predates the +`ZB.MOM.WW.*` namespace convention; rename when adopting `AddZbTelemetry`. Instruments +emitted under the old name will require a `recording_rule` or relabel in any Prometheus +config that already scrapes the snapshot — coordinate before renaming in production. 
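
The reason the rename needs coordination is that anything subscribing to a meter does so by its name string. The following is a minimal sketch of that behaviour using only `System.Diagnostics.Metrics`; it is not the shared library's code, and the `Collect` helper is hypothetical. Only the two meter-name strings come from the table above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical helper: records one measurement on a meter called actualMeterName and
// returns what a listener subscribed to subscribedName actually observed.
static List<long> Collect(string subscribedName, string actualMeterName)
{
    var seen = new List<long>();
    using var meter = new Meter(actualMeterName);
    using var listener = new MeterListener();

    // Enable an instrument only when its meter name matches the subscription,
    // which is conceptually what a by-name meter registration does.
    listener.InstrumentPublished = (instrument, l) =>
    {
        if (instrument.Meter.Name == subscribedName)
            l.EnableMeasurementEvents(instrument);
    };
    listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) => seen.Add(value));
    listener.Start();

    var counter = meter.CreateCounter<long>("mxgateway.session.created", unit: "1");
    counter.Add(1);
    return seen;
}

// A subscription left on the old name sees nothing once the meter is renamed.
Console.WriteLine(Collect("MxGateway.Server", "ZB.MOM.WW.MxGateway").Count);    // 0
Console.WriteLine(Collect("ZB.MOM.WW.MxGateway", "ZB.MOM.WW.MxGateway").Count); // 1
```

In practice this suggests changing the meter name and the registered name in the same commit, so that no build ships with the two out of step.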
+ +If an app has secondary meters (e.g. a library component with its own meter), those follow +the same pattern: `ZB.MOM.WW.<App>.<Component>`. + +--- + +## 2. Instrument name + +Instrument names follow the pattern `<app>.<subsystem>.<event>`, all lower-case, +dot-separated: + +``` +<app> := short app identifier — otopcua | mxgateway | scadabridge +<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ... +<event> := what happened or is measured — applied | count | duration | errors | active | ... +``` + +**Examples:** + +| Instrument name | App | Meaning | +|---|---|---| +| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space | +| `otopcua.tag.subscriptions` | OtOpcUa | Active OPC UA tag subscriptions | +| `mxgateway.session.active` | MxGateway | Active MxAccess sessions | +| `mxgateway.worker.call.duration` | MxGateway | gRPC call duration to the x86 worker | +| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL | + +**Rules:** + +1. All lower-case. No camelCase, no PascalCase, no hyphens. +2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the + subsystem warrants a sub-area (e.g. `mxgateway.worker.call.duration`). +3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`, + `duration`), not implementation details (`method_called`, `loop_iteration`). +4. Counters: past-tense or noun (`received`, `errors`, `applied`). + UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`). + Histograms: `duration` or a measured quantity noun (`size`, `lag`). + +--- + +## 3. Units + +### Duration — seconds (mandatory) + +**All duration histograms MUST use seconds** (`"s"`). This is the OpenTelemetry semantic +convention (`UCUM`: `s`). Backends and dashboards assume seconds; mixing units breaks +aggregations across apps. + +> ⚠ **MxGateway convergence item:** `GatewayMetrics.cs` defines three histograms with unit +> `"ms"` (`CommandDuration`, `EventDuration`, `WorkerCallDuration`). 
These must be migrated +> to `"s"` on adoption. Values must also be converted (divide by 1 000 at the call site). +> Track existing Prometheus `recording_rule`/dashboard changes — any dashboard panel that +> reads these histograms in `ms` will need updating. Until migration is complete, annotate +> the instruments with `// CONVERGENCE: ms→s pending`. + +### Other units + +| Quantity | Unit string | Notes | +|---|---|---| +| Duration | `"s"` | Mandatory — see above | +| Size / bytes | `"By"` | UCUM bytes | +| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred | +| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts | + +--- + +## 4. Resource attribute set (shared across all three signals) + +The OTel `Resource` is built once by `AddZbTelemetry` (see [`SPEC.md`](SPEC.md) §2) and +attached to metrics, traces, and OTel-exported logs. The same `SiteId` and `NodeRole` values +populate Serilog enrichers, making a metric, a span, and a log line from the same node +joinable in any OTel-compatible backend. + +| OTel attribute | Type | Required | Notes | +|---|---|---|---| +| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` | +| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"` — do not override | +| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong | +| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments | +| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` | +| `host.name` | string | Auto | Always `Environment.MachineName`; never override | + +**Why `site.id` and `node.role` matter:** a ScadaBridge fleet runs N site clusters + one +central cluster, each on different hosts. 
Without `site.id` and `node.role`, metrics from a +site node and the central node are indistinguishable even if `host.name` differs. + +--- + +## 5. Standard instrumentation baseline + +Every app enables this baseline via `AddZbTelemetry`. No opt-out. These are community- +standard instrumentation packages; the overhead is negligible and the benefit (correlated +HTTP / gRPC request traces across the fleet) is high. + +| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today | +|---|---|---|---|---| +| ASP.NET Core | Traces + Metrics | ✅ | ✅ | — | +| HttpClient | Traces + Metrics | ✅ | ✅ | — | +| gRPC client | Traces | ✅ | — | — | +| .NET runtime | Metrics | ✅ | — | — | +| Process | Metrics | ✅ | — | — | + +OtOpcUa already enables all five. MxGateway and ScadaBridge add the missing ones through +`AddZbTelemetry`. No project removes any of these. + +--- + +## 6. Per-app instrument surface (bespoke — stays per project) + +These instruments are **not part of the shared library**. They document the existing bespoke +surface that each project registers through `o.Meters` / `o.ActivitySources` in `AddZbTelemetry`. 
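
Because these names stay per project, each repository could guard them with a small unit test against the naming rules in §2. The sketch below is one possible check, not part of the shared contract: `IsConventionalInstrumentName` is a hypothetical helper, and allowing underscores inside a segment is an assumption (the rules only rule out upper-case letters and hyphens).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper. Accepts names with at least three dot-separated segments,
// each starting with a lower-case letter and containing only [a-z0-9_].
// Assumption: underscores are tolerated; tighten the pattern if the convention says otherwise.
static bool IsConventionalInstrumentName(string name)
{
    return Regex.IsMatch(name, @"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){2,}$");
}

Console.WriteLine(IsConventionalInstrumentName("otopcua.deploy.applied"));         // True
Console.WriteLine(IsConventionalInstrumentName("mxgateway.worker.call.duration")); // True
Console.WriteLine(IsConventionalInstrumentName("MxGateway.Session.Active"));       // False
Console.WriteLine(IsConventionalInstrumentName("mxgateway.sessions"));             // False
```

The check covers rules 1 and 2 only; whether the final segment is a sensible event noun (rules 3 and 4) still needs a human reviewer.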
+ +### 6.1 OtOpcUa — `ZB.MOM.WW.OtOpcUa` meter + +Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs` + +| Instrument | Kind | Unit | Description | +|---|---|---|---| +| `otopcua.deploy.applied` | Counter | `"1"` | Galaxy deploy events applied to the OPC UA address space | +| `otopcua.deploy.failed` | Counter | `"1"` | Deploy events that failed processing | +| `otopcua.tag.subscriptions` | UpDownCounter | `"1"` | Active OPC UA tag subscriptions | +| `otopcua.tag.reads` | Counter | `"1"` | Tag read operations | +| `otopcua.tag.writes` | Counter | `"1"` | Tag write operations | +| `otopcua.session.active` | UpDownCounter | `"1"` | Active OPC UA sessions | +| `otopcua.connection.gateway` | UpDownCounter | `"1"` | Active gRPC channels to MxAccessGateway | + +**ActivitySources (spans):** + +| Source name | Span(s) | +|---|---| +| `ZB.MOM.WW.OtOpcUa` | `DeployWatcher.Apply`, `GalaxyDriver.BrowseHierarchy` | + +All durations already use `"s"` — no convergence item for OtOpcUa. 
+ +### 6.2 MxGateway — `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`) + +Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs` + +**Counters (13):** + +| Instrument | Unit | Description | +|---|---|---| +| `mxgateway.session.created` | `"1"` | MxAccess sessions opened | +| `mxgateway.session.closed` | `"1"` | MxAccess sessions closed | +| `mxgateway.session.errors` | `"1"` | Session creation/teardown errors | +| `mxgateway.command.invoked` | `"1"` | MxAccess command invocations | +| `mxgateway.command.errors` | `"1"` | Command invocation errors | +| `mxgateway.event.received` | `"1"` | MxAccess events received from worker | +| `mxgateway.event.errors` | `"1"` | Event processing errors | +| `mxgateway.worker.started` | `"1"` | x86 worker processes started | +| `mxgateway.worker.stopped` | `"1"` | x86 worker processes stopped | +| `mxgateway.worker.errors` | `"1"` | Worker communication errors | +| `mxgateway.galaxy.browse.requests` | `"1"` | Galaxy Repository browse RPCs | +| `mxgateway.galaxy.browse.errors` | `"1"` | Galaxy browse errors | +| `mxgateway.auth.failures` | `"1"` | Authentication failures | + +**Histograms (3):** + +| Instrument | Unit | Current unit | Convergence | +|---|---|---|---| +| `mxgateway.command.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption | +| `mxgateway.event.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption | +| `mxgateway.worker.call.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption | + +**Gauges (4):** + +| Instrument | Unit | Description | +|---|---|---| +| `mxgateway.session.active` | `"1"` | Current active MxAccess sessions | +| `mxgateway.worker.active` | `"1"` | Current running x86 worker processes | +| `mxgateway.worker.memory` | `"By"` | Worker process RSS | +| `mxgateway.galaxy.nodes.cached` | `"1"` | Galaxy Repository nodes in browse cache | + +No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource +is left per-project (deferred to GAPS backlog). 
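
The `ms`→`s` convergence flagged in the histogram table comes down to two changes per instrument: the unit string and the value recorded at the call site. A minimal sketch follows, assuming a `Histogram<double>`; the actual field types and call sites in `GatewayMetrics.cs` are not shown in this document, and `RecordSeconds` is a hypothetical helper.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Stand-in for the real GatewayMetrics wiring; only the meter and instrument
// names are taken from the tables above.
var meter = new Meter("ZB.MOM.WW.MxGateway");
var commandDuration = meter.CreateHistogram<double>("mxgateway.command.duration", unit: "s");

// Hypothetical helper: converts the elapsed time to seconds and records it.
// Under the old "ms" unit the recorded value would have been elapsed.TotalMilliseconds.
static double RecordSeconds(Histogram<double> histogram, TimeSpan elapsed)
{
    double seconds = elapsed.TotalSeconds;
    histogram.Record(seconds);
    return seconds;
}

long start = Stopwatch.GetTimestamp();
// ... dispatch the command here ...
RecordSeconds(commandDuration, Stopwatch.GetElapsedTime(start));
```

Taking a `TimeSpan` rather than a raw number keeps the unit decision in one place, so a call site cannot pass milliseconds by mistake.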
+ +### 6.3 ScadaBridge — `ZB.MOM.WW.ScadaBridge` meter + +No meter or instruments exist today (`OpenTelemetry.Api` is a dangling ref). The target +meter name `ZB.MOM.WW.ScadaBridge` is reserved. Instruments are defined as part of the +ScadaBridge adoption tracked in [`../GAPS.md`](../GAPS.md). + +--- + +## Consequences and convergence items (accepted) + +| Item | Scope | Severity | +|---|---|---| +| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking — requires relabeling in Prometheus config and dashboards | +| MxGateway histogram unit `ms` → `s` (3 instruments) | MxGateway adoption | Breaking — values change by factor 1 000; dashboards need updating | +| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge — define from scratch | + +All three items are tracked as backlog entries in [`../GAPS.md`](../GAPS.md). The ms→s +migration is the highest-priority convergence item because leaving it unresolved means +MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana +workspace. diff --git a/components/observability/spec/SPEC.md b/components/observability/spec/SPEC.md new file mode 100644 index 0000000..e5f460a --- /dev/null +++ b/components/observability/spec/SPEC.md @@ -0,0 +1,177 @@ +# Observability — normalized target spec + +Status: **Draft**. The single design the sister projects converge on. Derived from the +three code-verified current-state docs (`../current-state/`). Goal is *path to shared code* +(`../shared-contract/ZB.MOM.WW.Telemetry.md`), so each normalized section maps to a shared +library seam. + +## 0. 
Scope + +**Normalized here:** one OpenTelemetry bootstrap across all three signals (metrics + traces + +logs) via a single `AddZbTelemetry` extension; the shared `Resource` attribute set +(`service.name` / `service.namespace` / `service.version` / `site.id` / `node.role` / +`host.name`) that makes every node distinguishable in a collector; standard instrumentation +everyone enables (ASP.NET Core, HttpClient, gRPC client, runtime, process meters); exporter +conventions (Prometheus scrape endpoint default, OTLP opt-in); a shared Serilog bootstrap +with identity enrichers (`SiteId`, `NodeRole`, `NodeHostname`) bound from the same options +object as the OTel Resource (metrics and logs therefore carry identical dimensions); a +`TraceContextEnricher` that stamps `trace_id`/`span_id` from `Activity.Current` onto every +Serilog event, enabling log↔trace correlation; an `ILogRedactor` redaction seam. + +**Explicitly NOT normalized** (domain-specific — keep per project): each app's actual +instruments — `otopcua.*` meters and spans, `mxgateway.*` counters/histograms/gauges — they +are registered *through* the shared bootstrap but their names and semantics remain +bespoke (see [`METRIC-CONVENTIONS.md`](METRIC-CONVENTIONS.md) §6); the redaction *policy* +(which field names, which command types) — only the `ILogRedactor` seam is shared, each +project supplies its own implementation; the MxGateway net48 x86 worker's `IWorkerLogger` +(stderr key=value format, out-of-process, out of scope). + +## 1. OpenTelemetry pipeline — `AddZbTelemetry` + +A single `IHostApplicationBuilder` extension is the front door for all three OTel signals. 
+It wires the shared `Resource`, registers standard instrumentation, and configures the +selected exporter: + +```csharp +builder.AddZbTelemetry(o => +{ + o.ServiceName = "mxgateway"; // populates Resource service.name + o.ServiceNamespace = "ZB.MOM.WW"; // constant across the fleet (default) + o.ServiceVersion = "1.0.0"; // populated from AssemblyInformationalVersion + o.SiteId = cfg.SiteId; // Resource site.id + Serilog SiteId property + o.NodeRole = cfg.NodeRole; // Resource node.role + Serilog NodeRole property + o.Meters = ["MxGateway.Server"]; // app's own Meter name(s) + o.ActivitySources = ["MxGateway.Server"]; // app's own ActivitySource name(s) + o.Exporter = ZbExporter.Prometheus; // default; ZbExporter.Otlp opt-in + // o.OtlpEndpoint = "http://collector:4317"; // required when Exporter = Otlp +}); + +app.MapZbMetrics(); // mounts Prometheus /metrics scrape endpoint +``` + +This is the headline fix: nobody in the fleet sets a `Resource` or `service.name` today, +making every node indistinguishable in a collector. Every project must call `AddZbTelemetry` +to be observable. + +## 2. Shared Resource + +The OTel `Resource` attached to all three signals is built from `ZbTelemetryOptions`: + +| OTel attribute | Options property | Notes | +|---|---|---| +| `service.name` | `ServiceName` | Required. Lower-case short identifier (`otopcua`, `mxgateway`, `scadabridge`) | +| `service.namespace` | `ServiceNamespace` | Default `"ZB.MOM.WW"` — constant across the fleet | +| `service.version` | `ServiceVersion` | Optional; recommend populating from `AssemblyInformationalVersion` | +| `site.id` | `SiteId` | Optional; identifies the physical/logical site | +| `node.role` | `NodeRole` | Optional; e.g. 
`"central"`, `"site"`, `"hub"` | +| `host.name` | _(auto)_ | Always populated from `Environment.MachineName` | + +The same `SiteId` and `NodeRole` values are passed to the Serilog enrichers (§5) so a +metric, a span, and a log line from the same node carry identical dimensions and join up in +any OTel-compatible backend. + +## 3. Standard instrumentation + +`AddZbTelemetry` enables the following instrumentation for all projects. Any project that +already enables a subset gets it consolidated; no project may skip this baseline: + +| Instrumentation | Package | Signal | +|---|---|---| +| ASP.NET Core | `OpenTelemetry.Instrumentation.AspNetCore` | Traces + Metrics | +| HttpClient | `OpenTelemetry.Instrumentation.Http` | Traces + Metrics | +| gRPC client | `OpenTelemetry.Instrumentation.GrpcNetClient` | Traces | +| .NET runtime | `OpenTelemetry.Instrumentation.Runtime` | Metrics | +| Process | `OpenTelemetry.Instrumentation.Process` | Metrics | + +App-specific `Meter` names and `ActivitySource` names are registered via `o.Meters` and +`o.ActivitySources`. This is how MxGateway's hand-rolled `GatewayMetrics` finally gets an +export path instead of dying in an in-memory `GetSnapshot()`. + +## 4. Exporter conventions + +`ZbTelemetryOptions.Exporter` selects the export path: + +| Value | Behaviour | +|---|---| +| `ZbExporter.Prometheus` | Mounts a Prometheus `/metrics` scrape endpoint via `app.MapZbMetrics()`. Default for all three apps — consistent with OtOpcUa's existing `/metrics`. | +| `ZbExporter.Otlp` | Exports to an OTLP endpoint specified by `o.OtlpEndpoint` (gRPC, `http://collector:4317`). Opt-in path to a real OTel Collector; coexists with Prometheus. | + +Both exporters carry the shared `Resource`. OTLP is the path to a real backend (Tempo, +Prometheus-remote-write, Loki); Prometheus covers the "scrape from the node" case that all +three apps currently use or aspire to. + +## 5. 
Serilog logging stack + +`AddZbSerilog` is a companion extension in the `.Serilog` package. It replaces each +project's bespoke logging bootstrap with a shared two-stage pattern: + +**Stage 1 (bootstrap logger):** a minimal `Log.Logger` for startup errors before the +`IConfiguration` is available. Writes to console only. + +**Stage 2 (application logger):** reads sinks and overrides from `IConfiguration` +(`ReadFrom.Configuration`) and applies a set of fixed enrichers: + +| Enricher | Property name | Source | +|---|---|---| +| `ZbLogEnricherNames.SiteId` | `"SiteId"` | `ZbTelemetryOptions.SiteId` | +| `ZbLogEnricherNames.NodeRole` | `"NodeRole"` | `ZbTelemetryOptions.NodeRole` | +| `ZbLogEnricherNames.NodeHostname` | `"NodeHostname"` | `Environment.MachineName` | +| `TraceContextEnricher` | `trace_id`, `span_id` | `Activity.Current` | +| `RedactionEnricher` | _(project-defined fields)_ | `ILogRedactor` implementation | + +The three identity properties (`SiteId`, `NodeRole`, `NodeHostname`) are bound from the +same `ZbTelemetryOptions` object as the OTel `Resource`, so logs and metrics/traces carry +identical dimensions. When no `Activity.Current` is present (e.g. background services, +startup), `TraceContextEnricher` emits nothing — it does not inject empty or zero values. + +`MinimumLevel` is set explicitly in code (default `Information`) and can be overridden via +`IConfiguration` (`Serilog:MinimumLevel`). Sinks are fully config-driven: +`ReadFrom.Configuration` reads `Serilog:WriteTo` from `appsettings.json` / environment. + +OTel log export is wired in the same call: logs flow through the OTel pipeline with the +same `Resource` attached, making all three signals (metrics / traces / logs) available in a +single backend. + +## 6. 
Redaction seam — `ILogRedactor` + +`ILogRedactor` is a single-method interface that receives the mutable log-event property +dictionary and scrubs any fields that must not leave the process: + +```csharp +public interface ILogRedactor +{ + void Redact(IDictionary properties); +} +``` + +`RedactionEnricher` applies a registered `ILogRedactor` on every log event. The seam is +shared; the **policy** is per-project (which field names, which command types, which +classification levels). MxGateway's existing `GatewayLogRedactor` is the reference +implementation; it migrates to this seam during adoption. If no `ILogRedactor` is +registered, `RedactionEnricher` is a no-op. + +This preserves the operational property MxGateway already has (secrets never leave the +process in log events) while making the plumbing reusable. + +## 7. Per-project migration + +| Project | Current state | Primary gaps | What normalizes | +|---|---|---|---| +| **OtOpcUa** | Full OTel SDK (`WithMetrics` + `WithTracing`); Prometheus `/metrics`; Serilog bootstrap; 7 instruments + 2 spans. | No `Resource` / `service.name` anywhere; no trace↔log correlation; no `SiteId`/`NodeRole` enrichers. | Call `AddZbTelemetry` (adds Resource; consolidates standard instrumentation); call `AddZbSerilog` (adds `TraceContextEnricher` + identity enrichers); migrate existing Serilog bootstrap to shared two-stage pattern. | +| **MxGateway** | Hand-rolled `GatewayMetrics` (13 counters / 3 histograms `ms` / 4 gauges); in-memory snapshot only — no export; MEL logging with `GatewayLogScope` correlation + `GatewayLogRedactor`; no OTel SDK. | No OTel SDK; no export; `ms` histograms diverge from OTel semconv (`s`); MEL → Serilog migration; no Resource. | Call `AddZbTelemetry` (wires OTel SDK around existing `GatewayMetrics` — finally exports); call `AddZbSerilog` (replaces MEL; re-expresses `GatewayLogScope` as `LogContext.PushProperty`; moves `GatewayLogRedactor` behind `ILogRedactor`). 
Duration unit convergence (`ms`→`s`) tracked in GAPS. **This is the one adoption done now.** | +| **ScadaBridge** | `OpenTelemetry.Api` ref only (dangling — CVE-patch origin, zero usage); Serilog bootstrap (`LoggerConfigurationFactory`) with `SiteId`/`NodeRole`/`NodeHostname` enrichers. | No OTel SDK; no metrics; no tracing; no export; no trace↔log correlation. ScadaBridge's enricher property names are already the target names — migration is additive. | Call `AddZbTelemetry` (adds OTel SDK + metrics + traces + export); call `AddZbSerilog` (consolidates `LoggerConfigurationFactory`; adds `TraceContextEnricher`). | + +> The MxGateway logging migration (`MEL → Serilog`, re-expressing `GatewayLogRedactor` +> behind `ILogRedactor`) is the **only sister-repo touch in scope for this release**. OtOpcUa +> and ScadaBridge adoption is deferred to the follow-on tracked in +> [`../GAPS.md`](../GAPS.md). + +## 8. Acceptance (what "converged" means) + +A project is converged when: (a) it calls `builder.AddZbTelemetry(o => ...)` with all +required Resource attributes populated; (b) it calls `app.MapZbMetrics()` (or configures +OTLP); (c) it calls `builder.AddZbSerilog(...)` and the `TraceContextEnricher` stamps +`trace_id`/`span_id` on every log event emitted under an active `Activity`; (d) its +`ILogRedactor` implementation (if applicable) is registered and applied by `RedactionEnricher`; +(e) every node in the fleet is distinguishable by `service.name` + `site.id` + `node.role` +in a collector or log aggregator.
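
Criterion (c) can be exercised without Serilog by checking what the ambient `Activity` exposes. The sketch below uses only `System.Diagnostics` and mirrors the behaviour described for `TraceContextEnricher` (emit `trace_id`/`span_id` under an active `Activity`, emit nothing otherwise). It is not the enricher itself, and `TraceContextProperties` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper: returns the properties a trace-context enricher would add
// for the current Activity, or an empty dictionary when none is active.
static IReadOnlyDictionary<string, string> TraceContextProperties()
{
    var props = new Dictionary<string, string>();
    var current = Activity.Current;
    if (current is null)
        return props; // no active span: add nothing rather than empty or zero values

    props["trace_id"] = current.TraceId.ToHexString();
    props["span_id"] = current.SpanId.ToHexString();
    return props;
}

// Outside any Activity there is nothing to stamp.
Console.WriteLine(TraceContextProperties().Count); // 0 when run standalone

// Starting an Activity makes it Activity.Current until it is disposed.
using (var span = new Activity("demo-operation").Start())
{
    foreach (var (key, value) in TraceContextProperties())
        Console.WriteLine($"{key}={value}");
}
```

An acceptance test along these lines would assert that a log event written inside the `using` block carries both properties and that one written outside it carries neither.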