215a646e35
C1: NodeHostname is AUTO throughout. Shared-contract AddZbSerilog doc comment now reads
"SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from Environment.MachineName (auto)".
SPEC.md §0 and §5 prose updated to match. ScadaBridge adoption snippet no longer sets
o.NodeHostname (removed; NodeHostname is auto, not caller-supplied).
C2: METRIC-CONVENTIONS §6.1 OtOpcUa instrument table replaced with code-verified set:
counters otopcua.deploy.applied / driver.lifecycle / virtualtag.eval / scriptedalarm.transition /
opcua.sink.write / redundancy.service_level_change; histogram otopcua.deploy.apply.duration (s);
ActivitySource ZB.MOM.WW.OtOpcUa with spans otopcua.deploy.apply + otopcua.opcua.address_space_rebuild.
Removed invented names (deploy.failed, tag.subscriptions, tag.reads, tag.writes, session.active,
connection.gateway).
C3: METRIC-CONVENTIONS §6.2 MxGateway instrument table replaced with code-verified names from
GatewayMetrics.cs: 13 counters (sessions.opened/closed, commands.started/succeeded/failed,
events.received, queues.overflows, faults, workers.killed/exited, heartbeats.failed,
grpc.streams.disconnected, retries.attempted); 3 histograms ms (workers.startup.duration,
commands.duration, events.stream_send.duration); 4 gauges (sessions.open, workers.running,
events.worker_queue.depth, events.grpc_stream_queue.depth). Removed invented names.
m3: §2 example table replaced mxgateway.session.active + mxgateway.worker.call.duration
(invented) with mxgateway.sessions.open + mxgateway.commands.duration (real). Also fixed
the §2 rule-2 body text example which referenced mxgateway.worker.call.duration.
I4: §5 standard instrumentation table corrected — OtOpcUa now shows ⛔ not added for all
five baseline instrumentations, matching current-state/otopcua. All three projects lack
standard instrumentation today; AddZbTelemetry adds it on adoption.
I1+m1: GAPS.md "Decisions still open" — removed the two settled questions (Prometheus-default
and ms→s/meter-rename bundling). Moved them to a new "Decisions settled" section with explicit
resolution notes. One genuinely open question remains (SiteId/NodeRole config binding path).
I2: SPEC.md §5 AddZbSerilog: added note that AddZbSerilog reads Serilog:MinimumLevel from
IConfiguration; callers with a different config key (e.g. ScadaBridge:Logging:MinimumLevel)
apply that override themselves — stays per-project. Shared-contract doc comment updated to match.
I3: MxAccessGateway adoption plan Meters = ["MxGateway.Server"] annotated as temporary with
note to update to ZB.MOM.WW.MxGateway when Gap N1 (Meter-rename) is closed.
m2: SPEC.md §1 now notes AddZbTelemetry also has an IServiceCollection overload for non-standard
hosts, with the IHostApplicationBuilder overload as the primary path.
191 lines
12 KiB
Markdown
# Observability — normalized target spec

Status: **Draft**. The single design the sister projects converge on. Derived from the
three code-verified current-state docs (`../current-state/`). Goal is *path to shared code*
(`../shared-contract/ZB.MOM.WW.Telemetry.md`), so each normalized section maps to a shared
library seam.

## 0. Scope

**Normalized here:** one OpenTelemetry bootstrap across all three signals (metrics + traces +
logs) via a single `AddZbTelemetry` extension; the shared `Resource` attribute set
(`service.name` / `service.namespace` / `service.version` / `site.id` / `node.role` /
`host.name`) that makes every node distinguishable in a collector; standard instrumentation
everyone enables (ASP.NET Core, HttpClient, gRPC client, runtime, process meters); exporter
conventions (Prometheus scrape endpoint default, OTLP opt-in); a shared Serilog bootstrap
whose identity enrichers (`SiteId` and `NodeRole` from `ZbTelemetryOptions`, `NodeHostname`
automatically from `Environment.MachineName`) match the OTel Resource dimensions, so metrics
and logs carry identical dimensions; a `TraceContextEnricher` that stamps
`trace_id`/`span_id` from `Activity.Current` onto every Serilog event emitted under an
active `Activity`, enabling log↔trace correlation; an `ILogRedactor` redaction seam.

**Explicitly NOT normalized** (domain-specific, kept per project): each app's actual
instruments (`otopcua.*` meters and spans, `mxgateway.*` counters/histograms/gauges), which
are registered *through* the shared bootstrap while their names and semantics remain
bespoke (see [`METRIC-CONVENTIONS.md`](METRIC-CONVENTIONS.md) §4); the redaction *policy*
(which field names, which command types), where only the `ILogRedactor` seam is shared and
each project supplies its own implementation; the MxGateway net48 x86 worker's
`IWorkerLogger` (stderr key=value format, out-of-process, out of scope).

## 1. OpenTelemetry pipeline — `AddZbTelemetry`

A single `IHostApplicationBuilder` extension is the front door for all three OTel signals.
It wires the shared `Resource`, registers standard instrumentation, and configures the
selected exporter:

```csharp
builder.AddZbTelemetry(o =>
{
    o.ServiceName      = "mxgateway";            // populates Resource service.name
    o.ServiceNamespace = "ZB.MOM.WW";            // constant across the fleet (default)
    o.ServiceVersion   = "1.0.0";                // optional; recommend AssemblyInformationalVersion
    o.SiteId           = cfg.SiteId;             // Resource site.id + Serilog SiteId property
    o.NodeRole         = cfg.NodeRole;           // Resource node.role + Serilog NodeRole property
    o.Meters           = ["MxGateway.Server"];   // app's own Meter name(s)
    o.ActivitySources  = ["MxGateway.Server"];   // app's own ActivitySource name(s)
    o.Exporter         = ZbExporter.Prometheus;  // default; ZbExporter.Otlp opt-in
    // o.OtlpEndpoint  = "http://collector:4317"; // required when Exporter = Otlp
});

var app = builder.Build();  // assumes builder is a WebApplicationBuilder
app.MapZbMetrics();         // mounts Prometheus /metrics scrape endpoint
```

This is the headline fix: no project in the fleet sets a `Resource` or `service.name` today,
which makes every node indistinguishable in a collector. Every project must call
`AddZbTelemetry` to be observable.

> **`IServiceCollection` overload:** `AddZbTelemetry` also has an `IServiceCollection`-based
> overload for host configurations where `IHostApplicationBuilder` is not available (detailed in
> the shared-contract). The `IHostApplicationBuilder` overload is the primary path for all three
> apps on .NET 10.

## 2. Shared Resource

The OTel `Resource` attached to all three signals is built from `ZbTelemetryOptions`:

| OTel attribute | Options property | Notes |
|---|---|---|
| `service.name` | `ServiceName` | Required. Lower-case short identifier (`otopcua`, `mxgateway`, `scadabridge`) |
| `service.namespace` | `ServiceNamespace` | Default `"ZB.MOM.WW"` — constant across the fleet |
| `service.version` | `ServiceVersion` | Optional; recommend populating from `AssemblyInformationalVersion` |
| `site.id` | `SiteId` | Optional; identifies the physical/logical site |
| `node.role` | `NodeRole` | Optional; e.g. `"central"`, `"site"`, `"hub"` |
| `host.name` | _(auto)_ | Always populated from `Environment.MachineName` |

The same `SiteId` and `NodeRole` values are passed to the Serilog enrichers (§5) so a
metric, a span, and a log line from the same node carry identical dimensions and join up in
any OTel-compatible backend.
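
The mapping in the table above can be sketched with BCL types only. This is a minimal illustration, not the shared library's code: `TelemetryOptionsSketch` and `ResourceAttributesSketch` are hypothetical names, and omitting unset optional attributes (rather than emitting empty strings) is an assumption the spec does not state.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for ZbTelemetryOptions, reduced to the Resource-related properties.
public sealed record TelemetryOptionsSketch(
    string ServiceName,
    string ServiceNamespace = "ZB.MOM.WW",
    string? ServiceVersion = null,
    string? SiteId = null,
    string? NodeRole = null);

public static class ResourceAttributesSketch
{
    // Builds the attribute set from the table: required, defaulted, optional, and auto values.
    public static IReadOnlyDictionary<string, object> Build(TelemetryOptionsSketch o)
    {
        if (string.IsNullOrWhiteSpace(o.ServiceName))
            throw new ArgumentException("ServiceName is required.", nameof(o));

        var attrs = new Dictionary<string, object>
        {
            ["service.name"] = o.ServiceName,
            ["service.namespace"] = o.ServiceNamespace,
            ["host.name"] = Environment.MachineName, // auto, never caller-supplied
        };

        // Assumption: optional attributes are left out when unset.
        if (!string.IsNullOrEmpty(o.ServiceVersion)) attrs["service.version"] = o.ServiceVersion;
        if (!string.IsNullOrEmpty(o.SiteId)) attrs["site.id"] = o.SiteId;
        if (!string.IsNullOrEmpty(o.NodeRole)) attrs["node.role"] = o.NodeRole;
        return attrs;
    }
}
```

Passing the same options instance to both the Resource builder and the log enrichers is what keeps the dimensions identical across signals.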

## 3. Standard instrumentation

`AddZbTelemetry` enables the following instrumentation for all projects. Any project that
already enables a subset gets it consolidated; no project may skip this baseline:

| Instrumentation | Package | Signal |
|---|---|---|
| ASP.NET Core | `OpenTelemetry.Instrumentation.AspNetCore` | Traces + Metrics |
| HttpClient | `OpenTelemetry.Instrumentation.Http` | Traces + Metrics |
| gRPC client | `OpenTelemetry.Instrumentation.GrpcNetClient` | Traces |
| .NET runtime | `OpenTelemetry.Instrumentation.Runtime` | Metrics |
| Process | `OpenTelemetry.Instrumentation.Process` | Metrics |

App-specific `Meter` names and `ActivitySource` names are registered via `o.Meters` and
`o.ActivitySources`. This is how MxGateway's hand-rolled `GatewayMetrics` finally gets an
export path instead of dying in an in-memory `GetSnapshot()`.
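
Why the `Meter` name has to be registered can be shown with the BCL's `MeterListener`, which is the same subscription mechanism an exporting SDK builds on. The sketch below is illustrative only: `MeterRegistrationSketch`, `Unregistered.Meter`, and the instrument names are placeholders, and the listener stands in for the real export pipeline.

```csharp
#nullable enable
using System;
using System.Diagnostics.Metrics;

public static class MeterRegistrationSketch
{
    // Returns the sum of measurements seen from meters whose names appear in enabledMeters.
    public static long Observe(string[] enabledMeters)
    {
        using var appMeter = new Meter("MxGateway.Server");
        using var otherMeter = new Meter("Unregistered.Meter"); // hypothetical, never enabled
        var opened = appMeter.CreateCounter<long>("mxgateway.sessions.opened");
        var ignored = otherMeter.CreateCounter<long>("example.ignored");

        long observed = 0;
        using var listener = new MeterListener();

        // Subscribe only to instruments whose Meter name was explicitly enabled.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (Array.IndexOf(enabledMeters, instrument.Meter.Name) >= 0)
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => observed += value);
        listener.Start();

        opened.Add(2);  // counted only if "MxGateway.Server" is enabled
        ignored.Add(5); // recorded by the app, but nobody is listening
        return observed;
    }
}
```

Measurements on a meter that no listener enabled are simply dropped, which is the situation an unregistered `Meter` name would be in.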

## 4. Exporter conventions

`ZbTelemetryOptions.Exporter` selects the export path:

| Value | Behaviour |
|---|---|
| `ZbExporter.Prometheus` | Mounts a Prometheus `/metrics` scrape endpoint via `app.MapZbMetrics()`. Default for all three apps — consistent with OtOpcUa's existing `/metrics`. |
| `ZbExporter.Otlp` | Exports to an OTLP endpoint specified by `o.OtlpEndpoint` (gRPC, `http://collector:4317`). Opt-in path to a real OTel Collector; coexists with Prometheus. |

Both exporters carry the shared `Resource`. OTLP is the path to a real backend (Tempo,
Prometheus-remote-write, Loki); Prometheus covers the "scrape from the node" case that all
three apps currently use or aspire to.
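
For orientation, sections 1 to 4 could map onto the OpenTelemetry .NET SDK roughly as follows. This is a sketch under assumptions, not the shared library's implementation: `TelemetryWiringSketch` and `AddTelemetrySketch` are hypothetical names, log export is left out, and treating a non-null `otlpEndpoint` as "add OTLP alongside Prometheus" is one reading of "coexists". It assumes the `OpenTelemetry.Extensions.Hosting` package, the instrumentation packages from §3, and the Prometheus ASP.NET Core and OTLP exporter packages are referenced.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Hypothetical helper; NOT the real AddZbTelemetry.
public static class TelemetryWiringSketch
{
    public static IServiceCollection AddTelemetrySketch(
        this IServiceCollection services,
        string serviceName,
        string[] meters,
        string[] activitySources,
        Uri? otlpEndpoint = null) // null = Prometheus scrape only
    {
        services.AddOpenTelemetry()
            // One Resource shared by every signal configured below.
            .ConfigureResource(r => r
                .AddService(serviceName, serviceNamespace: "ZB.MOM.WW")
                .AddAttributes(new[]
                {
                    new KeyValuePair<string, object>("host.name", Environment.MachineName),
                }))
            .WithMetrics(m =>
            {
                m.AddMeter(meters)                 // app-specific Meter names
                 .AddAspNetCoreInstrumentation()
                 .AddHttpClientInstrumentation()
                 .AddRuntimeInstrumentation()
                 .AddProcessInstrumentation()
                 .AddPrometheusExporter();
                if (otlpEndpoint is { } endpoint)
                    m.AddOtlpExporter(o => o.Endpoint = endpoint);
            })
            .WithTracing(t =>
            {
                t.AddSource(activitySources)       // app-specific ActivitySource names
                 .AddAspNetCoreInstrumentation()
                 .AddHttpClientInstrumentation()
                 .AddGrpcClientInstrumentation();
                if (otlpEndpoint is { } endpoint)
                    t.AddOtlpExporter(o => o.Endpoint = endpoint);
            });
        return services;
    }
}
```

On the application side, the Prometheus exporter package's `app.MapPrometheusScrapingEndpoint()` is the call a `MapZbMetrics` wrapper would plausibly delegate to; that delegation is an assumption as well.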

## 5. Serilog logging stack

`AddZbSerilog` is a companion extension in the `.Serilog` package. It replaces each
project's bespoke logging bootstrap with a shared two-stage pattern:

**Stage 1 (bootstrap logger):** a minimal `Log.Logger` for startup errors before the
`IConfiguration` is available. Writes to console only.

**Stage 2 (application logger):** reads sinks and overrides from `IConfiguration`
(`ReadFrom.Configuration`) and applies a set of fixed enrichers:

| Enricher | Property name | Source |
|---|---|---|
| `ZbLogEnricherNames.SiteId` | `"SiteId"` | `ZbTelemetryOptions.SiteId` |
| `ZbLogEnricherNames.NodeRole` | `"NodeRole"` | `ZbTelemetryOptions.NodeRole` |
| `ZbLogEnricherNames.NodeHostname` | `"NodeHostname"` | `Environment.MachineName` |
| `TraceContextEnricher` | `trace_id`, `span_id` | `Activity.Current` |
| `RedactionEnricher` | _(project-defined fields)_ | `ILogRedactor` implementation |

`SiteId` and `NodeRole` are bound from the same `ZbTelemetryOptions` object as the OTel
`Resource`; `NodeHostname` is populated automatically from `Environment.MachineName` (not a
caller-supplied option). All three identity properties appear on logs and metrics/traces alike,
so signals from the same node carry identical dimensions. When no `Activity.Current` is
present (e.g. background services, startup), `TraceContextEnricher` emits nothing — it does
not inject empty or zero values.

`MinimumLevel` is set explicitly in code (default `Information`) and can be overridden via
`IConfiguration` (`Serilog:MinimumLevel`). Sinks are fully config-driven:
`ReadFrom.Configuration` reads `Serilog:WriteTo` from `appsettings.json` / environment.

> **Per-project config paths:** `AddZbSerilog` reads `Serilog:MinimumLevel` from `IConfiguration`.
> Callers that bind MinimumLevel from a different key (e.g. ScadaBridge's
> `ScadaBridge:Logging:MinimumLevel`) apply that override themselves before or after calling
> `AddZbSerilog`. The config key for MinimumLevel remains per-project; `AddZbSerilog` is not
> parameterized on it.

OTel log export is wired in the same call: logs flow through the OTel pipeline with the
same `Resource` attached, making all three signals (metrics / traces / logs) available in a
single backend.
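
The `TraceContextEnricher` rule described in this section (stamp `trace_id`/`span_id` only when an `Activity` is current, otherwise emit nothing) can be sketched without any Serilog types. `TraceContextSketch` is a hypothetical name; in a real enricher the returned pairs would be attached to the log event through Serilog's `ILogEventEnricher` plumbing, which is omitted here so the block runs on the BCL alone.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class TraceContextSketch
{
    // Returns the properties to stamp on a log event, or an empty list when no Activity is current.
    public static IReadOnlyList<KeyValuePair<string, string>> Capture()
    {
        var activity = Activity.Current;
        if (activity is null)
            return Array.Empty<KeyValuePair<string, string>>(); // no placeholder or zero ids

        return new[]
        {
            new KeyValuePair<string, string>("trace_id", activity.TraceId.ToHexString()),
            new KeyValuePair<string, string>("span_id", activity.SpanId.ToHexString()),
        };
    }
}
```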

## 6. Redaction seam — `ILogRedactor`

`ILogRedactor` is a single-method interface that receives the mutable log-event property
dictionary and scrubs any fields that must not leave the process:

```csharp
public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}
```

`RedactionEnricher` applies a registered `ILogRedactor` on every log event. The seam is
shared; the **policy** is per-project (which field names, which command types, which
classification levels). MxGateway's existing `GatewayLogRedactor` is the reference
implementation; it migrates to this seam during adoption. If no `ILogRedactor` is
registered, `RedactionEnricher` is a no-op.

This preserves the operational property MxGateway already has (secrets never leave the
process in log events) while making the plumbing reusable.
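
As an illustration of what a per-project policy behind this seam might look like, here is a minimal field-name redactor. It is a hypothetical example, not MxGateway's `GatewayLogRedactor`: the class name, the case-insensitive matching, and the `"***"` mask are all assumptions. The interface is repeated so the block compiles on its own.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Repeated from the spec so this sketch is self-contained.
public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}

// Hypothetical policy: mask any property whose name is on a deny list.
public sealed class FieldNameRedactor : ILogRedactor
{
    private const string Mask = "***";
    private readonly HashSet<string> _sensitive;

    public FieldNameRedactor(IEnumerable<string> sensitiveFieldNames) =>
        _sensitive = new HashSet<string>(sensitiveFieldNames, StringComparer.OrdinalIgnoreCase);

    public void Redact(IDictionary<string, object?> properties)
    {
        // Snapshot the keys so the dictionary is not written to while it is being enumerated.
        foreach (var key in properties.Keys.ToList())
        {
            if (_sensitive.Contains(key))
                properties[key] = Mask;
        }
    }
}
```

A project would register its own implementation with the container; how `RedactionEnricher` resolves it is defined by the shared contract, not by this sketch.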

## 7. Per-project migration

| Project | Current state | Primary gaps | What normalizes |
|---|---|---|---|
| **OtOpcUa** | Full OTel SDK (`WithMetrics` + `WithTracing`); Prometheus `/metrics`; Serilog bootstrap; 7 instruments + 2 spans. | No `Resource` / `service.name` anywhere; no trace↔log correlation; no `SiteId`/`NodeRole` enrichers. | Call `AddZbTelemetry` (adds Resource; consolidates standard instrumentation); call `AddZbSerilog` (adds `TraceContextEnricher` + identity enrichers); migrate existing Serilog bootstrap to shared two-stage pattern. |
| **MxGateway** | Hand-rolled `GatewayMetrics` (13 counters / 3 histograms `ms` / 4 gauges); in-memory snapshot only — no export; MEL logging with `GatewayLogScope` correlation + `GatewayLogRedactor`; no OTel SDK. | No OTel SDK; no export; `ms` histograms diverge from OTel semconv (`s`); MEL → Serilog migration; no Resource. | Call `AddZbTelemetry` (wires OTel SDK around existing `GatewayMetrics` — finally exports); call `AddZbSerilog` (replaces MEL; re-expresses `GatewayLogScope` as `LogContext.PushProperty`; moves `GatewayLogRedactor` behind `ILogRedactor`). Duration unit convergence (`ms`→`s`) tracked in GAPS. **This is the one adoption done now.** |
| **ScadaBridge** | `OpenTelemetry.Api` ref only (dangling — CVE-patch origin, zero usage); Serilog bootstrap (`LoggerConfigurationFactory`) with `SiteId`/`NodeRole`/`NodeHostname` enrichers. | No OTel SDK; no metrics; no tracing; no export; no trace↔log correlation. ScadaBridge's enricher property names are already the target names — migration is additive. | Call `AddZbTelemetry` (adds OTel SDK + metrics + traces + export); call `AddZbSerilog` (consolidates `LoggerConfigurationFactory`; adds `TraceContextEnricher`). |

> The MxGateway logging migration (`MEL → Serilog`, re-expressing `GatewayLogRedactor`
> behind `ILogRedactor`) is the **only sister-repo touch in scope for this release**. OtOpcUa
> and ScadaBridge adoption is deferred to the follow-on tracked in
> [`../GAPS.md`](../GAPS.md).

## 8. Acceptance (what "converged" means)

A project is converged when: (a) it calls `builder.AddZbTelemetry(o => ...)` with all
required Resource attributes populated; (b) it calls `app.MapZbMetrics()` (or configures
OTLP); (c) it calls `builder.AddZbSerilog(...)` and the `TraceContextEnricher` stamps
`trace_id`/`span_id` on every log event emitted under an active `Activity`; (d) its
`ILogRedactor` implementation (if applicable) is registered and applied by `RedactionEnricher`;
(e) every node in the fleet is distinguishable by `service.name` + `site.id` + `node.role`
in a collector or log aggregator.