docs(observability): spec + METRIC-CONVENTIONS + ZB.MOM.WW.Telemetry shared contract

Author the three normalization docs for the observability component:

- components/observability/spec/SPEC.md — Section 0 scope (normalized vs. per-project), AddZbTelemetry pipeline, shared Resource attribute set, standard instrumentation baseline, exporter conventions, Serilog two-stage bootstrap with identity enrichers and TraceContextEnricher, ILogRedactor redaction seam, per-project migration table, and acceptance criteria.
- components/observability/spec/METRIC-CONVENTIONS.md — meter naming convention (app namespace; MxGateway.Server flagged as convergence target), instrument naming pattern (<app>.<subsystem>.<event>), mandatory duration unit = seconds (MxGateway ms histograms flagged), Resource attribute set table, standard instrumentation baseline, and per-app instrument tables (OtOpcUa 7 instruments + 2 spans; MxGateway 13 counters / 3 histograms / 4 gauges; ScadaBridge TBD).
- components/observability/shared-contract/ZB.MOM.WW.Telemetry.md — paper API for the two packages: ZbTelemetryOptions, ZbExporter enum, AddZbTelemetry (IHostApplicationBuilder + IServiceCollection overloads), ZbResource.Build, MapZbMetrics; AddZbSerilog, ZbLogEnricherNames constants, TraceContextEnricher, ILogRedactor, RedactionEnricher. Consumer matrix and open contract questions included.
@@ -0,0 +1,224 @@
# Observability — Metric conventions (standardized)

Status: **Standardized**. The naming and unit rules every sister project's instruments must
follow. Analogous to [`../../auth/spec/CANONICAL-ROLES.md`](../../auth/spec/CANONICAL-ROLES.md)
for auth and [`../../ui-theme/spec/DESIGN-TOKENS.md`](../../ui-theme/spec/DESIGN-TOKENS.md)
for the UI kit. Authoritative alongside [`SPEC.md`](SPEC.md).

The per-project instrument tables below (§6) document the **existing bespoke surface** — the
instruments each app currently defines or intends to define. These stay per-project; they are
not candidates for the shared library. The rules in §1–§3 govern *how* those instruments must
be named and measured.

---

## 1. Meter name

Each app owns exactly **one primary Meter**, named after its root namespace:

| App | Meter name | Status |
|---|---|---|
| OtOpcUa | `ZB.MOM.WW.OtOpcUa` | Correct today |
| MxGateway | `MxGateway.Server` | ⚠ Convergence target — rename to `ZB.MOM.WW.MxGateway` on adoption |
| ScadaBridge | `ZB.MOM.WW.ScadaBridge` | Target (no meter exists today) |

`MxGateway.Server` is the single convergence item for meter naming. It predates the
`ZB.MOM.WW.*` namespace convention; rename when adopting `AddZbTelemetry`. Instruments
emitted under the old name will require a `recording_rule` or relabel in any Prometheus
config that already scrapes the snapshot — coordinate before renaming in production.

If an app has secondary meters (e.g. a library component with its own meter), those follow
the same pattern: `ZB.MOM.WW.<App>.<Component>`.
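
The naming rule above can be sketched with nothing but the `System.Diagnostics.Metrics` types from the .NET base class library. This is a minimal illustration, not part of the shared contract: the `MeterNameFor` helper and the `Historian` component name are hypothetical and exist only for this example.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical helper: builds a meter name following ZB.MOM.WW.<App>[.<Component>].
static string MeterNameFor(string app, string? component = null) =>
    component is null ? $"ZB.MOM.WW.{app}" : $"ZB.MOM.WW.{app}.{component}";

// One primary Meter per app, named after its root namespace.
using var primary = new Meter(MeterNameFor("OtOpcUa"));

// A secondary meter for a library component ("Historian" is an illustrative name only).
using var secondary = new Meter(MeterNameFor("OtOpcUa", "Historian"));

Console.WriteLine(primary.Name);   // ZB.MOM.WW.OtOpcUa
Console.WriteLine(secondary.Name); // ZB.MOM.WW.OtOpcUa.Historian
```

Keeping the name construction in one place would make the eventual `MxGateway.Server` rename a single-line change, though whether a helper like this belongs in the shared library is an open design choice.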

---

## 2. Instrument name

Instrument names follow the pattern `<app>.<subsystem>.<event>`, all lower-case,
dot-separated:

```
<app> := short app identifier — otopcua | mxgateway | scadabridge
<subsystem> := functional area — deploy | session | tag | alarm | gateway | worker | ...
<event> := what happened or is measured — applied | count | duration | errors | active | ...
```

**Examples:**

| Instrument name | App | Meaning |
|---|---|---|
| `otopcua.deploy.applied` | OtOpcUa | Galaxy deploy events applied to the address space |
| `otopcua.tag.subscriptions` | OtOpcUa | Active OPC UA tag subscriptions |
| `mxgateway.session.active` | MxGateway | Active MxAccess sessions |
| `mxgateway.worker.call.duration` | MxGateway | gRPC call duration to the x86 worker |
| `scadabridge.alarm.received` | ScadaBridge | Alarms received by the DCL |

**Rules:**

1. All lower-case. No camelCase, no PascalCase, no hyphens.
2. Three segments minimum (`<app>.<subsystem>.<event>`). Four are permitted when the
   subsystem warrants a sub-area (e.g. `mxgateway.worker.call.duration`).
3. Event nouns describe **what is counted or measured** (`applied`, `errors`, `active`,
   `duration`), not implementation details (`method_called`, `loop_iteration`).
4. Counters: past-tense or noun (`received`, `errors`, `applied`).
   UpDownCounters / gauges: present-state noun or adjective (`active`, `connected`).
   Histograms: `duration` or a measured quantity noun (`size`, `lag`).
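
The mechanical parts of rules 1 and 2 lend themselves to an automated check, for example in a unit test that walks each app's instrument list. The sketch below is one possible shape, under two assumptions that this document does not state: segments consist of lower-case ASCII letters and digits and start with a letter, and the only valid app prefixes are the three identifiers listed in the grammar. The function name `IsValidInstrumentName` is hypothetical. Rules 3 and 4 are semantic and are not covered.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: a segment is [a-z][a-z0-9]*; the spec only forbids upper-case and hyphens.
// {2,3} after the app prefix gives three segments minimum and four maximum.
var pattern = new Regex(
    @"^(otopcua|mxgateway|scadabridge)(\.[a-z][a-z0-9]*){2,3}\z",
    RegexOptions.CultureInvariant);

bool IsValidInstrumentName(string name) => pattern.IsMatch(name);

Console.WriteLine(IsValidInstrumentName("otopcua.deploy.applied"));         // True
Console.WriteLine(IsValidInstrumentName("mxgateway.worker.call.duration")); // True
Console.WriteLine(IsValidInstrumentName("MxGateway.Session.Active"));       // False (upper-case)
Console.WriteLine(IsValidInstrumentName("mxgateway.worker-call.duration")); // False (hyphen)
Console.WriteLine(IsValidInstrumentName("mxgateway.session"));              // False (two segments)
```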

---

## 3. Units

### Duration — seconds (mandatory)

**All duration histograms MUST use seconds** (`"s"`). This is the OpenTelemetry semantic
convention (`UCUM`: `s`). Backends and dashboards assume seconds; mixing units breaks
aggregations across apps.

> ⚠ **MxGateway convergence item:** `GatewayMetrics.cs` defines three histograms with unit
> `"ms"` (`CommandDuration`, `EventDuration`, `WorkerCallDuration`). These must be migrated
> to `"s"` on adoption. Values must also be converted (divide by 1 000 at the call site).
> Track existing Prometheus `recording_rule`/dashboard changes — any dashboard panel that
> reads these histograms in `ms` will need updating. Until migration is complete, annotate
> the instruments with `// CONVERGENCE: ms→s pending`.

### Other units

| Quantity | Unit string | Notes |
|---|---|---|
| Duration | `"s"` | Mandatory — see above |
| Size / bytes | `"By"` | UCUM bytes |
| Count (dimensionless) | `"1"` or omit | For pure event counts; `"1"` preferred |
| Messages, requests | `"{message}"`, `"{request}"` | UCUM annotation form for dimensioned counts |
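
To make the seconds rule concrete, the following sketch records a duration histogram in `"s"` and shows the divide-by-1000 conversion for call sites that already hold a millisecond value. It uses only `System.Diagnostics.Metrics` and `Stopwatch` from the base class library. The helper `MillisecondsToSeconds` is a hypothetical name, and the snippet is not taken from `GatewayMetrics.cs`.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

using var meter = new Meter("ZB.MOM.WW.MxGateway");

// The unit is declared as "s", so every recorded value must be in seconds as well.
Histogram<double> commandDuration =
    meter.CreateHistogram<double>("mxgateway.command.duration", unit: "s");

// Hypothetical helper for call sites that still measure in milliseconds.
static double MillisecondsToSeconds(double milliseconds) => milliseconds / 1000.0;

var stopwatch = Stopwatch.StartNew();
// ... the measured work would run here ...
stopwatch.Stop();

// Preferred: read seconds directly from the stopwatch, no conversion needed.
commandDuration.Record(stopwatch.Elapsed.TotalSeconds);

// Migration path: convert an existing millisecond value at the call site.
commandDuration.Record(MillisecondsToSeconds(stopwatch.ElapsedMilliseconds));
```

Changing the unit string without converting the values (or the reverse) would silently skew every histogram bucket by a factor of 1 000, which is why the callout above treats the two changes as one migration.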

---

## 4. Resource attribute set (shared across all three signals)

The OTel `Resource` is built once by `AddZbTelemetry` (see [`SPEC.md`](SPEC.md) §2) and
attached to metrics, traces, and OTel-exported logs. The same `SiteId` and `NodeRole` values
populate Serilog enrichers, making a metric, a span, and a log line from the same node
joinable in any OTel-compatible backend.

| OTel attribute | Type | Required | Notes |
|---|---|---|---|
| `service.name` | string | Yes | Short lower-case app id: `otopcua`, `mxgateway`, `scadabridge` |
| `service.namespace` | string | Yes | Always `"ZB.MOM.WW"` — do not override |
| `service.version` | string | Recommended | Populate from `AssemblyInformationalVersion`; absent is better than wrong |
| `site.id` | string | Recommended | Physical or logical site identifier; omit for single-site deployments |
| `node.role` | string | Recommended | Node function: `"central"`, `"site"`, `"hub"`, `"standalone"` |
| `host.name` | string | Auto | Always `Environment.MachineName`; never override |

**Why `site.id` and `node.role` matter:** a ScadaBridge fleet runs N site clusters + one
central cluster, each on different hosts. Without `site.id` and `node.role`, metrics from a
site node and the central node are indistinguishable even if `host.name` differs.

---

## 5. Standard instrumentation baseline

Every app enables this baseline via `AddZbTelemetry`. No opt-out. These are
community-standard instrumentation packages; the overhead is negligible and the benefit
(correlated HTTP / gRPC request traces across the fleet) is high.

| Instrumentation | Signal(s) | OtOpcUa today | MxGateway today | ScadaBridge today |
|---|---|---|---|---|
| ASP.NET Core | Traces + Metrics | ✅ | ✅ | — |
| HttpClient | Traces + Metrics | ✅ | ✅ | — |
| gRPC client | Traces | ✅ | — | — |
| .NET runtime | Metrics | ✅ | — | — |
| Process | Metrics | ✅ | — | — |

OtOpcUa already enables all five. MxGateway and ScadaBridge add the missing ones through
`AddZbTelemetry`. No project removes any of these.

---

## 6. Per-app instrument surface (bespoke — stays per project)

These instruments are **not part of the shared library**. They document the existing bespoke
surface that each project registers through `o.Meters` / `o.ActivitySources` in `AddZbTelemetry`.

### 6.1 OtOpcUa — `ZB.MOM.WW.OtOpcUa` meter

Source: `src/Core/ZB.MOM.WW.OtOpcUa.Commons/Observability/OtOpcUaTelemetry.cs`

| Instrument | Kind | Unit | Description |
|---|---|---|---|
| `otopcua.deploy.applied` | Counter | `"1"` | Galaxy deploy events applied to the OPC UA address space |
| `otopcua.deploy.failed` | Counter | `"1"` | Deploy events that failed processing |
| `otopcua.tag.subscriptions` | UpDownCounter | `"1"` | Active OPC UA tag subscriptions |
| `otopcua.tag.reads` | Counter | `"1"` | Tag read operations |
| `otopcua.tag.writes` | Counter | `"1"` | Tag write operations |
| `otopcua.session.active` | UpDownCounter | `"1"` | Active OPC UA sessions |
| `otopcua.connection.gateway` | UpDownCounter | `"1"` | Active gRPC channels to MxAccessGateway |

**ActivitySources (spans):**

| Source name | Span(s) |
|---|---|
| `ZB.MOM.WW.OtOpcUa` | `DeployWatcher.Apply`, `GalaxyDriver.BrowseHierarchy` |

All durations already use `"s"` — no convergence item for OtOpcUa.

### 6.2 MxGateway — `MxGateway.Server` meter (→ target: `ZB.MOM.WW.MxGateway`)

Source: `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs`

**Counters (13):**

| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.created` | `"1"` | MxAccess sessions opened |
| `mxgateway.session.closed` | `"1"` | MxAccess sessions closed |
| `mxgateway.session.errors` | `"1"` | Session creation/teardown errors |
| `mxgateway.command.invoked` | `"1"` | MxAccess command invocations |
| `mxgateway.command.errors` | `"1"` | Command invocation errors |
| `mxgateway.event.received` | `"1"` | MxAccess events received from worker |
| `mxgateway.event.errors` | `"1"` | Event processing errors |
| `mxgateway.worker.started` | `"1"` | x86 worker processes started |
| `mxgateway.worker.stopped` | `"1"` | x86 worker processes stopped |
| `mxgateway.worker.errors` | `"1"` | Worker communication errors |
| `mxgateway.galaxy.browse.requests` | `"1"` | Galaxy Repository browse RPCs |
| `mxgateway.galaxy.browse.errors` | `"1"` | Galaxy browse errors |
| `mxgateway.auth.failures` | `"1"` | Authentication failures |

**Histograms (3):**

| Instrument | Target unit | Current unit | Convergence |
|---|---|---|---|
| `mxgateway.command.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.event.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |
| `mxgateway.worker.call.duration` | `"s"` | `"ms"` | ⚠ Convert ms→s on adoption |

**Gauges (4):**

| Instrument | Unit | Description |
|---|---|---|
| `mxgateway.session.active` | `"1"` | Current active MxAccess sessions |
| `mxgateway.worker.active` | `"1"` | Current running x86 worker processes |
| `mxgateway.worker.memory` | `"By"` | Worker process RSS |
| `mxgateway.galaxy.nodes.cached` | `"1"` | Galaxy Repository nodes in browse cache |

No ActivitySources today (no tracing). Adding `ZB.MOM.WW.MxGateway` as an ActivitySource
is left per-project (deferred to GAPS backlog).

### 6.3 ScadaBridge — `ZB.MOM.WW.ScadaBridge` meter

No meter or instruments exist today (`OpenTelemetry.Api` is a dangling ref). The target
meter name `ZB.MOM.WW.ScadaBridge` is reserved. Instruments are defined as part of the
ScadaBridge adoption tracked in [`../GAPS.md`](../GAPS.md).

---

## Consequences and convergence items (accepted)

| Item | Scope | Severity |
|---|---|---|
| MxGateway meter rename `MxGateway.Server` → `ZB.MOM.WW.MxGateway` | MxGateway adoption | Breaking — requires relabeling in Prometheus config and dashboards |
| MxGateway histogram unit `ms` → `s` (3 instruments) | MxGateway adoption | Breaking — values change by factor 1 000; dashboards need updating |
| ScadaBridge instrument set TBD | ScadaBridge adoption | No existing surface to converge — define from scratch |

All three items are tracked as backlog entries in [`../GAPS.md`](../GAPS.md). The ms→s
migration is the highest-priority convergence item because leaving it unresolved means
MxGateway's durations are incompatible with OtOpcUa's in any shared Prometheus / Grafana
workspace.
@@ -0,0 +1,177 @@
# Observability — normalized target spec

Status: **Draft**. The single design the sister projects converge on. Derived from the
three code-verified current-state docs (`../current-state/`). Goal is *path to shared code*
(`../shared-contract/ZB.MOM.WW.Telemetry.md`), so each normalized section maps to a shared
library seam.

## 0. Scope

**Normalized here:** one OpenTelemetry bootstrap across all three signals (metrics + traces +
logs) via a single `AddZbTelemetry` extension; the shared `Resource` attribute set
(`service.name` / `service.namespace` / `service.version` / `site.id` / `node.role` /
`host.name`) that makes every node distinguishable in a collector; standard instrumentation
everyone enables (ASP.NET Core, HttpClient, gRPC client, runtime, process meters); exporter
conventions (Prometheus scrape endpoint default, OTLP opt-in); a shared Serilog bootstrap
with identity enrichers (`SiteId`, `NodeRole`, `NodeHostname`) bound from the same options
object as the OTel Resource (metrics and logs therefore carry identical dimensions); a
`TraceContextEnricher` that stamps `trace_id`/`span_id` from `Activity.Current` onto every
Serilog event, enabling log↔trace correlation; an `ILogRedactor` redaction seam.

**Explicitly NOT normalized** (domain-specific — keep per project): each app's actual
instruments — `otopcua.*` meters and spans, `mxgateway.*` counters/histograms/gauges — they
are registered *through* the shared bootstrap but their names and semantics remain
bespoke (see [`METRIC-CONVENTIONS.md`](METRIC-CONVENTIONS.md) §6); the redaction *policy*
(which field names, which command types) — only the `ILogRedactor` seam is shared, each
project supplies its own implementation; the MxGateway net48 x86 worker's `IWorkerLogger`
(stderr key=value format, out-of-process, out of scope).

## 1. OpenTelemetry pipeline — `AddZbTelemetry`

A single `IHostApplicationBuilder` extension is the front door for all three OTel signals.
It wires the shared `Resource`, registers standard instrumentation, and configures the
selected exporter:

```csharp
builder.AddZbTelemetry(o =>
{
    o.ServiceName = "mxgateway";               // populates Resource service.name
    o.ServiceNamespace = "ZB.MOM.WW";          // constant across the fleet (default)
    o.ServiceVersion = "1.0.0";                // literal for illustration; normally read from AssemblyInformationalVersion
    o.SiteId = cfg.SiteId;                     // Resource site.id + Serilog SiteId property
    o.NodeRole = cfg.NodeRole;                 // Resource node.role + Serilog NodeRole property
    o.Meters = ["MxGateway.Server"];           // app's own Meter name(s)
    o.ActivitySources = ["MxGateway.Server"];  // app's own ActivitySource name(s)
    o.Exporter = ZbExporter.Prometheus;        // default; ZbExporter.Otlp opt-in
    // o.OtlpEndpoint = "http://collector:4317"; // required when Exporter = Otlp
});

app.MapZbMetrics(); // mounts Prometheus /metrics scrape endpoint
```

This is the headline fix: nobody in the fleet sets a `Resource` or `service.name` today,
making every node indistinguishable in a collector. Every project must call `AddZbTelemetry`
to be observable.

## 2. Shared Resource

The OTel `Resource` attached to all three signals is built from `ZbTelemetryOptions`:

| OTel attribute | Options property | Notes |
|---|---|---|
| `service.name` | `ServiceName` | Required. Lower-case short identifier (`otopcua`, `mxgateway`, `scadabridge`) |
| `service.namespace` | `ServiceNamespace` | Default `"ZB.MOM.WW"` — constant across the fleet |
| `service.version` | `ServiceVersion` | Optional; recommend populating from `AssemblyInformationalVersion` |
| `site.id` | `SiteId` | Optional; identifies the physical/logical site |
| `node.role` | `NodeRole` | Optional; e.g. `"central"`, `"site"`, `"hub"` |
| `host.name` | _(auto)_ | Always populated from `Environment.MachineName` |

The same `SiteId` and `NodeRole` values are passed to the Serilog enrichers (§5) so a
metric, a span, and a log line from the same node carry identical dimensions and join up in
any OTel-compatible backend.
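
The mapping in the table can be illustrated without any OpenTelemetry dependency. The sketch below is a stand-in, not the real `ZbResource.Build`: the function `BuildResourceAttributes`, its parameter list, and the sample value `site-07` are all hypothetical. It shows one way to honour the table: `service.name` is required, `service.namespace` defaults to `"ZB.MOM.WW"`, `host.name` always comes from `Environment.MachineName`, and optional attributes are left out when unset instead of being emitted empty.

```csharp
using System;
using System.Collections.Generic;

// Example call with hypothetical values; service.version is deliberately unset.
var resource = BuildResourceAttributes(
    serviceName: "scadabridge", serviceVersion: null, siteId: "site-07", nodeRole: "site");

foreach (var (key, value) in resource)
    Console.WriteLine($"{key}={value}");

// Hypothetical stand-in for the attribute mapping described in the table above.
static IReadOnlyDictionary<string, object> BuildResourceAttributes(
    string serviceName, string? serviceVersion, string? siteId, string? nodeRole,
    string serviceNamespace = "ZB.MOM.WW")
{
    if (string.IsNullOrWhiteSpace(serviceName))
        throw new ArgumentException("service.name is required.", nameof(serviceName));

    var attributes = new Dictionary<string, object>
    {
        ["service.name"] = serviceName,
        ["service.namespace"] = serviceNamespace,
        ["host.name"] = Environment.MachineName, // always automatic, never overridden
    };

    // Optional attributes are omitted when unset: absent is better than wrong.
    if (!string.IsNullOrWhiteSpace(serviceVersion)) attributes["service.version"] = serviceVersion;
    if (!string.IsNullOrWhiteSpace(siteId)) attributes["site.id"] = siteId;
    if (!string.IsNullOrWhiteSpace(nodeRole)) attributes["node.role"] = nodeRole;

    return attributes;
}
```

Because the Serilog identity enrichers read the same option values, building the attribute set from a single source is what keeps the log and metric dimensions identical.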

## 3. Standard instrumentation

`AddZbTelemetry` enables the following instrumentation for all projects. Any project that
already enables a subset gets it consolidated; no project may skip this baseline:

| Instrumentation | Package | Signal |
|---|---|---|
| ASP.NET Core | `OpenTelemetry.Instrumentation.AspNetCore` | Traces + Metrics |
| HttpClient | `OpenTelemetry.Instrumentation.Http` | Traces + Metrics |
| gRPC client | `OpenTelemetry.Instrumentation.GrpcNetClient` | Traces |
| .NET runtime | `OpenTelemetry.Instrumentation.Runtime` | Metrics |
| Process | `OpenTelemetry.Instrumentation.Process` | Metrics |

App-specific `Meter` names and `ActivitySource` names are registered via `o.Meters` and
`o.ActivitySources`. This is how MxGateway's hand-rolled `GatewayMetrics` finally gets an
export path instead of dying in an in-memory `GetSnapshot()`.

## 4. Exporter conventions

`ZbTelemetryOptions.Exporter` selects the export path:

| Value | Behaviour |
|---|---|
| `ZbExporter.Prometheus` | Mounts a Prometheus `/metrics` scrape endpoint via `app.MapZbMetrics()`. Default for all three apps — consistent with OtOpcUa's existing `/metrics`. |
| `ZbExporter.Otlp` | Exports to an OTLP endpoint specified by `o.OtlpEndpoint` (gRPC, `http://collector:4317`). Opt-in path to a real OTel Collector; coexists with Prometheus. |

Both exporters carry the shared `Resource`. OTLP is the path to a real backend (Tempo,
Prometheus-remote-write, Loki); Prometheus covers the "scrape from the node" case that all
three apps currently use or aspire to.

## 5. Serilog logging stack

`AddZbSerilog` is a companion extension in the `.Serilog` package. It replaces each
project's bespoke logging bootstrap with a shared two-stage pattern:

**Stage 1 (bootstrap logger):** a minimal `Log.Logger` for startup errors before the
`IConfiguration` is available. Writes to console only.

**Stage 2 (application logger):** reads sinks and overrides from `IConfiguration`
(`ReadFrom.Configuration`) and applies a set of fixed enrichers:

| Enricher | Property name | Source |
|---|---|---|
| `ZbLogEnricherNames.SiteId` | `"SiteId"` | `ZbTelemetryOptions.SiteId` |
| `ZbLogEnricherNames.NodeRole` | `"NodeRole"` | `ZbTelemetryOptions.NodeRole` |
| `ZbLogEnricherNames.NodeHostname` | `"NodeHostname"` | `Environment.MachineName` |
| `TraceContextEnricher` | `trace_id`, `span_id` | `Activity.Current` |
| `RedactionEnricher` | _(project-defined fields)_ | `ILogRedactor` implementation |

The three identity properties (`SiteId`, `NodeRole`, `NodeHostname`) are bound from the
same `ZbTelemetryOptions` object as the OTel `Resource`, so logs and metrics/traces carry
identical dimensions. When no `Activity.Current` is present (e.g. background services,
startup), `TraceContextEnricher` emits nothing — it does not inject empty or zero values.

`MinimumLevel` is set explicitly in code (default `Information`) and can be overridden via
`IConfiguration` (`Serilog:MinimumLevel`). Sinks are fully config-driven:
`ReadFrom.Configuration` reads `Serilog:WriteTo` from `appsettings.json` / environment.

OTel log export is wired in the same call: logs flow through the OTel pipeline with the
same `Resource` attached, making all three signals (metrics / traces / logs) available in a
single backend.
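
The "emit nothing when there is no `Activity.Current`" behaviour described above for `TraceContextEnricher` can be sketched with the base class library alone. The function `CaptureTraceContext` below is hypothetical and is not a Serilog enricher; in the shared library the equivalent logic would presumably live inside a Serilog `ILogEventEnricher`. The extra check for the W3C id format is an assumption added here so that non-W3C activities, whose `TraceId` is all zeros, also produce no properties.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// No ambient Activity at startup: nothing is captured.
Console.WriteLine(CaptureTraceContext().Count); // 0

// Under an active W3C-format Activity both identifiers are present.
Activity.DefaultIdFormat = ActivityIdFormat.W3C;
Activity.ForceDefaultIdFormat = true;
using (var activity = new Activity("demo-operation").Start())
{
    foreach (var (key, value) in CaptureTraceContext())
        Console.WriteLine($"{key}={value}");
}

// Hypothetical helper mirroring what the enricher is described as doing.
static IReadOnlyDictionary<string, string> CaptureTraceContext()
{
    var properties = new Dictionary<string, string>();
    var current = Activity.Current;

    // Return an empty set instead of injecting empty or zero-valued identifiers.
    if (current is null || current.IdFormat != ActivityIdFormat.W3C)
        return properties;

    properties["trace_id"] = current.TraceId.ToHexString(); // 32 hex characters
    properties["span_id"] = current.SpanId.ToHexString();   // 16 hex characters
    return properties;
}
```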

## 6. Redaction seam — `ILogRedactor`

`ILogRedactor` is a single-method interface that receives the mutable log-event property
dictionary and scrubs any fields that must not leave the process:

```csharp
public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}
```

`RedactionEnricher` applies a registered `ILogRedactor` on every log event. The seam is
shared; the **policy** is per-project (which field names, which command types, which
classification levels). MxGateway's existing `GatewayLogRedactor` is the reference
implementation; it migrates to this seam during adoption. If no `ILogRedactor` is
registered, `RedactionEnricher` is a no-op.

This preserves the operational property MxGateway already has (secrets never leave the
process in log events) while making the plumbing reusable.
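
As an illustration of what a per-project policy behind this seam might look like, the sketch below implements the interface with a simple field-name mask. It is not `GatewayLogRedactor`: the class `FieldNameRedactor`, the `"***"` replacement value, the case-insensitive matching, and the sample field names are all assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

var redactor = new FieldNameRedactor(new[] { "Password", "ApiKey" });
var properties = new Dictionary<string, object?>
{
    ["User"] = "operator1",
    ["Password"] = "hunter2",
};

redactor.Redact(properties);
Console.WriteLine(properties["User"]);     // operator1
Console.WriteLine(properties["Password"]); // ***

// Interface as given in the spec above.
public interface ILogRedactor
{
    void Redact(IDictionary<string, object?> properties);
}

// Hypothetical policy: mask every property whose name is on a deny list.
public sealed class FieldNameRedactor : ILogRedactor
{
    private readonly HashSet<string> _sensitive;

    public FieldNameRedactor(IEnumerable<string> sensitiveFieldNames) =>
        _sensitive = new HashSet<string>(sensitiveFieldNames, StringComparer.OrdinalIgnoreCase);

    public void Redact(IDictionary<string, object?> properties)
    {
        // Snapshot the keys so values can be replaced while iterating.
        foreach (var key in new List<string>(properties.Keys))
        {
            if (_sensitive.Contains(key))
                properties[key] = "***";
        }
    }
}
```

Masking the value in place, as opposed to removing the key, keeps the shape of the log event stable for downstream queries; a project could equally choose to drop the property entirely.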

## 7. Per-project migration

| Project | Current state | Primary gaps | What normalizes |
|---|---|---|---|
| **OtOpcUa** | Full OTel SDK (`WithMetrics` + `WithTracing`); Prometheus `/metrics`; Serilog bootstrap; 7 instruments + 2 spans. | No `Resource` / `service.name` anywhere; no trace↔log correlation; no `SiteId`/`NodeRole` enrichers. | Call `AddZbTelemetry` (adds Resource; consolidates standard instrumentation); call `AddZbSerilog` (adds `TraceContextEnricher` + identity enrichers); migrate existing Serilog bootstrap to shared two-stage pattern. |
| **MxGateway** | Hand-rolled `GatewayMetrics` (13 counters / 3 histograms `ms` / 4 gauges); in-memory snapshot only — no export; MEL logging with `GatewayLogScope` correlation + `GatewayLogRedactor`; no OTel SDK. | No OTel SDK; no export; `ms` histograms diverge from OTel semconv (`s`); MEL → Serilog migration; no Resource. | Call `AddZbTelemetry` (wires OTel SDK around existing `GatewayMetrics` — finally exports); call `AddZbSerilog` (replaces MEL; re-expresses `GatewayLogScope` as `LogContext.PushProperty`; moves `GatewayLogRedactor` behind `ILogRedactor`). Duration unit convergence (`ms`→`s`) tracked in GAPS. **This is the one adoption done now.** |
| **ScadaBridge** | `OpenTelemetry.Api` ref only (dangling — CVE-patch origin, zero usage); Serilog bootstrap (`LoggerConfigurationFactory`) with `SiteId`/`NodeRole`/`NodeHostname` enrichers. | No OTel SDK; no metrics; no tracing; no export; no trace↔log correlation. ScadaBridge's enricher property names are already the target names — migration is additive. | Call `AddZbTelemetry` (adds OTel SDK + metrics + traces + export); call `AddZbSerilog` (consolidates `LoggerConfigurationFactory`; adds `TraceContextEnricher`). |

> The MxGateway logging migration (`MEL → Serilog`, re-expressing `GatewayLogRedactor`
> behind `ILogRedactor`) is the **only sister-repo touch in scope for this release**. OtOpcUa
> and ScadaBridge adoption is deferred to the follow-on tracked in
> [`../GAPS.md`](../GAPS.md).

## 8. Acceptance (what "converged" means)

A project is converged when: (a) it calls `builder.AddZbTelemetry(o => ...)` with all
required Resource attributes populated; (b) it calls `app.MapZbMetrics()` (or configures
OTLP); (c) it calls `builder.AddZbSerilog(...)` and the `TraceContextEnricher` stamps
`trace_id`/`span_id` on every log event emitted under an active `Activity`; (d) its
`ILogRedactor` implementation (if applicable) is registered and applied by `RedactionEnricher`;
(e) every node in the fleet is distinguishable by `service.name` + `site.id` + `node.role`
in a collector or log aggregator.