215a646e35
C1: NodeHostname is AUTO throughout. Shared-contract AddZbSerilog doc comment now reads
"SiteId + NodeRole from ZbTelemetryOptions; NodeHostname from Environment.MachineName (auto)".
SPEC.md §0 and §5 prose updated to match. ScadaBridge adoption snippet no longer sets
o.NodeHostname (removed; NodeHostname is auto, not caller-supplied).
C2: METRIC-CONVENTIONS §6.1 OtOpcUa instrument table replaced with code-verified set:
counters otopcua.deploy.applied / driver.lifecycle / virtualtag.eval / scriptedalarm.transition /
opcua.sink.write / redundancy.service_level_change; histogram otopcua.deploy.apply.duration (s);
ActivitySource ZB.MOM.WW.OtOpcUa with spans otopcua.deploy.apply + otopcua.opcua.address_space_rebuild.
Removed invented names (deploy.failed, tag.subscriptions, tag.reads, tag.writes, session.active,
connection.gateway).
C3: METRIC-CONVENTIONS §6.2 MxGateway instrument table replaced with code-verified names from
GatewayMetrics.cs: 13 counters (sessions.opened/closed, commands.started/succeeded/failed,
events.received, queues.overflows, faults, workers.killed/exited, heartbeats.failed,
grpc.streams.disconnected, retries.attempted); 3 histograms ms (workers.startup.duration,
commands.duration, events.stream_send.duration); 4 gauges (sessions.open, workers.running,
events.worker_queue.depth, events.grpc_stream_queue.depth). Removed invented names.
m3: §2 example table replaced mxgateway.session.active + mxgateway.worker.call.duration
(invented) with mxgateway.sessions.open + mxgateway.commands.duration (real). Also fixed
the §2 rule-2 body text example which referenced mxgateway.worker.call.duration.
I4: §5 standard instrumentation table corrected — OtOpcUa now shows ⛔ not added for all
five baseline instrumentations, matching current-state/otopcua. All three projects lack
standard instrumentation today; AddZbTelemetry adds it on adoption.
I1+m1: GAPS.md "Decisions still open" — removed the two settled questions (Prometheus-default
and ms→s/meter-rename bundling). Moved them to a new "Decisions settled" section with explicit
resolution notes. One genuinely open question remains (SiteId/NodeRole config binding path).
I2: SPEC.md §5 AddZbSerilog: added note that AddZbSerilog reads Serilog:MinimumLevel from
IConfiguration; callers with a different config key (e.g. ScadaBridge:Logging:MinimumLevel)
apply that override themselves — stays per-project. Shared-contract doc comment updated to match.
I3: MxAccessGateway adoption plan Meters = ["MxGateway.Server"] annotated as temporary with
note to update to ZB.MOM.WW.MxGateway when Gap N1 (Meter-rename) is closed.
m2: SPEC.md §1 now notes AddZbTelemetry also has an IServiceCollection overload for non-standard
hosts, with the IHostApplicationBuilder overload as the primary path.
184 lines
11 KiB
Markdown
# Observability: gaps & adoption backlog

Divergence of each project from [`spec/SPEC.md`](spec/SPEC.md), and the ordered backlog to
reach the shared `ZB.MOM.WW.Telemetry` library. Status legend: ⛔ gap · 🟡 partial · ✅ matches.

## Divergence vs spec

### §1 OTel Resource / `service.name` (P1, nobody has it)

| Spec attribute | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `service.name` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.namespace` | ⛔ not set | ⛔ not set | ⛔ not set |
| `service.version` | ⛔ not set | ⛔ not set | ⛔ not set |
| `site.id` | ⛔ not set | ⛔ not set | ⛔ not set |
| `node.role` | ⛔ not set | ⛔ not set | ⛔ not set |
| `host.name` | ⛔ not set | ⛔ not set | ⛔ not set |

No project configures a `ResourceBuilder` or equivalent. Every metric and span from every node is
**indistinguishable** in a backend: no service identity, no topology (site/role), no version label.
This is the single highest-value gap across the fleet; closing it requires only adding
`AddZbTelemetry` with options.

→ **Gap R1 (P1):** All three projects must call `AddZbTelemetry` with `ServiceName`, `SiteId`,
`NodeRole` options to populate the shared Resource. None can do so until the library is available.

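
For illustration, the six Resource attributes from the §1 table could be assembled as below. This is a minimal sketch, not the library's implementation: `ResourceAttributesSketch` and its `Build` method are hypothetical names, the `service.namespace` value `ZB.MOM.WW` is an assumption based on the product prefix mentioned in §6, and `host.name` is taken from `Environment.MachineName`. Only BCL types are used so the sketch runs without the OpenTelemetry packages.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

public static class ResourceAttributesSketch
{
    // Hypothetical helper: returns the six attributes from the §1 table.
    // serviceName, siteId and nodeRole are caller-supplied; the rest are derived.
    public static IReadOnlyList<KeyValuePair<string, object>> Build(
        string serviceName, string siteId, string nodeRole)
    {
        // Falls back to "0.0.0" when no entry assembly version is available.
        var version = Assembly.GetEntryAssembly()?.GetName().Version?.ToString() ?? "0.0.0";

        return new List<KeyValuePair<string, object>>
        {
            new("service.name", serviceName),
            new("service.namespace", "ZB.MOM.WW"), // assumption: the product prefix
            new("service.version", version),
            new("site.id", siteId),
            new("node.role", nodeRole),
            new("host.name", Environment.MachineName),
        };
    }
}
```

With the OpenTelemetry SDK referenced, a list of this shape is what `ResourceBuilder.AddAttributes` accepts, so every metric and span would carry the same identity labels.
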
### §2 Metrics export (P1, MxGateway metrics are invisible)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| OTel SDK present | ✅ (`AddOpenTelemetry`) | ⛔ none | ⛔ none (`OpenTelemetry.Api` only) |
| Meter registered | ✅ `ZB.MOM.WW.OtOpcUa` | 🟡 `MxGateway.Server` (not via OTel SDK) | ⛔ no Meter |
| Prometheus export | ✅ `/metrics` | ⛔ `GetSnapshot()` only | ⛔ none |
| OTLP export | ⛔ not available | ⛔ not available | ⛔ not available |

MxGateway has 20 production instruments (13 counters, 3 histograms, 4 gauges) recording real data
that is **never exported**. The data lives only in an in-memory `GetSnapshot()` read path. Adding
`AddZbTelemetry` with `Meters = ["MxGateway.Server"]` closes this gap without touching
`GatewayMetrics.cs`.

→ **Gap M1 (P1):** MxGateway: wire OTel SDK via `AddZbTelemetry` so `GatewayMetrics` exports.
This is one half of the in-pass adoption (logging migration is the other).

→ **Gap M2:** ScadaBridge: define a `ScadaBridgeTelemetry` class and first application instruments
(`scadabridge.*`); register via `AddZbTelemetry`. Currently zero instrumentation.

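
Why instruments can record without anything being exported: a `Meter` only delivers measurements to listeners that have subscribed to it by name, which is the role the OTel SDK plays once a meter name is registered. The sketch below shows that mechanism with the BCL `MeterListener`. The class name `MeterExportSketch` is hypothetical and the counter name is illustrative; the real instruments live in `GatewayMetrics.cs`.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterExportSketch
{
    // Records two measurements on a meter and returns the ones a listener observed.
    public static List<long> RecordWithListener(string meterName)
    {
        var seen = new List<long>();
        using var meter = new Meter(meterName);
        var counter = meter.CreateCounter<long>("mxgateway.sessions.opened"); // illustrative name

        counter.Add(1); // No listener is subscribed yet, so this measurement goes nowhere.

        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            // Subscribe only to instruments of the named meter, as an SDK does for its configured meters.
            if (instrument.Meter.Name == meterName)
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => seen.Add(value));
        listener.Start(); // Also publishes instruments that were created before Start().

        counter.Add(2); // Observed by the listener.
        return seen;
    }
}
```

Calling `MeterExportSketch.RecordWithListener("MxGateway.Server")` returns only the second measurement, which is the situation Gap M1 describes: the instruments work, but nothing listens until the SDK is wired.
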
### §3 MxGateway logging: MEL → Serilog (P1, in-pass adoption, done in this task)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Logging framework | ✅ Serilog | ⛔ MEL only | ✅ Serilog |
| Structural enrichers | 🟡 driver-scope only | ⛔ MEL scope (provider-dependent) | ✅ `SiteId`/`NodeRole`/`NodeHostname` |
| Correlation mechanism | `LogContextEnricher.Push` | `GatewayLogScope` + `BeginScope` | structural enrichers |
| Log redaction | ⛔ none | ✅ `GatewayLogRedactor` | ⛔ none |
| `ILogRedactor` seam | ⛔ none | ⛔ bespoke | ⛔ none |

**Status: done in this task (Task #9).** The MxGateway logging migration (MEL → `AddZbSerilog`,
`GatewayLogScope` → `LogContext.PushProperty`, `GatewayLogRedactor` → `ILogRedactor` seam) is
executed as part of the `ZB.MOM.WW.Telemetry` library build. See the adoption plan in
[`current-state/mxaccessgw/CURRENT-STATE.md`](current-state/mxaccessgw/CURRENT-STATE.md)
for the exact changes.

→ **Gap L1 (in-pass, done):** MxGateway MEL → `AddZbSerilog` + `LogContext` correlation + `ILogRedactor`.

### §4 Trace↔log correlation (nobody has it)

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| `trace_id`/`span_id` in logs | ⛔ absent | ⛔ absent (no spans) | ⛔ absent (no spans) |
| Mechanism | `TraceContextEnricher` not wired | n/a | n/a |

OtOpcUa creates spans (`otopcua.deploy.apply`, `otopcua.opcua.address_space_rebuild`) but never
pushes `Activity.Current`'s trace context onto Serilog's `LogContext`. A log emitted inside a span
cannot be joined to the span in a backend. MxGateway and ScadaBridge have no spans at all.

The shared `TraceContextEnricher` (part of `ZB.MOM.WW.Telemetry.Serilog`) closes this for all
three projects on `AddZbSerilog` adoption; it is wired automatically.

→ **Gap C1:** OtOpcUa: adopt `AddZbSerilog` to wire `TraceContextEnricher`; spans and logs become
joinable once the enricher is active.

→ **Gap C2:** ScadaBridge/MxGateway: enricher wired on adoption; effective once spans are created.

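
The correlation step itself is small: read the ambient `Activity.Current` and attach its IDs to the log event. The sketch below shows only that step using BCL types; it is not the shared `TraceContextEnricher`, and `TraceContextSketch` is a hypothetical name. A real Serilog enricher would add the same two values as log event properties. It assumes the W3C ID format, which is the default on modern .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class TraceContextSketch
{
    // Returns the properties an enricher would attach for the current span,
    // or an empty dictionary when no span is active.
    public static IReadOnlyDictionary<string, string> CurrentTraceProperties()
    {
        var props = new Dictionary<string, string>();
        var activity = Activity.Current;
        if (activity is null) return props;

        props["trace_id"] = activity.TraceId.ToHexString();
        props["span_id"] = activity.SpanId.ToHexString();
        return props;
    }

    // Starts a span the way OtOpcUa does and captures the properties inside it.
    public static IReadOnlyDictionary<string, string> Demo()
    {
        // Without a listener, StartActivity returns null and no span exists.
        using var listener = new ActivityListener
        {
            ShouldListenTo = _ => true,
            Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData,
        };
        ActivitySource.AddActivityListener(listener);

        using var source = new ActivitySource("ZB.MOM.WW.OtOpcUa");
        using var span = source.StartActivity("otopcua.deploy.apply");
        return CurrentTraceProperties();
    }
}
```

Outside a span the helper returns nothing, which is why Gap C2 only becomes effective once MxGateway and ScadaBridge create spans.
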
### §5 Duration unit: `ms` vs `s`

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Histogram unit | ✅ `s` | ⛔ `ms` (`workers.startup.duration`, `commands.duration`, `events.stream_send.duration`) | n/a (no histograms) |

MxGateway's three histograms record in milliseconds. OTel semantic conventions (and OtOpcUa) use
seconds. Raw values differ by a factor of 1000; without normalization, dashboards and SLO alerts
would need separate scaling rules.

→ **Gap U1:** MxGateway: convert the three histogram call sites to record in seconds (multiply by
`0.001` or redefine the instruments). This is a **breaking change to existing dashboards** and is
tagged as a convergence item separate from the initial adoption.

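
A minimal sketch of what the Gap U1 conversion could look like at a call site, assuming the histogram is redefined with unit `s` and elapsed time is converted before recording. `DurationUnitSketch` and its members are hypothetical; the instrument name is one of the three histograms listed in the table.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class DurationUnitSketch
{
    // Converts an elapsed time in milliseconds to seconds.
    public static double ToSeconds(double elapsedMilliseconds) => elapsedMilliseconds / 1000.0;

    // Creates the histogram with the unit declared as seconds.
    public static Histogram<double> CreateSecondsHistogram(Meter meter) =>
        meter.CreateHistogram<double>(
            "mxgateway.commands.duration",
            unit: "s",
            description: "Command duration in seconds.");

    // Records a duration that the existing call site measured in milliseconds.
    public static void RecordCommand(Histogram<double> histogram, double elapsedMilliseconds) =>
        histogram.Record(ToSeconds(elapsedMilliseconds));
}
```

A command that took 250 ms is recorded as `0.25` under this scheme. Any dashboard threshold written against the old values needs the same division, which is why the change is coordinated with ops.
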
### §6 Meter naming: `MxGateway.Server` vs namespace convention

| | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| Meter name | ✅ `ZB.MOM.WW.OtOpcUa` | ⛔ `MxGateway.Server` | n/a |

The fleet convention (per `spec/SPEC.md`) is `<project-namespace>`: OtOpcUa uses
`ZB.MOM.WW.OtOpcUa`; the gateway's meter should be `ZB.MOM.WW.MxGateway`. `MxGateway.Server` is
the assembly name, not the namespace, and does not carry the `ZB.MOM.WW` product prefix.

→ **Gap N1:** MxGateway: rename Meter from `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"`.
This is a **Prometheus metric label change** that breaks dashboards/alerts and is tracked
separately from the initial adoption.

### §7 Standard instrumentation (nobody has any of it)

| Instrumentation | OtOpcUa | MxAccessGateway | ScadaBridge |
|---|---|---|---|
| ASP.NET Core request metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| `HttpClient` metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| Runtime / process metrics | ⛔ not added | ⛔ not added | ⛔ not added |
| gRPC client metrics | ⛔ not added | ⛔ not added | n/a |

`AddZbTelemetry` enables these via `AddAspNetCoreInstrumentation`, `AddHttpClientInstrumentation`,
`AddRuntimeInstrumentation`, and `AddProcessInstrumentation` by default, so all three projects gain
them on adoption without code changes.

→ **Gap S1:** all three projects lack standard instrumentation; closed automatically by
`AddZbTelemetry` adoption.

### §8 Serilog version split (`Serilog.AspNetCore` 8 vs 9)

OtOpcUa and ScadaBridge both use `Serilog.AspNetCore`, but may be on different versions because
their csproj files are updated independently. The shared `ZB.MOM.WW.Telemetry.Serilog` package must
declare a `Serilog.AspNetCore` version floor that works for both. Verify version alignment on adoption.

→ **Gap V1:** confirm `Serilog.AspNetCore` version compatibility across all three projects on
`AddZbSerilog` adoption; align if diverged.

### §9 ScadaBridge: zero instrumentation

ScadaBridge has no application instruments today. The `OpenTelemetry.Api` reference exists only as
a CVE patch, not as instrumentation. No counter, histogram, gauge, or span exists anywhere in the
solution.

→ **Gap I1:** ScadaBridge: define `ScadaBridgeTelemetry` with a `ZB.MOM.WW.ScadaBridge` Meter and
a first set of `scadabridge.*` instruments. Tracked as a follow-on (application-specific work,
not shared-library work).

## Adoption backlog (ordered)

| # | Item | Projects | Priority | Effort | Risk | Notes |
|---|---|---|---|---|---|---|
| 1 | MxGateway MEL → `AddZbSerilog`: `LogContext` correlation + `ILogRedactor` (Gap L1) | MxGateway | P1 | M | Low | **In-pass, done in Task #9**; unblocked by library build |
| 2 | MxGateway: wire OTel SDK via `AddZbTelemetry`; `GatewayMetrics` begins exporting (Gap M1) | MxGateway | P1 | S | Low | Bundled with #1 in Task #9 |
| 3 | All: `AddZbTelemetry` with `ServiceName`/`SiteId`/`NodeRole` → shared Resource (Gap R1) | OtOpcUa, ScadaBridge | P1 | S | Low | MxGateway covered by #2; others are follow-on |
| 4 | OtOpcUa: adopt `AddZbSerilog` + `TraceContextEnricher` (Gaps C1, V1) | OtOpcUa | P2 | S | Low | Keep `LogContextEnricher`; add shared enrichers alongside |
| 5 | ScadaBridge: adopt `AddZbSerilog`; replace `LoggerConfigurationFactory` (Gaps C2, V1) | ScadaBridge | P2 | S | Low | Enricher names already match; `LoggerConfigurationFactory` deleted |
| 6 | MxGateway: histogram `ms` → `s` conversion (Gap U1) | MxGateway | P2 | S | Med | Breaking dashboard/alert change; coordinate with ops |
| 7 | MxGateway: rename Meter `"MxGateway.Server"` → `"ZB.MOM.WW.MxGateway"` (Gap N1) | MxGateway | P3 | XS | Med | Breaking Prometheus label change; coordinate with ops |
| 8 | All: standard instrumentation via `AddZbTelemetry` options (Gap S1) | all 3 | P2 | XS | Low | Automatic on `AddZbTelemetry` adoption; no extra code |
| 9 | ScadaBridge: define `ScadaBridgeTelemetry` + first `scadabridge.*` instruments (Gaps M2, I1) | ScadaBridge | P2 | M | Low | Application-specific work; tracked in ScadaBridge repo |
| 10 | OtOpcUa: wire OTLP exporter alongside Prometheus (§2, OTLP export row) | OtOpcUa | P3 | S | Low | Opt-in via `AddZbTelemetry` options; no code rewrite |
| 11 | OtOpcUa: fix the tracing no-op (spans recorded, no exporter) | OtOpcUa | P2 | S | Low | Wire OTLP or at minimum document the gap |

**Sequencing:** Items #1 and #2 are the in-pass adoption (Task #9, MxGateway only). Items #3–#5
are the follow-on OtOpcUa/ScadaBridge adoptions; they are independent of each other and can land
in any order. Items #6–#7 (unit/naming convergence) are breaking changes requiring ops
coordination; defer until dashboards can be updated. Items #8–#11 are cleanups that bundle
naturally with #3–#5.

This mirrors the Auth and UI-Theme pattern: the shared library is built first; adoption by each
app is opt-in and tracked here, not forced.

## Decisions still open

- Canonical `SiteId` and `NodeRole` config binding path: ScadaBridge reads from its own config
  hierarchy; `AddZbSerilog` must accept the values directly (caller-supplied) rather than reading
  from a fixed config section, to remain project-agnostic.

## Decisions settled (no longer open)

- **Prometheus vs OTLP default (SETTLED):** `AddZbTelemetry` defaults to Prometheus (matching
  OtOpcUa's existing `/metrics` posture). OTLP is opt-in via
  `ZbTelemetryOptions.Exporter = ZbExporter.Otlp`. See `spec/SPEC.md` §4 and shared-contract
  `ZbTelemetryOptions.Exporter`.
- **`ms`→`s` conversion and Meter rename bundling (SETTLED, deferred):** Both the histogram
  unit migration (Gap U1) and the Meter rename (Gap N1) are deferred from the initial MxGateway
  adoption (Task #9). They are breaking dashboard/alert changes requiring ops coordination and
  are tracked as separate backlog items #6 and #7 in the adoption backlog above.