scadaproj/code-reviews/Telemetry/findings.md
Joseph Doherty 7ae25f8510 Re-stamp Telemetry-002/003 resolutions: nested redaction implemented in 05cc62a
Telemetry-002 was first resolved by documenting the scalar-only limitation; it is now
implemented (recursive nested redaction). Updated the two resolution notes to record
05cc62a and the replaced limitation test, preserving the audit trail. README unchanged
(still 0 pending / 35 total).
2026-06-01 12:13:05 -04:00


# Code Review — Telemetry
| Field | Value |
|-------|-------|
| Library | `ZB.MOM.WW.Telemetry/` |
| Packages | `ZB.MOM.WW.Telemetry`, `ZB.MOM.WW.Telemetry.Serilog` |
| Component spec | `components/observability/spec/SPEC.md` |
| Shared contract | `components/observability/shared-contract/ZB.MOM.WW.Telemetry.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-01 |
| Reviewer | Claude (automated baseline) |
| Commit reviewed | `5f75cd4` |
| Open findings | 0 |
## Summary
The library is small, focused, and well-structured: two packages with a clean Serilog/OTel
boundary (the `Serilog.*` stack appears only in the `.Serilog` package; the core package is
pure OTel + ASP.NET Core framework reference), correct argument validation, intentionally
`sealed` types, thorough XML docs, and a deliberate no-process-global-state design for
`AddZbSerilog` that is well covered by `MultiHostTests`. The identity triple, Resource
omission rules, exporter wiring (Prometheus always-on, OTLP additive), and trace/log
correlation all match the spec's intent and are exercised by the 19 tests.

The most material problems are in the redaction seam, the one component the review brief
flags as security-critical. `RedactionEnricher` honours only *replacement* of scalar
properties: it silently ignores the redactor **removing** a key (a documented capability of
`ILogRedactor`), and it cannot see inside destructured/structured property values, so a
secret logged as a field of `{@Object}` is never scrubbed. Both let secrets reach sinks
despite a conforming redactor. Secondary themes: a spec drift around an undocumented
`service.instance.id` Resource attribute, two hand-maintained Resource-attribute builders
that can drift apart, and a stale doc-comment on `MapZbMetrics`. Tests are solid for the
happy paths but have no coverage for redactor removal or structured-value redaction.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Redactor "remove" path is a no-op (Telemetry-001); structured values opaque to redactor (Telemetry-002). |
| 2 | Public API surface & compatibility | ☑ | Surface minimal, `sealed`, nullable-correct. `ZbResource.InstanceId` is an added public member not in the contract (Telemetry-004). |
| 3 | Concurrency & thread safety | ☑ | No issues found. Enrichers stateless; `Lazy` uses `ExecutionAndPublication`; `Activity.Current` is async-local. |
| 4 | Error handling & resilience | ☑ | Guard clauses present. `new Uri(OtlpEndpoint)` can throw late on malformed input (Telemetry-006). |
| 5 | Security & secret handling | ☑ | Redaction gaps (Telemetry-001/002) are security-relevant — secrets can survive a conforming redactor. |
| 6 | Performance & resource management | ☑ | Per-event dictionary snapshot when a redactor is registered (Telemetry-007); acceptable but noted. |
| 7 | Spec & shared-contract adherence | ☑ | Undocumented `service.instance.id` attribute (Telemetry-004); two Resource builders that can drift (Telemetry-005). |
| 8 | Packaging, dependencies & project layout | ☑ | No issues found. Serilog stack confined to `.Serilog`; central versions; correct net10.0; framework ref justified. |
| 9 | Testing coverage | ☑ | No tests for redactor removal or structured-value redaction (Telemetry-003). |
| 10 | Documentation & XML docs | ☑ | `MapZbMetrics` doc-comment is stale: claims "only valid when Exporter = Prometheus" (Telemetry-008). |
## Findings
### Telemetry-001 — `RedactionEnricher` ignores property removal, leaving secrets in the event
| | |
|--|--|
| Severity | High |
| Category | Security & secret handling |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/RedactionEnricher.cs:49-67` |

**Description**

`ILogRedactor.Redact` is documented to let a project "remove or replace any sensitive
values" (`ILogRedactor.cs:13` and the XML doc on the interface method: *"remove or replace"*;
the shared contract repeats *"Remove or replace any sensitive values"*). `RedactionEnricher`
builds a `snapshot` dictionary, hands it to the redactor, then writes back via
`AddOrUpdateProperty` only those entries that remain in the snapshot and for which `HasChanged` returns true:
```csharp
foreach (var entry in snapshot)
{
    if (HasChanged(logEvent, entry.Key, entry.Value))
        logEvent.AddOrUpdateProperty(propertyFactory.CreateProperty(entry.Key, entry.Value));
}
```
If a redactor *removes* a key from the dictionary (`properties.Remove("apiKey")`) — the most
natural way to implement "must not leave the process" — that key simply no longer appears in
the write-back loop, so the original property is **never removed from `logEvent`**. The
secret reaches every sink unredacted, even though the redactor did exactly what its contract
permits. This defeats the seam's stated operational guarantee ("secrets never leave the
process in log events") for any removal-style redactor.

**Recommendation**

After calling the redactor, reconcile deletions: for each property key present on the
original `logEvent` but absent from the returned `snapshot`, call
`logEvent.RemovePropertyIfPresent(key)`. (Capture the original key set before mutation, then
diff.) Add a test asserting a removing redactor scrubs the property (see Telemetry-003).
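
The reconciliation step can be sketched with plain dictionaries standing in for the Serilog event. This is a minimal illustration under stated assumptions, not the library's code: `RedactionWriteBack`, `Apply`, and the `redact` delegate are hypothetical names, and in the real enricher the removal would go through `logEvent.RemovePropertyIfPresent(key)` rather than `Dictionary.Remove`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the enricher's write-back step.
// `eventProperties` plays the role of the log event's property bag.
public static class RedactionWriteBack
{
    public static void Apply(
        IDictionary<string, object?> eventProperties,
        Action<IDictionary<string, object?>> redact)
    {
        // Capture the original key set before the redactor can mutate anything.
        var originalKeys = eventProperties.Keys.ToList();

        // The redactor works on a copy so it can add, replace, or remove freely.
        var snapshot = new Dictionary<string, object?>(eventProperties);
        redact(snapshot);

        // Deletions: keys the redactor dropped must also leave the event.
        foreach (var key in originalKeys)
        {
            if (!snapshot.ContainsKey(key))
                eventProperties.Remove(key);
        }

        // Replacements and additions: write back only what actually changed.
        foreach (var (key, value) in snapshot)
        {
            if (!eventProperties.TryGetValue(key, out var current) || !object.Equals(current, value))
                eventProperties[key] = value;
        }
    }
}
```

With a redactor that calls `p.Remove("apiKey")`, the `apiKey` entry is gone from the event bag once `Apply` returns, while entries the redactor left alone keep their original values.
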

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — `RedactionEnricher` now captures the original property
key set and calls `RemovePropertyIfPresent` for any key the redactor dropped from the snapshot,
so a removing redactor scrubs the property; covered by a new removing-redactor test.
### Telemetry-002 — Redactor cannot inspect or scrub destructured/structured property values
| | |
|--|--|
| Severity | Medium |
| Category | Security & secret handling |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/RedactionEnricher.cs:49-55` |

**Description**

The snapshot only unwraps `ScalarValue`; every other `LogEventPropertyValue`
(`StructureValue` from `{@Object}`, `SequenceValue`, `DictionaryValue`) is passed to the
redactor as the raw Serilog wrapper object:
```csharp
snapshot[property.Key] = property.Value is ScalarValue scalar ? scalar.Value : property.Value;
```
A project redactor written against the seam (`IDictionary<string, object?>` of "values")
therefore sees an opaque `StructureValue` for a destructured payload — it cannot read or
mask a secret *field inside* a logged object (e.g. `logger.Information("{@Command}", cmd)`
where `cmd.ApiKey` is sensitive). MxGateway's reference redactor is specifically concerned with
"which command payloads must not leave the process" (per `ILogRedactor` XML doc and the
contract), which is precisely the destructured-object case. The seam cannot meet that
requirement and fails silently: the redactor only works for top-level scalar properties.

**Recommendation**

Document the seam's actual reach (scalar top-level properties only) on `ILogRedactor` and in
the shared contract, and/or recursively project `StructureValue`/`SequenceValue`/
`DictionaryValue` into the snapshot and rebuild them on write-back so nested fields are
redactable. At minimum, make the limitation explicit so consumers do not assume nested
payloads are scrubbed when they are not.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 (documented the scalar-only limitation), then **superseded by
`05cc62a`, 2026-06-01 — nested redaction implemented.** `RedactionEnricher` now projects each
structured value into a mutable nested view the redactor descends into recursively
(`StructureValue` → `IDictionary<string,object?>`, `SequenceValue` → `IList<object?>`,
`DictionaryValue` → `IDictionary<string,object?>`), so a field nested inside a `{@Object}` can be
masked or removed. The Project/Rebuild round-trip preserves `StructureValue.TypeTag` and original
dictionary keys, and a structural `ValueEquals` skips write-back for properties the redactor left
untouched (no reallocation; scalar fast path retained). The earlier documented-limitation wording on
the `ILogRedactor` XML doc, shared contract, and README was replaced to document the recursive reach.
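
From the redactor's side, a nested view of this shape might be consumed as in the sketch below. It assumes only what the resolution states, namely that nested nodes arrive as `IDictionary<string, object?>` and `IList<object?>`. `NestedMasker`, `Mask`, the `"***"` placeholder, and the match-by-key-name rule are illustrative assumptions, and the Project/Rebuild round-trip back into Serilog values is not shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical redactor helper: walks the nested view and masks any entry
// whose key is listed in `sensitiveKeys`, at any depth.
public static class NestedMasker
{
    public static void Mask(IDictionary<string, object?> node, ISet<string> sensitiveKeys)
    {
        // Copy the keys first so the dictionary is never modified while it
        // is being enumerated.
        foreach (var key in new List<string>(node.Keys))
        {
            if (sensitiveKeys.Contains(key))
                node[key] = "***";
            else
                Descend(node[key], sensitiveKeys);
        }
    }

    private static void Descend(object? value, ISet<string> sensitiveKeys)
    {
        switch (value)
        {
            case IDictionary<string, object?> child:
                // A destructured object or dictionary: recurse into its fields.
                Mask(child, sensitiveKeys);
                break;
            case IList<object?> items:
                // A sequence: inspect each element in turn.
                foreach (var item in items)
                    Descend(item, sensitiveKeys);
                break;
            // Scalars (and null) carry no nested fields; nothing to do.
        }
    }
}
```

Matching on key name alone is the simplest rule to show here; a project redactor could equally inspect values or remove entries instead of masking them.
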
### Telemetry-003 — No tests for redactor removal or structured-value redaction
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/tests/ZB.MOM.WW.Telemetry.Serilog.Tests/RedactionTests.cs:33-69` |

**Description**

`RedactionTests` covers exactly two redaction behaviours: a registered redactor replacing a
scalar value, and a no-op when none is registered. The `FakeRedactor` only ever *reassigns*
`properties["apiKey"]`. There is no test that a redactor which **removes** a key actually
scrubs it (such a test would have caught the Telemetry-001 defect), and no test that a redactor can
mask a field of a destructured/structured property (Telemetry-002). For a seam whose entire
purpose is secret containment, the most security-relevant behaviours are untested.

**Recommendation**

Add tests: (a) a redactor calling `properties.Remove(key)` results in the property being
absent from the emitted `LogEvent`; (b) a redactor attempting to mask a nested field of a
`{@Object}` payload, asserting the documented behaviour (whichever resolution Telemetry-002
takes). These should fail today and pin the fixes.

**Resolution**

Resolved in `544a6dd`, 2026-06-01, then extended in `05cc62a` — added
`Removing_redactor_scrubs_the_property_from_the_event` (red→green for Telemetry-001), a
Resource-attribute parity test, and (for the Telemetry-002 implementation) a nested-reach suite:
mask and remove a field inside a destructured `{@Object}`, mask a sequence element, mask a
dictionary value, mask a field two levels deep, and an untouched-structure-survives check. The
earlier `Redactor_cannot_reach_a_field_inside_a_destructured_object` limitation test was replaced.
### Telemetry-004 — `service.instance.id` Resource attribute is undocumented in spec and contract
| | |
|--|--|
| Severity | Low |
| Category | Spec & shared-contract adherence |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbResource.cs:19-45` |

**Description**

`ZbResource` adds a `service.instance.id` attribute (deterministic `MachineName:ProcessId`)
to the Resource, and exposes it as a new public member `ZbResource.InstanceId`. The
normalized Resource attribute set is enumerated exhaustively in two authoritative docs —
`SPEC.md` §2 and `METRIC-CONVENTIONS.md` §4 — and **neither lists `service.instance.id`**;
the shared contract (`ZB.MOM.WW.Telemetry.md`) likewise documents `ZbResource.Build` as
populating only `service.name/namespace/version/site.id/node.role/host.name` and does not
mention an `InstanceId` member. The attribute itself is a reasonable, standards-aligned
improvement (and disabling the OTel SDK's random-GUID default is sensible for cross-signal
correlation), but it is a silent divergence: the spec/contract are now stale relative to the
code. Per REVIEW-PROCESS §2.7, both directions of drift must be flagged.

**Recommendation**

Add `service.instance.id` (with the `MachineName:ProcessId` rationale) to the Resource table
in `SPEC.md` §2 and `METRIC-CONVENTIONS.md` §4, and document the public `ZbResource.InstanceId`
member in the shared contract, so the normalized spec and the code agree.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — kept the attribute (documented the
`MachineName:ProcessId` rationale) and added `service.instance.id` to the Resource tables in
`SPEC.md` §2 and `METRIC-CONVENTIONS.md` §4, plus the `ZbResource.InstanceId` member to the shared
contract; spec and code now agree.
### Telemetry-005 — Two hand-maintained Resource-attribute builders can silently drift
| | |
|--|--|
| Severity | Low |
| Category | Spec & shared-contract adherence |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbResource.cs:38-64`, `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/ZbSerilogConfig.cs:125-151` |

**Description**

The Resource attached to metrics/traces is built by `ZbResource.Configure` (via the OTel
`AddService` + `AddAttributes` API), while the Resource attached to the OTLP *log* sink is
built independently by `ZbSerilogConfig.BuildResourceAttributes` (a hand-rolled
`Dictionary<string, object>`). The two currently agree, but they enumerate the same six/seven
attributes in two places with two different mechanisms, so a future change to one (a new
attribute, a renamed key, a changed omission rule) will silently desynchronize logs from
metrics/traces and break the cross-signal correlation that the library's role as a
"unifying hinge" depends on. There is no test asserting parity between the two attribute sets.

**Recommendation**

Derive both from a single source of truth — e.g. have `ZbResource` expose the canonical
attribute map (already mostly the shape `BuildResourceAttributes` returns) and have the
Serilog sink consume it — or add a parity test that asserts the two attribute sets are
key-for-key identical for a representative options object.
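
The parity-test option could look roughly like the helper below. It is a sketch only: `AttributeParity` and `Diff` are hypothetical names, and the two dictionaries stand in for whatever the two builders produce for the same options object.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for a parity test between two attribute sets.
public static class AttributeParity
{
    // Returns a readable list of differences; empty when the sets agree
    // key-for-key and value-for-value.
    public static IReadOnlyList<string> Diff(
        IReadOnlyDictionary<string, object> left,
        IReadOnlyDictionary<string, object> right)
    {
        var problems = new List<string>();

        // Visit every key seen on either side, in a stable order.
        foreach (var key in left.Keys.Union(right.Keys).OrderBy(k => k, StringComparer.Ordinal))
        {
            var inLeft = left.TryGetValue(key, out var l);
            var inRight = right.TryGetValue(key, out var r);

            if (!inLeft) problems.Add($"missing on left: {key}");
            else if (!inRight) problems.Add($"missing on right: {key}");
            else if (!object.Equals(l, r)) problems.Add($"value differs: {key}");
        }

        return problems;
    }
}
```

A test would then assert that `Diff` returns an empty list, and the returned messages name the drifting attribute when it does not.
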

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — introduced `ZbResource.BuildAttributes` as the single
source of truth; `ZbResource.Configure` (OTel SDK) and `ZbSerilogConfig.BuildResourceAttributes`
(OTLP log sink) now both derive from it, and a parity test asserts the two sets are identical.
### Telemetry-006 — Malformed `OtlpEndpoint` throws `UriFormatException` late, with no context
| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbTelemetryExtensions.cs:127-135` |

**Description**

`ConfigureOtlp` does `otlp.Endpoint = new Uri(options.OtlpEndpoint)` with no validation. A
malformed endpoint string (typo, missing scheme) throws a raw `UriFormatException` deep
inside OTel exporter construction at host-build time, with no mention of which option was at
fault. `BuildOptions` already fails fast and clearly for a missing `ServiceName`, but does
not validate that `OtlpEndpoint` is a well-formed absolute URI when `Exporter == Otlp` (nor
that it is non-empty — an Otlp exporter with a null endpoint is silently registered and
points nowhere). The Serilog path (`ZbSerilogConfig`) has the same untyped string→endpoint
handoff.

**Recommendation**

In `BuildOptions`, when `Exporter == ZbExporter.Otlp`, validate `OtlpEndpoint` with
`Uri.TryCreate(..., UriKind.Absolute, out _)` and throw an `ArgumentException` naming the
option (consistent with the existing `ServiceName` guard) rather than letting a bare
`UriFormatException` escape later.
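
A fail-fast guard along these lines might look like the following. It is a sketch, not the library's validator: `OtlpEndpointGuard` and `Require` are hypothetical names, and the exception messages are illustrative.

```csharp
using System;

// Hypothetical guard for the OTLP endpoint option.
public static class OtlpEndpointGuard
{
    // Returns the parsed endpoint, or throws an ArgumentException that names
    // the offending option instead of a bare UriFormatException.
    public static Uri Require(string? otlpEndpoint)
    {
        if (string.IsNullOrWhiteSpace(otlpEndpoint))
            throw new ArgumentException(
                "OtlpEndpoint is required when Exporter is Otlp.", nameof(otlpEndpoint));

        if (!Uri.TryCreate(otlpEndpoint, UriKind.Absolute, out var uri))
            throw new ArgumentException(
                $"OtlpEndpoint '{otlpEndpoint}' is not a well-formed absolute URI.", nameof(otlpEndpoint));

        return uri;
    }
}
```

One caveat worth checking before relying on this alone: a scheme-less `host:port` string may still parse as an absolute URI, with the host read as the scheme, so a stricter guard might also inspect `uri.Scheme`.
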

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — added `ZbTelemetryOptionsValidator.Validate`, called from
both `BuildOptions` and `AddZbSerilog`: when `Exporter == Otlp` it requires a non-empty,
well-formed absolute `OtlpEndpoint` and throws a named `ArgumentException` (no-op for Prometheus);
covered by three new tests.
### Telemetry-007 — Redaction snapshot allocates a dictionary on every log event
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/RedactionEnricher.cs:49-67` |

**Description**

When an `ILogRedactor` is registered, `Enrich` allocates a new
`Dictionary<string, object?>(logEvent.Properties.Count)`, copies every property into it, and
then iterates again to diff/write-back — on every single log event, across every logging
thread. Enrichers are on the hottest path in the library (they run for each event the level
filter admits). The early return when no redactor is registered keeps the common case allocation-free,
so the cost is borne only by redaction-enabled consumers (MxGateway), but for a high-volume
gateway this is non-trivial steady-state allocation and GC pressure.

**Recommendation**

Consider redacting in place against `logEvent.Properties` without a full snapshot copy (e.g.
only materialize replacements for keys the redactor touches), or short-circuit when the event
has no properties. At minimum, document the per-event cost so consumers can weigh enabling
redaction on very hot loggers. Acceptable as-is given redaction is opt-in and security-first.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — `Enrich` now short-circuits before any snapshot allocation
when the event has no properties (and still early-returns when no redactor is registered), so the
per-event dictionary copy is only paid when there is actually something to redact.
### Telemetry-008 — `MapZbMetrics` XML doc claims it is "only valid when Exporter = Prometheus" — stale
| | |
|--|--|
| Severity | Low |
| Category | Documentation & XML docs |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbMetricsEndpointExtensions.cs:11-14` |

**Description**

The XML doc on `MapZbMetrics` states it is "Only valid when
`ZbTelemetryOptions.Exporter = ZbExporter.Prometheus`." That contradicts the actual wiring:
`ApplyMetricsExporter` (`ZbTelemetryExtensions.cs:107-116`) **always** calls
`AddPrometheusExporter()` regardless of the `Exporter` setting — OTLP is purely additive.
The library's own README ("Prometheus is **always wired** for metrics regardless of the
`Exporter` setting") and the test `AddZbTelemetry_OtlpExporter_StillServesPrometheusEndpoint`
both confirm `/metrics` works under `Exporter = Otlp`. The doc-comment therefore tells
consumers the opposite of the real (and intended) behaviour and could lead them to wrongly
believe `MapZbMetrics` is a no-op under OTLP. The same stale wording is mirrored in the
shared contract (`ZB.MOM.WW.Telemetry.md`, `MapZbMetrics` summary).

**Recommendation**

Update the doc-comment to state that the Prometheus exporter is always registered and
`MapZbMetrics` is valid under any `Exporter` value (Prometheus is always-on; OTLP is an
overlay). Align the shared-contract summary for `MapZbMetrics` to match.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — rewrote the `MapZbMetrics` XML doc to state it is valid
under any `Exporter` value (Prometheus always-on; OTLP additive overlay) and aligned the matching
shared-contract summary.