scadaproj/code-reviews/Telemetry/findings.md
Joseph Doherty 7ae25f8510 Re-stamp Telemetry-002/003 resolutions: nested redaction implemented in 05cc62a
Telemetry-002 was first resolved by documenting the scalar-only limitation; it is now
implemented (recursive nested redaction). Updated the two resolution notes to record
05cc62a and the replaced limitation test, preserving the audit trail. README unchanged
(still 0 pending / 35 total).
2026-06-01 12:13:05 -04:00

Code Review — Telemetry

| Field | Value |
|---|---|
| Library | ZB.MOM.WW.Telemetry/ |
| Packages | ZB.MOM.WW.Telemetry, ZB.MOM.WW.Telemetry.Serilog |
| Component spec | components/observability/spec/SPEC.md |
| Shared contract | components/observability/shared-contract/ZB.MOM.WW.Telemetry.md |
| Status | Reviewed |
| Last reviewed | 2026-06-01 |
| Reviewer | Claude (automated baseline) |
| Commit reviewed | 5f75cd4 |
| Open findings | 0 |

Summary

The library is small, focused, and well-structured: two packages with a clean Serilog/OTel boundary (the Serilog.* stack appears only in the .Serilog package; the core package depends only on OTel and an ASP.NET Core framework reference), correct argument validation, deliberately sealed types, thorough XML docs, and an intentional no-process-global-state design for AddZbSerilog that is well covered by MultiHostTests. The identity triple, Resource omission rules, exporter wiring (Prometheus always-on, OTLP additive), and trace/log correlation all match the spec's intent and are exercised by the 19 tests.

The most material problems are in the redaction seam — the one component the review brief flags as security-critical. RedactionEnricher honours only replacement of scalar properties: it silently ignores the redactor removing a key (a documented capability of ILogRedactor), and it cannot see inside destructured/structured property values, so a secret logged as a field of {@Object} is never scrubbed. Both let secrets reach sinks despite a conforming redactor. Secondary themes: a spec drift around an undocumented service.instance.id Resource attribute, two hand-maintained Resource-attribute builders that can drift apart, and a stale doc-comment on MapZbMetrics. Tests are solid for the happy paths but have no coverage for redactor removal or structured-value redaction.

Checklist coverage

| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | | Redactor "remove" path is a no-op (Telemetry-001); structured values opaque to redactor (Telemetry-002). |
| 2 | Public API surface & compatibility | | Surface minimal, sealed, nullable-correct. ZbResource.InstanceId is an added public member not in the contract (Telemetry-004). |
| 3 | Concurrency & thread safety | | No issues found. Enrichers stateless; Lazy uses ExecutionAndPublication; Activity.Current is async-local. |
| 4 | Error handling & resilience | | Guard clauses present. new Uri(OtlpEndpoint) can throw late on malformed input (Telemetry-006). |
| 5 | Security & secret handling | | Redaction gaps (Telemetry-001/002) are security-relevant: secrets can survive a conforming redactor. |
| 6 | Performance & resource management | | Per-event dictionary snapshot when a redactor is registered (Telemetry-007); acceptable but noted. |
| 7 | Spec & shared-contract adherence | | Undocumented service.instance.id attribute (Telemetry-004); two Resource builders that can drift (Telemetry-005). |
| 8 | Packaging, dependencies & project layout | | No issues found. Serilog stack confined to .Serilog; central versions; correct net10.0; framework ref justified. |
| 9 | Testing coverage | | No tests for redactor removal or structured-value redaction (Telemetry-003). |
| 10 | Documentation & XML docs | | MapZbMetrics doc-comment is stale: claims "only valid when Exporter = Prometheus" (Telemetry-008). |

Findings

Telemetry-001 — RedactionEnricher ignores property removal, leaving secrets in the event

| | |
|---|---|
| Severity | High |
| Category | Security & secret handling |
| Status | Resolved |
| Location | ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/RedactionEnricher.cs:49-67 |

Description

ILogRedactor.Redact is documented to let a project "remove or replace any sensitive values" (ILogRedactor.cs:13 and the XML doc on the interface method: "remove or replace"; the shared contract repeats "Remove or replace any sensitive values"). RedactionEnricher builds a snapshot dictionary, hands it to the redactor, then writes back only via AddOrUpdateProperty for entries that remain in the snapshot and HasChanged:

foreach (var entry in snapshot)
{
    if (HasChanged(logEvent, entry.Key, entry.Value))
        logEvent.AddOrUpdateProperty(propertyFactory.CreateProperty(entry.Key, entry.Value));
}

If a redactor removes a key from the dictionary (properties.Remove("apiKey")) — the most natural way to implement "must not leave the process" — that key simply no longer appears in the write-back loop, so the original property is never removed from logEvent. The secret reaches every sink unredacted, even though the redactor did exactly what its contract permits. This defeats the seam's stated operational guarantee ("secrets never leave the process in log events") for any removal-style redactor.

Recommendation

After calling the redactor, reconcile deletions: for each property key present on the original logEvent but absent from the returned snapshot, call logEvent.RemovePropertyIfPresent(key). (Capture the original key set before mutation, then diff.) Add a test asserting a removing redactor scrubs the property (see Telemetry-003).
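
The reconciliation recommended above can be sketched with plain BCL collections. This is a minimal sketch under stated assumptions, not the library's code: the event's property bag is modelled as a dictionary so the example runs without Serilog, and `RedactionReconciler` and `Apply` are hypothetical names. In the real enricher the two mutation points would be Serilog's `LogEvent.RemovePropertyIfPresent` and `LogEvent.AddOrUpdateProperty`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: models the event's property bag as a plain dictionary
// so the sketch runs without a Serilog reference.
public static class RedactionReconciler
{
    public static void Apply(
        IDictionary<string, object?> eventProperties,
        Action<IDictionary<string, object?>> redact)
    {
        // Capture the original key set before the redactor can mutate anything.
        var originalKeys = eventProperties.Keys.ToList();

        // The redactor works on a copy, as the snapshot-based enricher does.
        var snapshot = new Dictionary<string, object?>(eventProperties);
        redact(snapshot);

        // Deletions: a key the redactor dropped must also leave the event.
        // (Serilog equivalent: logEvent.RemovePropertyIfPresent(key).)
        foreach (var key in originalKeys)
        {
            if (!snapshot.ContainsKey(key))
                eventProperties.Remove(key);
        }

        // Replacements and additions: write back only what actually changed.
        // (Serilog equivalent: logEvent.AddOrUpdateProperty(...).)
        foreach (var entry in snapshot)
        {
            if (!eventProperties.TryGetValue(entry.Key, out var current)
                || !Equals(current, entry.Value))
            {
                eventProperties[entry.Key] = entry.Value;
            }
        }
    }
}
```

The key point is the first loop: without the diff against the original key set, a redactor that calls `Remove` has no effect on the event, which is the defect this finding describes.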

Resolution

Resolved in 544a6dd, 2026-06-01 — RedactionEnricher now captures the original property key set and calls RemovePropertyIfPresent for any key the redactor dropped from the snapshot, so a removing redactor scrubs the property; covered by a new removing-redactor test.

Telemetry-002 — Redactor cannot inspect or scrub destructured/structured property values

| | |
|---|---|
| Severity | Medium |
| Category | Security & secret handling |
| Status | Resolved |
| Location | ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/RedactionEnricher.cs:49-55 |

Description

The snapshot only unwraps ScalarValue; every other LogEventPropertyValue (StructureValue from {@Object}, SequenceValue, DictionaryValue) is passed to the redactor as the raw Serilog wrapper object:

snapshot[property.Key] = property.Value is ScalarValue scalar ? scalar.Value : property.Value;

A project redactor written against the seam (IDictionary<string, object?> of "values") therefore sees an opaque StructureValue for a destructured payload — it cannot read or mask a secret field inside a logged object (e.g. logger.Information("{@Command}", cmd) where cmd.ApiKey is sensitive). MxGateway's reference redactor specifically guards "which command payloads must not leave the process" (per ILogRedactor XML doc and the contract), which is precisely the destructured-object case. The seam silently cannot meet that requirement; the redactor only works for top-level scalar properties.

Recommendation

Document the seam's actual reach (scalar top-level properties only) on ILogRedactor and in the shared contract, and/or recursively project StructureValue/SequenceValue/DictionaryValue into the snapshot and rebuild them on write-back so nested fields are redactable. At minimum, make the limitation explicit so consumers do not assume nested payloads are scrubbed when they are not.

Resolution

Resolved in 544a6dd, 2026-06-01 (documented the scalar-only limitation), then superseded by 05cc62a, 2026-06-01, which implements nested redaction. RedactionEnricher now projects each structured value into a mutable nested view the redactor descends into recursively (StructureValue → IDictionary<string,object?>, SequenceValue → IList<object?>, DictionaryValue → IDictionary<string,object?>), so a field nested inside a {@Object} can be masked or removed. The Project/Rebuild round-trip preserves StructureValue.TypeTag and original dictionary keys, and a structural ValueEquals skips write-back for properties the redactor left untouched (no reallocation; scalar fast path retained). The earlier documented-limitation wording on the ILogRedactor XML doc, shared contract, and README was replaced to document the recursive reach.
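
From the consumer's side, a redactor written against the nested view described above might look like the following sketch. It assumes only what the resolution states (nested objects arrive as `IDictionary<string, object?>` and sequences as `IList<object?>`); `NestedMasker`, `Mask`, the `"***"` replacement, and matching by key name are all hypothetical choices for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical redactor helper operating on the projected view:
// nested objects are IDictionary<string, object?>, sequences are IList<object?>.
public static class NestedMasker
{
    public static void Mask(IDictionary<string, object?> properties, ISet<string> sensitiveKeys)
    {
        // Copy the keys so values can be reassigned while iterating.
        foreach (var key in new List<string>(properties.Keys))
        {
            if (sensitiveKeys.Contains(key))
                properties[key] = "***";
            else
                MaskValue(properties[key], sensitiveKeys);
        }
    }

    private static void MaskValue(object? value, ISet<string> sensitiveKeys)
    {
        switch (value)
        {
            case IDictionary<string, object?> nested:
                // Descend into a destructured object or dictionary.
                Mask(nested, sensitiveKeys);
                break;
            case IList<object?> items:
                // Descend into each element of a sequence.
                foreach (var item in items)
                    MaskValue(item, sensitiveKeys);
                break;
            // Scalars that are not under a sensitive key are left untouched.
        }
    }
}
```

With this shape, a payload logged as `{@Command}` whose `ApiKey` field is sensitive is masked at any depth, which is the case the original scalar-only snapshot could not reach.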

Telemetry-003 — No tests for redactor removal or structured-value redaction

| | |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | ZB.MOM.WW.Telemetry/tests/ZB.MOM.WW.Telemetry.Serilog.Tests/RedactionTests.cs:33-69 |

Description

RedactionTests covers exactly two redaction behaviours: a registered redactor replacing a scalar value, and a no-op when none is registered. The FakeRedactor only ever reassigns properties["apiKey"]. There is no test that a redactor which removes a key actually scrubs it (the Telemetry-001 defect would have been caught), and no test that a redactor can mask a field of a destructured/structured property (Telemetry-002). For a seam whose entire purpose is secret containment, the most security-relevant behaviours are untested.

Recommendation

Add tests: (a) a redactor calling properties.Remove(key) results in the property being absent from the emitted LogEvent; (b) a redactor attempting to mask a nested field of a {@Object} payload, asserting the documented behaviour (whichever resolution Telemetry-002 takes). These should fail today and pin the fixes.

Resolution

Resolved in 544a6dd, 2026-06-01, then extended in 05cc62a — added Removing_redactor_scrubs_the_property_from_the_event (red→green for Telemetry-001), a Resource-attribute parity test, and (for the Telemetry-002 implementation) a nested-reach suite: mask and remove a field inside a destructured {@Object}, mask a sequence element, mask a dictionary value, mask a field two levels deep, and an untouched-structure-survives check. The earlier Redactor_cannot_reach_a_field_inside_a_destructured_object limitation test was replaced.

Telemetry-004 — service.instance.id Resource attribute is undocumented in spec and contract

| | |
|---|---|
| Severity | Low |
| Category | Spec & shared-contract adherence |
| Status | Resolved |
| Location | ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbResource.cs:19-45 |

Description

ZbResource adds a service.instance.id attribute (deterministic MachineName:ProcessId) to the Resource, and exposes it as a new public member ZbResource.InstanceId. The normalized Resource attribute set is enumerated exhaustively in two authoritative docs — SPEC.md §2 and METRIC-CONVENTIONS.md §4 — and neither lists service.instance.id; the shared contract (ZB.MOM.WW.Telemetry.md) likewise documents ZbResource.Build as populating only service.name/namespace/version/site.id/node.role/host.name and does not mention an InstanceId member. The attribute itself is a reasonable, standards-aligned improvement (and disabling the OTel SDK's random-GUID default is sensible for cross-signal correlation), but it is a silent divergence: the spec/contract are now stale relative to the code. Per REVIEW-PROCESS §2.7, both directions of drift must be flagged.

Recommendation

Add service.instance.id (with the MachineName:ProcessId rationale) to the Resource table in SPEC.md §2 and METRIC-CONVENTIONS.md §4, and document the public ZbResource.InstanceId member in the shared contract, so the normalized spec and the code agree.

Resolution

Resolved in 544a6dd, 2026-06-01 — kept the attribute (documented the MachineName:ProcessId rationale) and added service.instance.id to the Resource tables in SPEC.md §2 and METRIC-CONVENTIONS.md §4, plus the ZbResource.InstanceId member to the shared contract; spec and code now agree.

Telemetry-005 — Two hand-maintained Resource-attribute builders can silently drift

| | |
|---|---|
| Severity | Low |
| Category | Spec & shared-contract adherence |
| Status | Resolved |
| Location | ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbResource.cs:38-64, ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/ZbSerilogConfig.cs:125-151 |

Description

The Resource attached to metrics/traces is built by ZbResource.Configure (via the OTel AddService + AddAttributes API), while the Resource attached to the OTLP log sink is built independently by ZbSerilogConfig.BuildResourceAttributes (a hand-rolled Dictionary<string, object>). The two currently agree, but they enumerate the same six/seven attributes in two places with two different mechanisms, so a future change to one (a new attribute, a renamed key, a changed omission rule) will silently desynchronize logs from metrics/traces and break the cross-signal correlation that the library's "unifying hinge" depends on. There is no test asserting parity between the two attribute sets.

Recommendation

Derive both from a single source of truth — e.g. have ZbResource expose the canonical attribute map (already mostly the shape BuildResourceAttributes returns) and have the Serilog sink consume it — or add a parity test that asserts the two attribute sets are key-for-key identical for a representative options object.
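
A single-source-of-truth builder along those lines could look like the sketch below. Everything here is illustrative: `ResourceOptions`, `CanonicalResource`, and `AddIfSet` are hypothetical names, the option properties are assumed, and the omission rule (unset optional values produce no attribute) is an assumption rather than the spec's actual rule. The attribute keys are the ones this review names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical options shape; the real ZbTelemetryOptions may differ.
public sealed record ResourceOptions(
    string ServiceName,
    string? ServiceNamespace = null,
    string? ServiceVersion = null,
    string? SiteId = null,
    string? NodeRole = null);

public static class CanonicalResource
{
    // Single source of truth: every consumer derives its Resource from this map.
    public static IReadOnlyDictionary<string, object> BuildAttributes(ResourceOptions options)
    {
        var attributes = new Dictionary<string, object>
        {
            ["service.name"] = options.ServiceName,
            ["host.name"] = Environment.MachineName,
            ["service.instance.id"] = $"{Environment.MachineName}:{Environment.ProcessId}",
        };

        // Assumed omission rule: unset optional values produce no attribute.
        AddIfSet(attributes, "service.namespace", options.ServiceNamespace);
        AddIfSet(attributes, "service.version", options.ServiceVersion);
        AddIfSet(attributes, "site.id", options.SiteId);
        AddIfSet(attributes, "node.role", options.NodeRole);
        return attributes;
    }

    private static void AddIfSet(IDictionary<string, object> target, string key, string? value)
    {
        if (!string.IsNullOrWhiteSpace(value))
            target[key] = value;
    }
}
```

If both the OTel Resource and the log-sink Resource are populated from the one `BuildAttributes` result, a new or renamed key cannot appear in one signal and not the other, and a parity test becomes a cheap regression guard instead of the only line of defence.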

Resolution

Resolved in 544a6dd, 2026-06-01 — introduced ZbResource.BuildAttributes as the single source of truth; ZbResource.Configure (OTel SDK) and ZbSerilogConfig.BuildResourceAttributes (OTLP log sink) now both derive from it, and a parity test asserts the two sets are identical.

Telemetry-006 — Malformed OtlpEndpoint throws UriFormatException late, with no context

| | |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbTelemetryExtensions.cs:127-135 |

Description

ConfigureOtlp does otlp.Endpoint = new Uri(options.OtlpEndpoint) with no validation. A malformed endpoint string (typo, missing scheme) throws a raw UriFormatException deep inside OTel exporter construction at host-build time, with no mention of which option was at fault. BuildOptions already fails fast and clearly for a missing ServiceName, but does not validate that OtlpEndpoint is a well-formed absolute URI when Exporter == Otlp (nor that it is non-empty — an Otlp exporter with a null endpoint is silently registered and points nowhere). The Serilog path (ZbSerilogConfig) has the same untyped string→endpoint handoff.

Recommendation

In BuildOptions, when Exporter == ZbExporter.Otlp, validate OtlpEndpoint with Uri.TryCreate(..., UriKind.Absolute, out _) and throw an ArgumentException naming the option (consistent with the existing ServiceName guard) rather than letting a bare UriFormatException escape later.
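
The guard recommended above might be sketched as follows. `ExporterKind` and `OtlpEndpointGuard` are hypothetical stand-ins for the library's own enum and validator, and the exception messages are illustrative; only `Uri.TryCreate` and `ArgumentException` are real BCL APIs here.

```csharp
using System;

// Hypothetical stand-ins for the library's exporter enum and validator.
public enum ExporterKind { Prometheus, Otlp }

public static class OtlpEndpointGuard
{
    public static void Validate(ExporterKind exporter, string? otlpEndpoint)
    {
        // Prometheus-only configurations do not need an OTLP endpoint.
        if (exporter != ExporterKind.Otlp)
            return;

        // Fail fast on a missing endpoint instead of registering an exporter
        // that points nowhere.
        if (string.IsNullOrWhiteSpace(otlpEndpoint))
            throw new ArgumentException(
                "OtlpEndpoint is required when Exporter is Otlp.", nameof(otlpEndpoint));

        // Fail fast on a malformed endpoint, naming the option at fault.
        if (!Uri.TryCreate(otlpEndpoint, UriKind.Absolute, out _))
            throw new ArgumentException(
                $"OtlpEndpoint '{otlpEndpoint}' is not a well-formed absolute URI.",
                nameof(otlpEndpoint));
    }
}
```

Note that `Uri.TryCreate` with `UriKind.Absolute` accepts any syntactically valid scheme, so a stricter guard could additionally inspect `Uri.Scheme` if only certain schemes should be allowed.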

Resolution

Resolved in 544a6dd, 2026-06-01 — added ZbTelemetryOptionsValidator.Validate, called from both BuildOptions and AddZbSerilog: when Exporter == Otlp it requires a non-empty, well-formed absolute OtlpEndpoint and throws a named ArgumentException (no-op for Prometheus); covered by three new tests.

Telemetry-007 — Redaction snapshot allocates a dictionary on every log event

| | |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/RedactionEnricher.cs:49-67 |

Description

When an ILogRedactor is registered, Enrich allocates a new Dictionary<string, object?>(logEvent.Properties.Count), copies every property into it, and then iterates again to diff/write-back — on every single log event, across every logging thread. Enrichers are on the hottest path in the library (they run for each event the level filter admits). The early-return when no redactor is registered keeps the common case free, so the cost is borne only by redaction-enabled consumers (MxGateway), but for a high-volume gateway this is non-trivial steady-state allocation/GC pressure.

Recommendation

Consider redacting in place against logEvent.Properties without a full snapshot copy (e.g. only materialize replacements for keys the redactor touches), or short-circuit when the event has no properties. At minimum, document the per-event cost so consumers can weigh enabling redaction on very hot loggers. Acceptable as-is given redaction is opt-in and security-first.

Resolution

Resolved in 544a6dd, 2026-06-01 — Enrich now short-circuits before any snapshot allocation when the event has no properties (and still early-returns when no redactor is registered), so the per-event dictionary copy is only paid when there is actually something to redact.

Telemetry-008 — MapZbMetrics XML doc claims it is "only valid when Exporter = Prometheus" — stale

| | |
|---|---|
| Severity | Low |
| Category | Documentation & XML docs |
| Status | Resolved |
| Location | ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbMetricsEndpointExtensions.cs:11-14 |

Description

The XML doc on MapZbMetrics states it is "Only valid when ZbTelemetryOptions.Exporter = ZbExporter.Prometheus." That contradicts the actual wiring: ApplyMetricsExporter (ZbTelemetryExtensions.cs:107-116) always calls AddPrometheusExporter() regardless of the Exporter setting — OTLP is purely additive. The library's own README ("Prometheus is always wired for metrics regardless of the Exporter setting") and the test AddZbTelemetry_OtlpExporter_StillServesPrometheusEndpoint both confirm /metrics works under Exporter = Otlp. The doc-comment therefore tells consumers the opposite of the real (and intended) behaviour and could lead them to wrongly believe MapZbMetrics is a no-op under OTLP. The same stale wording is mirrored in the shared contract (ZB.MOM.WW.Telemetry.md, MapZbMetrics summary).

Recommendation

Update the doc-comment to state that the Prometheus exporter is always registered and MapZbMetrics is valid under any Exporter value (Prometheus is always-on; OTLP is an overlay). Align the shared-contract summary for MapZbMetrics to match.

Resolution

Resolved in 544a6dd, 2026-06-01 — rewrote the MapZbMetrics XML doc to state it is valid under any Exporter value (Prometheus always-on; OTLP additive overlay) and aligned the matching shared-contract summary.