# Code Review — Telemetry

| Field | Value |
|-------|-------|
| Library | `ZB.MOM.WW.Telemetry/` |
| Packages | `ZB.MOM.WW.Telemetry`, `ZB.MOM.WW.Telemetry.Serilog` |
| Component spec | `components/observability/spec/SPEC.md` |
| Shared contract | `components/observability/shared-contract/ZB.MOM.WW.Telemetry.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-01 |
| Reviewer | Claude (automated baseline) |
| Commit reviewed | `5f75cd4` |
| Open findings | 0 |

## Summary

The library is small, focused, and well-structured: two packages with a clean Serilog/OTel
boundary (the `Serilog.*` stack appears only in the `.Serilog` package; the core package is
pure OTel + ASP.NET Core framework reference), correct argument validation, deliberate
`sealed` types, thorough XML docs, and a deliberate no-process-global-state design for
`AddZbSerilog` that is well covered by `MultiHostTests`. The identity triple, Resource
omission rules, exporter wiring (Prometheus always-on, OTLP additive), and trace/log
correlation all match the spec's intent and are exercised by the 19 tests.

The most material problems are in the redaction seam — the one component the review brief
flags as security-critical. `RedactionEnricher` honours only *replacement* of scalar
properties: it silently ignores the redactor **removing** a key (a documented capability of
`ILogRedactor`), and it cannot see inside destructured/structured property values, so a
secret logged as a field of `{@Object}` is never scrubbed. Both let secrets reach sinks
despite a conforming redactor. Secondary themes: a spec drift around an undocumented
`service.instance.id` Resource attribute, two hand-maintained Resource-attribute builders
that can drift apart, and a stale doc-comment on `MapZbMetrics`. Tests are solid for the
happy paths but have no coverage for redactor removal or structured-value redaction.

## Checklist coverage

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Redactor "remove" path is a no-op (Telemetry-001); structured values opaque to redactor (Telemetry-002). |
| 2 | Public API surface & compatibility | ☑ | Surface minimal, `sealed`, nullable-correct. `ZbResource.InstanceId` is an added public member not in the contract (Telemetry-004). |
| 3 | Concurrency & thread safety | ☑ | No issues found. Enrichers stateless; `Lazy` uses `ExecutionAndPublication`; `Activity.Current` is async-local. |
| 4 | Error handling & resilience | ☑ | Guard clauses present. `new Uri(OtlpEndpoint)` can throw late on malformed input (Telemetry-006). |
| 5 | Security & secret handling | ☑ | Redaction gaps (Telemetry-001/002) are security-relevant — secrets can survive a conforming redactor. |
| 6 | Performance & resource management | ☑ | Per-event dictionary snapshot when a redactor is registered (Telemetry-007); acceptable but noted. |
| 7 | Spec & shared-contract adherence | ☑ | Undocumented `service.instance.id` attribute (Telemetry-004); two Resource builders that can drift (Telemetry-005). |
| 8 | Packaging, dependencies & project layout | ☑ | No issues found. Serilog stack confined to `.Serilog`; central versions; correct net10.0; framework ref justified. |
| 9 | Testing coverage | ☑ | No tests for redactor removal or structured-value redaction (Telemetry-003). |
| 10 | Documentation & XML docs | ☑ | `MapZbMetrics` doc-comment is stale: claims "only valid when Exporter = Prometheus" (Telemetry-008). |

## Findings

### Telemetry-001 — `RedactionEnricher` ignores property removal, leaving secrets in the event

| | |
|--|--|
| Severity | High |
| Category | Security & secret handling |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/RedactionEnricher.cs:49-67` |

**Description**

`ILogRedactor.Redact` is documented to let a project "remove or replace any sensitive
values" (`ILogRedactor.cs:13` and the XML doc on the interface method: *"remove or replace"*;
the shared contract repeats *"Remove or replace any sensitive values"*). `RedactionEnricher`
builds a `snapshot` dictionary, hands it to the redactor, then writes back only via
`AddOrUpdateProperty` for entries that remain in the snapshot and `HasChanged`:

```csharp
foreach (var entry in snapshot)
{
    if (HasChanged(logEvent, entry.Key, entry.Value))
        logEvent.AddOrUpdateProperty(propertyFactory.CreateProperty(entry.Key, entry.Value));
}
```

If a redactor *removes* a key from the dictionary (`properties.Remove("apiKey")`) — the most
natural way to implement "must not leave the process" — that key simply no longer appears in
the write-back loop, so the original property is **never removed from `logEvent`**. The
secret reaches every sink unredacted, even though the redactor did exactly what its contract
permits. This defeats the seam's stated operational guarantee ("secrets never leave the
process in log events") for any removal-style redactor.

**Recommendation**

After calling the redactor, reconcile deletions: for each property key present on the
original `logEvent` but absent from the returned `snapshot`, call
`logEvent.RemovePropertyIfPresent(key)`. (Capture the original key set before mutation, then
diff.) Add a test asserting a removing redactor scrubs the property (see Telemetry-003).

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — `RedactionEnricher` now captures the original property
key set and calls `RemovePropertyIfPresent` for any key the redactor dropped from the snapshot,
so a removing redactor scrubs the property; covered by a new removing-redactor test.

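The key-set diff recommended for Telemetry-001 can be illustrated without Serilog. The sketch below is not the code from `544a6dd`: it models the event's property bag as a plain dictionary and the redactor as a delegate, and `RedactionReconciler` and `Apply` are hypothetical names chosen for illustration. In the real enricher the removal step would go through `logEvent.RemovePropertyIfPresent(key)`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in: `eventProperties` models the log event's property bag,
// `redact` models a call to ILogRedactor.Redact on the snapshot.
public static class RedactionReconciler
{
    public static void Apply(
        IDictionary<string, object?> eventProperties,
        Action<IDictionary<string, object?>> redact)
    {
        // Capture the original key set before the redactor can mutate anything.
        var originalKeys = eventProperties.Keys.ToList();

        // The redactor works on a copy so that removals can be detected afterwards.
        var snapshot = new Dictionary<string, object?>(eventProperties);
        redact(snapshot);

        // Reconcile deletions: a key the redactor dropped must leave the event.
        foreach (var key in originalKeys)
        {
            if (!snapshot.ContainsKey(key))
                eventProperties.Remove(key);
        }

        // Write back replacements and additions, skipping unchanged values.
        foreach (var entry in snapshot)
        {
            if (!eventProperties.TryGetValue(entry.Key, out var current)
                || !object.Equals(current, entry.Value))
            {
                eventProperties[entry.Key] = entry.Value;
            }
        }
    }
}
```

With a redactor that removes `apiKey` and masks `token`, the event ends up without `apiKey`, with the masked `token`, and with every other property untouched.
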
### Telemetry-002 — Redactor cannot inspect or scrub destructured/structured property values

| | |
|--|--|
| Severity | Medium |
| Category | Security & secret handling |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/RedactionEnricher.cs:49-55` |

**Description**

The snapshot only unwraps `ScalarValue`; every other `LogEventPropertyValue`
(`StructureValue` from `{@Object}`, `SequenceValue`, `DictionaryValue`) is passed to the
redactor as the raw Serilog wrapper object:

```csharp
snapshot[property.Key] = property.Value is ScalarValue scalar ? scalar.Value : property.Value;
```

A project redactor written against the seam (`IDictionary<string, object?>` of "values")
therefore sees an opaque `StructureValue` for a destructured payload — it cannot read or
mask a secret *field inside* a logged object (e.g. `logger.Information("{@Command}", cmd)`
where `cmd.ApiKey` is sensitive). MxGateway's reference redactor specifically guards
"which command payloads must not leave the process" (per `ILogRedactor` XML doc and the
contract), which is precisely the destructured-object case. The seam silently cannot meet
that requirement; the redactor only works for top-level scalar properties.

**Recommendation**

Document the seam's actual reach (scalar top-level properties only) on `ILogRedactor` and in
the shared contract, and/or recursively project `StructureValue`/`SequenceValue`/
`DictionaryValue` into the snapshot and rebuild them on write-back so nested fields are
redactable. At minimum, make the limitation explicit so consumers do not assume nested
payloads are scrubbed when they are not.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 (documented the scalar-only limitation), then **superseded by
`05cc62a`, 2026-06-01 — nested redaction implemented.** `RedactionEnricher` now projects each
structured value into a mutable nested view the redactor descends into recursively
(`StructureValue` → `IDictionary<string,object?>`, `SequenceValue` → `IList<object?>`,
`DictionaryValue` → `IDictionary<string,object?>`), so a field nested inside a `{@Object}` can be
masked or removed. The Project/Rebuild round-trip preserves `StructureValue.TypeTag` and original
dictionary keys, and a structural `ValueEquals` skips write-back for properties the redactor left
untouched (no reallocation; scalar fast path retained). The earlier documented-limitation wording on
the `ILogRedactor` XML doc, shared contract, and README was replaced to document the recursive reach.

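For consumers, the practical effect of the nested view is that a redactor can walk the payload recursively. The following is a minimal sketch of such a walker, assuming only the view shapes named in the resolution above (`IDictionary<string, object?>` for structures and dictionaries, `IList<object?>` for sequences). `NestedRedaction`, its `Redact` entry point, and the sensitive key names are hypothetical; a real redactor would expose this logic through `ILogRedactor`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical consumer-side helper: masks every sensitive key at any depth.
public static class NestedRedaction
{
    // Illustrative key names only; a real project would supply its own list.
    private static readonly HashSet<string> SensitiveKeys =
        new(StringComparer.OrdinalIgnoreCase) { "ApiKey", "Password" };

    public static void Redact(IDictionary<string, object?> properties)
    {
        // Copy the keys first so entries can be reassigned while looping.
        foreach (var key in properties.Keys.ToList())
        {
            if (SensitiveKeys.Contains(key))
                properties[key] = "***";
            else
                Descend(properties[key]);
        }
    }

    private static void Descend(object? value)
    {
        switch (value)
        {
            case IDictionary<string, object?> nested: // structure or dictionary view
                Redact(nested);
                break;
            case IList<object?> sequence:             // sequence view
                foreach (var element in sequence)
                    Descend(element);
                break;
            // Scalars have no children, so there is nothing to descend into.
        }
    }
}
```

Masking by key name at every depth is the simplest policy to reason about. A redactor that needs to drop a field instead can call `Remove` on the nested dictionary at the same point.
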
### Telemetry-003 — No tests for redactor removal or structured-value redaction

| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/tests/ZB.MOM.WW.Telemetry.Serilog.Tests/RedactionTests.cs:33-69` |

**Description**

`RedactionTests` covers exactly two redaction behaviours: a registered redactor replacing a
scalar value, and a no-op when none is registered. The `FakeRedactor` only ever *reassigns*
`properties["apiKey"]`. There is no test that a redactor which **removes** a key actually
scrubs it (the Telemetry-001 defect would have been caught), and no test that a redactor can
mask a field of a destructured/structured property (Telemetry-002). For a seam whose entire
purpose is secret containment, the most security-relevant behaviours are untested.

**Recommendation**

Add tests: (a) a redactor calling `properties.Remove(key)` results in the property being
absent from the emitted `LogEvent`; (b) a redactor attempting to mask a nested field of a
`{@Object}` payload, asserting the documented behaviour (whichever resolution Telemetry-002
takes). These should fail today and pin the fixes.

**Resolution**

Resolved in `544a6dd`, 2026-06-01, then extended in `05cc62a` — added
`Removing_redactor_scrubs_the_property_from_the_event` (red→green for Telemetry-001), a
Resource-attribute parity test, and (for the Telemetry-002 implementation) a nested-reach suite:
mask and remove a field inside a destructured `{@Object}`, mask a sequence element, mask a
dictionary value, mask a field two levels deep, and an untouched-structure-survives check. The
earlier `Redactor_cannot_reach_a_field_inside_a_destructured_object` limitation test was replaced.

### Telemetry-004 — `service.instance.id` Resource attribute is undocumented in spec and contract

| | |
|--|--|
| Severity | Low |
| Category | Spec & shared-contract adherence |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbResource.cs:19-45` |

**Description**

`ZbResource` adds a `service.instance.id` attribute (deterministic `MachineName:ProcessId`)
to the Resource, and exposes it as a new public member `ZbResource.InstanceId`. The
normalized Resource attribute set is enumerated exhaustively in two authoritative docs —
`SPEC.md` §2 and `METRIC-CONVENTIONS.md` §4 — and **neither lists `service.instance.id`**;
the shared contract (`ZB.MOM.WW.Telemetry.md`) likewise documents `ZbResource.Build` as
populating only `service.name/namespace/version/site.id/node.role/host.name` and does not
mention an `InstanceId` member. The attribute itself is a reasonable, standards-aligned
improvement (and disabling the OTel SDK's random-GUID default is sensible for cross-signal
correlation), but it is a silent divergence: the spec/contract are now stale relative to the
code. Per REVIEW-PROCESS §2.7, both directions of drift must be flagged.

**Recommendation**

Add `service.instance.id` (with the `MachineName:ProcessId` rationale) to the Resource table
in `SPEC.md` §2 and `METRIC-CONVENTIONS.md` §4, and document the public `ZbResource.InstanceId`
member in the shared contract, so the normalized spec and the code agree.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — kept the attribute (documented the
`MachineName:ProcessId` rationale) and added `service.instance.id` to the Resource tables in
`SPEC.md` §2 and `METRIC-CONVENTIONS.md` §4, plus the `ZbResource.InstanceId` member to the shared
contract; spec and code now agree.

### Telemetry-005 — Two hand-maintained Resource-attribute builders can silently drift

| | |
|--|--|
| Severity | Low |
| Category | Spec & shared-contract adherence |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbResource.cs:38-64`, `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/ZbSerilogConfig.cs:125-151` |

**Description**

The Resource attached to metrics/traces is built by `ZbResource.Configure` (via the OTel
`AddService` + `AddAttributes` API), while the Resource attached to the OTLP *log* sink is
built independently by `ZbSerilogConfig.BuildResourceAttributes` (a hand-rolled
`Dictionary<string, object>`). The two currently agree, but they enumerate the same six/seven
attributes in two places with two different mechanisms, so a future change to one (a new
attribute, a renamed key, a changed omission rule) will silently desynchronize logs from
metrics/traces and break the cross-signal correlation the library's whole "unifying hinge"
depends on. There is no test asserting parity between the two attribute sets.

**Recommendation**

Derive both from a single source of truth — e.g. have `ZbResource` expose the canonical
attribute map (already mostly the shape `BuildResourceAttributes` returns) and have the
Serilog sink consume it — or add a parity test that asserts the two attribute sets are
key-for-key identical for a representative options object.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — introduced `ZbResource.BuildAttributes` as the single
source of truth; `ZbResource.Configure` (OTel SDK) and `ZbSerilogConfig.BuildResourceAttributes`
(OTLP log sink) now both derive from it, and a parity test asserts the two sets are identical.

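A key-for-key parity check of the kind recommended for Telemetry-005 could look like the sketch below. It is an assumption about shape, not the test added in `544a6dd`: `AttributeParity` and `Differences` are hypothetical names, and the two maps stand in for whatever the OTel and Serilog builders return.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for a parity test between two Resource attribute maps.
public static class AttributeParity
{
    // Returns one message per key that is missing on either side or whose values differ.
    public static IReadOnlyList<string> Differences(
        IReadOnlyDictionary<string, object> otelAttributes,
        IReadOnlyDictionary<string, object> logAttributes)
    {
        var problems = new List<string>();

        // Walk the union of both key sets so drift in either direction is reported.
        var allKeys = otelAttributes.Keys
            .Union(logAttributes.Keys)
            .OrderBy(k => k, StringComparer.Ordinal);

        foreach (var key in allKeys)
        {
            var inOtel = otelAttributes.TryGetValue(key, out var left);
            var inLogs = logAttributes.TryGetValue(key, out var right);

            if (!inOtel)
                problems.Add($"'{key}' is missing from the metrics/traces Resource");
            else if (!inLogs)
                problems.Add($"'{key}' is missing from the log Resource");
            else if (!object.Equals(left, right))
                problems.Add($"'{key}' differs: '{left}' vs '{right}'");
        }

        return problems;
    }
}
```

A test would build both maps from one representative options object and assert that `Differences` returns an empty list. Reporting the offending keys makes a failure easier to diagnose than a bare equality assertion.
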
### Telemetry-006 — Malformed `OtlpEndpoint` throws `UriFormatException` late, with no context

| | |
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbTelemetryExtensions.cs:127-135` |

**Description**

`ConfigureOtlp` does `otlp.Endpoint = new Uri(options.OtlpEndpoint)` with no validation. A
malformed endpoint string (typo, missing scheme) throws a raw `UriFormatException` deep
inside OTel exporter construction at host-build time, with no mention of which option was at
fault. `BuildOptions` already fails fast and clearly for a missing `ServiceName`, but does
not validate that `OtlpEndpoint` is a well-formed absolute URI when `Exporter == Otlp` (nor
that it is non-empty — an Otlp exporter with a null endpoint is silently registered and
points nowhere). The Serilog path (`ZbSerilogConfig`) has the same untyped string→endpoint
handoff.

**Recommendation**

In `BuildOptions`, when `Exporter == ZbExporter.Otlp`, validate `OtlpEndpoint` with
`Uri.TryCreate(..., UriKind.Absolute, out _)` and throw an `ArgumentException` naming the
option (consistent with the existing `ServiceName` guard) rather than letting a bare
`UriFormatException` escape later.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — added `ZbTelemetryOptionsValidator.Validate`, called from
both `BuildOptions` and `AddZbSerilog`: when `Exporter == Otlp` it requires a non-empty,
well-formed absolute `OtlpEndpoint` and throws a named `ArgumentException` (no-op for Prometheus);
covered by three new tests.

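The fail-fast guard described for Telemetry-006 can be sketched with the BCL alone. This is an illustration of the recommended `Uri.TryCreate` check, not the actual `ZbTelemetryOptionsValidator`: `ExporterKind` and `EndpointGuard` are hypothetical stand-ins for `ZbExporter` and the validator, and the exception messages are made up.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for ZbExporter.
public enum ExporterKind { Prometheus, Otlp }

public static class EndpointGuard
{
    public static void Validate(ExporterKind exporter, string? otlpEndpoint)
    {
        // Prometheus needs no endpoint, so there is nothing to validate.
        if (exporter != ExporterKind.Otlp)
            return;

        if (string.IsNullOrWhiteSpace(otlpEndpoint))
            throw new ArgumentException(
                "OtlpEndpoint is required when Exporter is Otlp.", nameof(otlpEndpoint));

        // Reject anything that does not parse as an absolute URI, naming the option at fault.
        if (!Uri.TryCreate(otlpEndpoint, UriKind.Absolute, out _))
            throw new ArgumentException(
                $"OtlpEndpoint '{otlpEndpoint}' is not a well-formed absolute URI.",
                nameof(otlpEndpoint));
    }
}
```

One caveat worth weighing: `Uri.TryCreate` accepts any scheme, so a value such as `localhost:4317` parses as an absolute URI whose scheme is `localhost`. A stricter guard could additionally require `Uri.Scheme` to be `http` or `https`; whether the library does so is not stated in this review.
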
### Telemetry-007 — Redaction snapshot allocates a dictionary on every log event

| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry.Serilog/RedactionEnricher.cs:49-67` |

**Description**

When an `ILogRedactor` is registered, `Enrich` allocates a new
`Dictionary<string, object?>(logEvent.Properties.Count)`, copies every property into it, and
then iterates again to diff/write-back — on every single log event, across every logging
thread. Enrichers are on the hottest path in the library (they run for each event the level
filter admits). The early-return when no redactor is registered keeps the common case free,
so the cost is borne only by redaction-enabled consumers (MxGateway), but for a high-volume
gateway this is non-trivial steady-state allocation/GC pressure.

**Recommendation**

Consider redacting in place against `logEvent.Properties` without a full snapshot copy (e.g.
only materialize replacements for keys the redactor touches), or short-circuit when the event
has no properties. At minimum, document the per-event cost so consumers can weigh enabling
redaction on very hot loggers. Acceptable as-is given redaction is opt-in and security-first.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — `Enrich` now short-circuits before any snapshot allocation
when the event has no properties (and still early-returns when no redactor is registered), so the
per-event dictionary copy is only paid when there is actually something to redact.

### Telemetry-008 — `MapZbMetrics` XML doc claims it is "only valid when Exporter = Prometheus" — stale

| | |
|--|--|
| Severity | Low |
| Category | Documentation & XML docs |
| Status | Resolved |
| Location | `ZB.MOM.WW.Telemetry/src/ZB.MOM.WW.Telemetry/ZbMetricsEndpointExtensions.cs:11-14` |

**Description**

The XML doc on `MapZbMetrics` states it is "Only valid when
`ZbTelemetryOptions.Exporter = ZbExporter.Prometheus`." That contradicts the actual wiring:
`ApplyMetricsExporter` (`ZbTelemetryExtensions.cs:107-116`) **always** calls
`AddPrometheusExporter()` regardless of the `Exporter` setting — OTLP is purely additive.
The library's own README ("Prometheus is **always wired** for metrics regardless of the
`Exporter` setting") and the test `AddZbTelemetry_OtlpExporter_StillServesPrometheusEndpoint`
both confirm `/metrics` works under `Exporter = Otlp`. The doc-comment therefore tells
consumers the opposite of the real (and intended) behaviour and could lead them to wrongly
believe `MapZbMetrics` is a no-op under OTLP. The same stale wording is mirrored in the
shared contract (`ZB.MOM.WW.Telemetry.md`, `MapZbMetrics` summary).

**Recommendation**

Update the doc-comment to state that the Prometheus exporter is always registered and
`MapZbMetrics` is valid under any `Exporter` value (Prometheus is always-on; OTLP is an
overlay). Align the shared-contract summary for `MapZbMetrics` to match.

**Resolution**

Resolved in `544a6dd`, 2026-06-01 — rewrote the `MapZbMetrics` XML doc to state it is valid
under any `Exporter` value (Prometheus always-on; OTLP additive overlay) and aligned the matching
shared-contract summary.