docs+code: close Theme 1 — 24 design-doc / XML-doc drift findings
Doc/XML-comment drift + small adherence fixes across 17 modules. Highlights: - Host-017: site CoordinatedShutdown ordering — SiteStreamGrpcServer gains CancelAllStreams() (refuse new streams, cancel active), wired into Program.cs site branch via ApplicationStopping. - InboundAPI-021: ParentExecutionId now travels on RouteToGet/SetAttributes symmetric with RouteToCallRequest; RouteHelper stamps from _parentExecutionId. - ClusterInfra-012: ClusterOptionsValidator now requires both seed nodes. - Comm-018: SiteCommunicationActor.HeartbeatMessage.IsActive derived from cluster leader check (was hardcoded true). - DM-020: reconciliation audit row attributes the current user, not prior deployer. - SEL-019: EventLogPurgeService early-exits on standby via active-node check. - Plus comment/XML-doc accuracy fixes across AuditLog, ConfigurationDatabase, NotificationOutbox, SiteRuntime, SiteCallAudit; doc refreshes for Component- Commons / -ManagementService / -CLI / -ExternalSystemGateway / -HealthMonitoring / -Transport / -ConfigurationDatabase; CD-023 index-name doc alignment. 11 new regression tests (RouteHelper x4, SiteStreamGrpcServer x2, ClusterOptionsValidator x1, SiteCommunicationActor x1, DeploymentService x1, EventLogPurgeService x3). Build clean (0 warnings); InboundAPI/Communication/ Host suites all green. README regenerated: 112 open (was 136).
This commit is contained in:
@@ -36,8 +36,6 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
|
||||
| Notification Outbox parked count | Notification Outbox (central) | Count of `Parked` notifications — central-computed, not site-reported |
|
||||
| `SiteAuditBacklog` | Audit Log (site) | Count of `Pending` rows in the site-local `AuditLog` plus oldest-pending-age plus on-disk bytes. A configurable threshold drives a Health dashboard warning on the affected site tile. |
|
||||
| `SiteAuditWriteFailures` | Audit Log (site) | Count of failed hot-path audit appends at the site since the last health report. |
|
||||
| `SiteAuditTelemetryStalled` | Audit Log (site) | Boolean flag set when reconciliation reports a non-draining site-local audit backlog over two consecutive cycles. |
|
||||
| `CentralAuditWriteFailures` | Audit Log (central) | Count of central direct-write audit failures (Inbound API middleware, Notification Outbox dispatcher, and any other central direct writers) since the last interval. |
|
||||
| `AuditRedactionFailure` | Audit Log (central) | Count of payload redactor errors (over-redacted payloads, safety-net hit) since the last interval. |
|
||||
|
||||
## Reporting Protocol
|
||||
@@ -86,10 +84,10 @@ Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher** —
|
||||
The Audit Log spans both sites (hot-path append + telemetry forward) and central (direct-write + ingest + redaction). Its operational health surfaces as three new dashboard tiles grouped under **Audit**:
|
||||
|
||||
- **Audit volume** — events/min landing in the central `AuditLog` table, shown global plus per-site sparkline; sourced from the Audit Log component on the active central node.
|
||||
- **Audit error rate** — percent of central `AuditLog` rows with `Status` other than `Success` / `Delivered` / `Enqueued` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, transient failures, parked deliveries, etc.) — NOT the audit writer's own health. Audit-writer issues surface separately via `CentralAuditWriteFailures` and `AuditRedactionFailure`.
|
||||
- **Audit backlog** — global aggregate of `SiteAuditBacklog` across reporting sites (count of `Pending` site-local audit rows, oldest pending age, on-disk bytes); click drills into a per-site breakdown. The per-site tile surfaces a warning badge when its `SiteAuditBacklog` crosses the configurable threshold or when `SiteAuditTelemetryStalled` is set.
|
||||
- **Audit error rate** — percent of central `AuditLog` rows with `Status` other than `Success` / `Delivered` / `Enqueued` over a rolling 5-minute window. This is the operational error rate of audited operations (HTTP 5xx, transient failures, parked deliveries, etc.) — NOT the audit writer's own health. Audit-writer issues surface separately via `AuditRedactionFailure`.
|
||||
- **Audit backlog** — global aggregate of `SiteAuditBacklog` across reporting sites (count of `Pending` site-local audit rows, oldest pending age, on-disk bytes); click drills into a per-site breakdown. The per-site tile surfaces a warning badge when its `SiteAuditBacklog` crosses the configurable threshold.
|
||||
|
||||
These tiles are **point-in-time** like the Notification Outbox and Site Call Audit KPI tiles — no time-series store; consistent with Health Monitoring's "current status only" philosophy. The site-scoped `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics arrive in the existing site health report; the central-scoped `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics are central-computed alongside the existing central KPIs.
|
||||
These tiles are **point-in-time** like the Notification Outbox and Site Call Audit KPI tiles — no time-series store; consistent with Health Monitoring's "current status only" philosophy. The site-scoped `SiteAuditBacklog` / `SiteAuditWriteFailures` metrics arrive in the existing site health report; the central-scoped `AuditRedactionFailure` metric is central-computed alongside the existing central KPIs.
|
||||
|
||||
## Central Storage
|
||||
|
||||
@@ -112,7 +110,7 @@ These tiles are **point-in-time** like the Notification Outbox and Site Call Aud
|
||||
- **Cluster Infrastructure (site)**: Provides node role status.
|
||||
- **Notification Outbox (central)**: Provides central-computed outbox KPIs — queue depth, stuck count, parked count — for the headline dashboard tiles.
|
||||
- **Site Call Audit (central)**: Provides central-computed cached-call KPIs — buffered count, parked count, failed/delivered (last interval), oldest pending age, stuck count — for the headline dashboard tiles.
|
||||
- **Audit Log (#23)**: Provides the site-reported `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics (via the site health report) and the central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics, plus the central audit-row rate feeding the **Audit** dashboard tile group (Audit volume, Audit error rate, Audit backlog).
|
||||
- **Audit Log (#23)**: Provides the site-reported `SiteAuditBacklog` / `SiteAuditWriteFailures` metrics (via the site health report) and the central-computed `AuditRedactionFailure` metric, plus the central audit-row rate feeding the **Audit** dashboard tile group (Audit volume, Audit error rate, Audit backlog).
|
||||
|
||||
## Interactions
|
||||
|
||||
|
||||
Reference in New Issue
Block a user