CBDDC/docs/upgrade-peer-confirmed-pruning.md
Joseph Doherty ce727eb30d
docs: align internal docs to enterprise standards
Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
2026-02-20 13:23:55 -05:00

Upgrade Notes: Peer-Confirmed Pruning

This guide covers adopting peer-confirmed pruning semantics introduced across Phases 1-4.

What changed

  1. Oplog pruning now uses an effective cutoff:
    • min(retention cutoff, confirmation cutoff) when peer confirmations are complete.
    • Pruning is skipped for that cycle when any active tracked peer is missing a required confirmation.
  2. Peer tracking is now a managed lifecycle:
    • RemovePeerTrackingAsync(nodeId, removeRemoteConfig: false) stops a peer from gating pruning (deprecation) and leaves its remote config in place.
    • RemoveRemotePeerAsync(nodeId) removes both static peer config and tracking.
  3. Hosting health now includes confirmation lag semantics:
    • Degraded for lagging or unconfirmed tracked peers.
    • Unhealthy for critical lag or storage failures.
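
The cutoff rule in item 1 can be sketched as a small function. This is a minimal illustration, not CBDDC's implementation: the `PruneCutoff` class, the use of `DateTimeOffset` for cutoffs, and the choice to treat the confirmation cutoff as the oldest confirmation among active tracked peers are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; names and types are illustrative assumptions.
public static class PruneCutoff
{
    // Returns the effective cutoff, or null when pruning should be skipped this cycle.
    // `confirmations` maps each active tracked peer to its latest confirmed
    // timestamp; a null value means the peer has not confirmed yet.
    public static DateTimeOffset? Compute(
        DateTimeOffset retentionCutoff,
        IReadOnlyDictionary<string, DateTimeOffset?> confirmations)
    {
        DateTimeOffset effective = retentionCutoff;

        foreach (var entry in confirmations)
        {
            // A single missing confirmation blocks pruning for the whole cycle.
            if (entry.Value is null)
            {
                return null;
            }

            // Assumed: the confirmation cutoff is the oldest peer confirmation,
            // so the effective cutoff is the minimum of all candidates.
            if (entry.Value.Value < effective)
            {
                effective = entry.Value.Value;
            }
        }

        return effective;
    }
}
```

A `null` result corresponds to the skipped prune cycle; any other result is never later than the retention cutoff.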

Upgrade impact to expect

  • During initial rollout, a peer may appear in peersWithNoConfirmation until the first successful confirmation update.
  • A stale peer that is still tracked as active can block pruning, keep health Degraded, or both.

Recommended rollout

  1. Upgrade one node and validate its health payload and pruning logs.
  2. Upgrade remaining nodes in the cluster.
  3. Audit peer inventory and remove/deprecate stale peers.
  4. Tune lag thresholds after observing normal confirmation latency.

Peer inventory and cleanup

List configured peers:

var peers = await peerManagement.GetAllRemotePeersAsync(cancellationToken);

Deprecate from pruning only:

await peerManagement.RemovePeerTrackingAsync(
    nodeId: "retired-peer",
    removeRemoteConfig: false,
    cancellationToken);

Fully remove the peer config and its tracking:

await peerManagement.RemoveRemotePeerAsync("retired-peer", cancellationToken);
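
The choice between the two calls above can be wrapped in one helper. The sketch below is hypothetical: `IPeerCleanup` is a stand-in interface that mirrors only the two calls shown in this guide (the real peer-management type may differ), and `PeerRetirement.RetireAsync` is an illustrative name.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in that mirrors only the two calls used in this guide.
public interface IPeerCleanup
{
    Task RemovePeerTrackingAsync(string nodeId, bool removeRemoteConfig, CancellationToken cancellationToken);
    Task RemoveRemotePeerAsync(string nodeId, CancellationToken cancellationToken);
}

public static class PeerRetirement
{
    // Decommissioned nodes lose both config and tracking; other retired peers
    // are only removed from prune gating and keep their remote config.
    public static Task RetireAsync(
        IPeerCleanup peers,
        string nodeId,
        bool decommissioned,
        CancellationToken cancellationToken = default)
    {
        return decommissioned
            ? peers.RemoveRemotePeerAsync(nodeId, cancellationToken)
            : peers.RemovePeerTrackingAsync(
                nodeId: nodeId,
                removeRemoteConfig: false,
                cancellationToken: cancellationToken);
    }
}
```

Keeping the decision in one place makes it harder to fully remove a peer that was only meant to be deprecated.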

Validation checklist

  • /health returns Healthy, or a transient Degraded that is expected during warm-up.
  • laggingPeers and peersWithNoConfirmation converge toward zero for active peers.
  • Maintenance logs no longer report prune skip reasons for retired peers.
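
When reading the health payload during validation, it can help to keep the status rules in mind as code. The following is a rough sketch, assuming two configurable thresholds (`lagThreshold` and `criticalLagThreshold`, with the critical one at least as large) and inclusive comparisons; the type and parameter names are hypothetical and not taken from CBDDC.

```csharp
using System;
using System.Collections.Generic;

public enum HealthState { Healthy, Degraded, Unhealthy }

// Hypothetical evaluator; threshold names and comparison rules are assumptions.
public static class ConfirmationHealth
{
    // `peerLags` holds one entry per active tracked peer;
    // null means the peer has no confirmation yet.
    public static HealthState Evaluate(
        IEnumerable<TimeSpan?> peerLags,
        TimeSpan lagThreshold,
        TimeSpan criticalLagThreshold,
        bool storageFailure = false)
    {
        // Storage failures are Unhealthy regardless of peer lag.
        if (storageFailure)
        {
            return HealthState.Unhealthy;
        }

        var state = HealthState.Healthy;
        foreach (TimeSpan? lag in peerLags)
        {
            // Critical lag on any peer makes the whole result Unhealthy.
            if (lag is TimeSpan value && value >= criticalLagThreshold)
            {
                return HealthState.Unhealthy;
            }

            // Unconfirmed or lagging peers degrade health but do not fail it.
            if (lag is null || lag.Value >= lagThreshold)
            {
                state = HealthState.Degraded;
            }
        }

        return state;
    }
}
```

Under these assumptions, a freshly upgraded cluster with one unconfirmed peer evaluates to Degraded, which matches the warm-up behavior described in the checklist.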

Rollback/mitigation

If rollout exposes unexpected persistent degradation:

  1. Remove tracking for permanently retired peers.
  2. Temporarily raise lag thresholds to reduce alert noise while investigating.
  3. Reserve full peer removal for nodes that are confirmed decommissioned.

For a detailed operator procedure, see Peer Deprecation & Removal Runbook.