CBDDC/docs/upgrade-peer-confirmed-pruning.md
Joseph Doherty ce727eb30d
docs: align internal docs to enterprise standards
Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
2026-02-20 13:23:55 -05:00


# Upgrade Notes: Peer-Confirmed Pruning
This guide covers adopting peer-confirmed pruning semantics introduced across
Phases 1-4.
## What changed
1. Oplog pruning now uses an **effective cutoff**:
   - `min(retention cutoff, confirmation cutoff)` when peer confirmations are complete.
   - prune is skipped for that cycle when an active tracked peer is missing required confirmation.
2. Peer tracking is now a managed lifecycle:
   - `RemovePeerTrackingAsync(nodeId, removeRemoteConfig: false)` deprecates a peer from prune gating.
   - `RemoveRemotePeerAsync(nodeId)` removes both static peer config and tracking.
3. Hosting health now includes confirmation lag semantics:
   - `Degraded` for lagging or unconfirmed tracked peers.
   - `Unhealthy` for critical lag or storage failures.
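
The effective-cutoff rule in item 1 can be sketched as a small pure function. This is an illustration only, not the library's implementation: `PruneCutoffSketch` and its parameters are hypothetical names, wall-clock `DateTimeOffset` values stand in for whatever timestamp type the oplog actually uses, and the confirmation cutoff is assumed to be the oldest confirmed timestamp across active tracked peers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the CBDDC API.
public static class PruneCutoffSketch
{
    // Returns the timestamp before which oplog entries may be pruned,
    // or null when pruning should be skipped for this cycle.
    // `confirmations` maps each active tracked peer to its latest confirmed
    // timestamp; a null value means the peer has not confirmed anything yet.
    public static DateTimeOffset? EffectiveCutoff(
        DateTimeOffset retentionCutoff,
        IReadOnlyDictionary<string, DateTimeOffset?> confirmations)
    {
        DateTimeOffset effective = retentionCutoff;

        foreach (var entry in confirmations)
        {
            if (entry.Value is null)
            {
                // An active tracked peer is missing its confirmation: skip the cycle.
                return null;
            }

            // Assumption: the confirmation cutoff is the minimum across peers,
            // so the slowest peer bounds how far pruning may advance.
            if (entry.Value.Value < effective)
            {
                effective = entry.Value.Value;
            }
        }

        return effective;
    }
}
```

With no tracked peers the function returns the retention cutoff unchanged, which matches the pre-upgrade behavior under these assumptions.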
## Upgrade impact to expect
- During initial rollout, a peer may appear in `peersWithNoConfirmation` until the
first successful confirmation update.
- Any stale active tracked peer can block prune progress and/or keep health degraded.
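
To make the health impact concrete, the following sketch maps per-peer confirmation lag to the three states described above. It is a simplified model under stated assumptions: `HealthState`, `ConfirmationHealthSketch`, and both threshold parameters are hypothetical names, thresholds are treated as inclusive, and a `null` lag stands for a tracked peer with no confirmation yet.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration; the real health check may differ.
public enum HealthState { Healthy, Degraded, Unhealthy }

public static class ConfirmationHealthSketch
{
    public static HealthState Classify(
        IEnumerable<TimeSpan?> peerLags,
        TimeSpan lagThreshold,
        TimeSpan criticalLagThreshold,
        bool storageFailed)
    {
        // Storage failures are reported as Unhealthy regardless of peer lag.
        if (storageFailed)
        {
            return HealthState.Unhealthy;
        }

        var state = HealthState.Healthy;
        foreach (var lag in peerLags)
        {
            if (lag is null)
            {
                // Unconfirmed tracked peer: degraded, but keep scanning in
                // case another peer has critical lag.
                state = HealthState.Degraded;
                continue;
            }

            if (lag.Value >= criticalLagThreshold)
            {
                return HealthState.Unhealthy;
            }

            if (lag.Value >= lagThreshold)
            {
                state = HealthState.Degraded;
            }
        }

        return state;
    }
}
```

Under this model a single stale peer is enough to hold the whole node at `Degraded`, which is why the cleanup steps later in this guide matter.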
## Recommended rollout sequence
1. Upgrade one node and validate health payload and pruning logs.
2. Upgrade remaining nodes in the cluster.
3. Audit peer inventory and remove/deprecate stale peers.
4. Tune lag thresholds after observing normal confirmation latency.
## Peer inventory and cleanup
List configured peers:
```csharp
var peers = await peerManagement.GetAllRemotePeersAsync(cancellationToken);
```
Deprecate from pruning only:
```csharp
await peerManagement.RemovePeerTrackingAsync(
    nodeId: "retired-peer",
    removeRemoteConfig: false,
    cancellationToken);
```
Fully remove peer + tracking:
```csharp
await peerManagement.RemoveRemotePeerAsync("retired-peer", cancellationToken);
```
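
Before calling either removal method, it can help to shortlist which peers look stale. The helper below is a hedged sketch rather than a CBDDC feature: `PeerAuditSketch`, the `staleAfter` window, and the dictionary of last confirmation times are assumptions, and how those times are collected depends on your deployment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical audit helper, not part of the CBDDC API.
public static class PeerAuditSketch
{
    // Returns node IDs whose last confirmation is missing or older than
    // `staleAfter`, sorted for stable output. Results are candidates for
    // operator review, not an automatic removal list.
    public static IReadOnlyList<string> FindStalePeers(
        IReadOnlyDictionary<string, DateTimeOffset?> lastConfirmationByNodeId,
        DateTimeOffset now,
        TimeSpan staleAfter)
    {
        return lastConfirmationByNodeId
            .Where(p => p.Value is null || now - p.Value.Value > staleAfter)
            .Select(p => p.Key)
            .OrderBy(id => id, StringComparer.Ordinal)
            .ToList();
    }
}
```

During initial rollout, peers without any confirmation are expected, so treat a `null` entry as a prompt to check the peer rather than proof that it is retired.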
## Validation checklist
- `/health` returns `Healthy` or expected transient `Degraded` during warm-up.
- `laggingPeers` and `peersWithNoConfirmation` converge toward zero for active peers.
- Maintenance logs no longer report prune skip reasons for retired peers.
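
The second checklist item can be automated with a small payload check. This sketch assumes, without confirmation from the source, that `laggingPeers` and `peersWithNoConfirmation` are top-level JSON arrays of node IDs in the `/health` response; adjust the property lookup if your payload nests them or reports counts instead. `HealthPayloadSketch` is a hypothetical name.

```csharp
using System.Text.Json;

// Hypothetical validation helper; the payload shape is an assumption.
public static class HealthPayloadSketch
{
    // True only when both peer lists are present and empty.
    public static bool PeersConverged(string healthJson)
    {
        using JsonDocument doc = JsonDocument.Parse(healthJson);
        return IsEmptyArray(doc.RootElement, "laggingPeers")
            && IsEmptyArray(doc.RootElement, "peersWithNoConfirmation");
    }

    private static bool IsEmptyArray(JsonElement root, string name)
    {
        // TryGetProperty throws on non-object elements, so guard first.
        if (root.ValueKind != JsonValueKind.Object)
        {
            return false;
        }

        // A missing property is treated as "not verified" rather than converged.
        if (!root.TryGetProperty(name, out JsonElement value))
        {
            return false;
        }

        return value.ValueKind == JsonValueKind.Array && value.GetArrayLength() == 0;
    }
}
```

Treating a missing property as a failure is a deliberate choice here: it avoids reporting convergence against a node that has not been upgraded yet.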
## Rollback/mitigation
If rollout exposes unexpected persistent degradation:
1. Remove tracking for permanently retired peers.
2. Temporarily raise lag thresholds to reduce alert noise while investigating.
3. Reserve full peer removal for nodes that are confirmed decommissioned.
For a detailed operator procedure, see
[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md).