# Upgrade Notes: Peer-Confirmed Pruning

This guide covers adopting the peer-confirmed pruning semantics introduced across
Phases 1-4.

## What changed

1. Oplog pruning now uses an **effective cutoff**:
   - `min(retention cutoff, confirmation cutoff)` when peer confirmations are complete.
   - Pruning is skipped for that cycle when an active tracked peer is missing a required confirmation.
2. Peer tracking is now a managed lifecycle:
   - `RemovePeerTrackingAsync(nodeId, removeRemoteConfig: false)` deprecates a peer from prune gating.
   - `RemoveRemotePeerAsync(nodeId)` removes both static peer config and tracking.
3. Hosting health now includes confirmation lag semantics:
   - `Degraded` for lagging or unconfirmed tracked peers.
   - `Unhealthy` for critical lag or storage failures.

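The effective-cutoff rule in item 1 can be sketched as follows. This is an illustration only, not the library's implementation: `PruneCutoff`, `ComputeEffectiveCutoff`, and the dictionary of per-peer confirmations are hypothetical names, cutoffs are modeled as `DateTimeOffset` values, and the confirmation cutoff is assumed to be the oldest confirmation among active tracked peers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; the real pruning code may differ.
public static class PruneCutoff
{
    // Returns the effective cutoff, or null when pruning should be skipped
    // for this cycle. A null value in confirmationsByPeer means that active
    // tracked peer has no confirmation recorded yet.
    public static DateTimeOffset? ComputeEffectiveCutoff(
        DateTimeOffset retentionCutoff,
        IReadOnlyDictionary<string, DateTimeOffset?> confirmationsByPeer)
    {
        var effective = retentionCutoff;
        foreach (var confirmation in confirmationsByPeer.Values)
        {
            if (confirmation is null)
            {
                // Missing confirmation: skip pruning this cycle.
                return null;
            }

            // Assumption: the confirmation cutoff is the oldest confirmation,
            // so the effective cutoff is the minimum of all values seen.
            if (confirmation.Value < effective)
            {
                effective = confirmation.Value;
            }
        }

        return effective;
    }
}
```

In this sketch a node with no tracked peers falls back to the retention cutoff, which is an assumption rather than documented behavior.
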
## Upgrade impact to expect

- During initial rollout, a peer may appear in `peersWithNoConfirmation` until its
  first successful confirmation update.
- Any stale peer that is still actively tracked can block prune progress, keep health degraded, or both.

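To make the warm-up behavior concrete, here is a minimal sketch of how a single tracked peer might map to a health status. The type names, the two threshold parameters, and the use of `>=` comparisons are assumptions for illustration; only the `Degraded` and `Unhealthy` meanings come from this guide, and storage failures are left out.

```csharp
using System;

// Hypothetical status enum mirroring the health levels named in this guide.
public enum PeerHealth { Healthy, Degraded, Unhealthy }

public static class ConfirmationHealth
{
    // confirmationLag is null when the peer has not confirmed anything yet,
    // which is the expected state right after an upgrade.
    public static PeerHealth Classify(
        TimeSpan? confirmationLag,
        TimeSpan lagThreshold,
        TimeSpan criticalLagThreshold)
    {
        if (confirmationLag is null)
        {
            return PeerHealth.Degraded; // unconfirmed tracked peer
        }

        if (confirmationLag.Value >= criticalLagThreshold)
        {
            return PeerHealth.Unhealthy; // critical lag
        }

        if (confirmationLag.Value >= lagThreshold)
        {
            return PeerHealth.Degraded; // lagging peer
        }

        return PeerHealth.Healthy;
    }
}
```

Under these assumptions, a stale peer that never confirms stays `Degraded` indefinitely, which is why retired peers should be deprecated or removed.
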
## Recommended rollout sequence

1. Upgrade one node and validate health payload and pruning logs.
2. Upgrade remaining nodes in the cluster.
3. Audit peer inventory and remove/deprecate stale peers.
4. Tune lag thresholds after observing normal confirmation latency.

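For step 4, one possible way to turn observed confirmation latencies into a threshold is to take a high percentile and apply a safety factor. This is a suggested heuristic, not something the library prescribes; `LagThresholdTuning`, `SuggestThreshold`, and both tuning parameters are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical tuning helper for illustration only.
public static class LagThresholdTuning
{
    // Picks the nearest-rank percentile of the observed latencies and
    // scales it by safetyFactor to leave headroom above normal behavior.
    public static TimeSpan SuggestThreshold(
        IEnumerable<TimeSpan> observedLatencies,
        double percentile,
        double safetyFactor)
    {
        var sorted = observedLatencies.OrderBy(latency => latency).ToArray();
        if (sorted.Length == 0)
        {
            throw new ArgumentException("At least one observation is required.", nameof(observedLatencies));
        }

        // Nearest-rank method: rank = ceil(p / 100 * n), clamped to the array bounds.
        var rank = (int)Math.Ceiling(percentile / 100.0 * sorted.Length);
        var index = Math.Clamp(rank - 1, 0, sorted.Length - 1);

        return TimeSpan.FromTicks((long)(sorted[index].Ticks * safetyFactor));
    }
}
```

A call such as `LagThresholdTuning.SuggestThreshold(samples, 99, 2.0)` would suggest twice the 99th-percentile latency; the right percentile and factor depend on the cluster.
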
## Peer inventory and cleanup

List configured peers:

```csharp
var peers = await peerManagement.GetAllRemotePeersAsync(cancellationToken);
```

Deprecate from pruning only:

```csharp
await peerManagement.RemovePeerTrackingAsync(
    nodeId: "retired-peer",
    removeRemoteConfig: false,
    cancellationToken);
```

Fully remove peer + tracking:

```csharp
await peerManagement.RemoveRemotePeerAsync("retired-peer", cancellationToken);
```

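The calls above can be combined into a small audit pass. The sketch below is self-contained, so it declares its own stand-ins: `IPeerManagement`, `PeerInfo`, `InMemoryPeerManagement`, and `PeerAudit` are hypothetical shapes, and the real interface, return types, and peer model may differ. Only the method names and the `removeRemoteConfig: false` usage are taken from this guide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shapes for illustration; the real types may differ.
public sealed record PeerInfo(string NodeId);

public interface IPeerManagement
{
    Task<IReadOnlyList<PeerInfo>> GetAllRemotePeersAsync(CancellationToken cancellationToken);
    Task RemovePeerTrackingAsync(string nodeId, bool removeRemoteConfig, CancellationToken cancellationToken);
}

// In-memory stand-in so the sketch can run without a cluster.
public sealed class InMemoryPeerManagement : IPeerManagement
{
    private readonly List<PeerInfo> _peers;

    public HashSet<string> Untracked { get; } = new();

    public InMemoryPeerManagement(params string[] nodeIds) =>
        _peers = nodeIds.Select(id => new PeerInfo(id)).ToList();

    public Task<IReadOnlyList<PeerInfo>> GetAllRemotePeersAsync(CancellationToken cancellationToken) =>
        Task.FromResult<IReadOnlyList<PeerInfo>>(_peers);

    public Task RemovePeerTrackingAsync(string nodeId, bool removeRemoteConfig, CancellationToken cancellationToken)
    {
        Untracked.Add(nodeId);
        return Task.CompletedTask;
    }
}

public static class PeerAudit
{
    // Deprecates every configured peer whose node ID is in retiredNodeIds
    // and returns the IDs that were deprecated. Static config is kept.
    public static async Task<IReadOnlyList<string>> DeprecateRetiredPeersAsync(
        IPeerManagement peerManagement,
        ISet<string> retiredNodeIds,
        CancellationToken cancellationToken)
    {
        var peers = await peerManagement.GetAllRemotePeersAsync(cancellationToken);
        var deprecated = new List<string>();

        foreach (var peer in peers.Where(p => retiredNodeIds.Contains(p.NodeId)))
        {
            await peerManagement.RemovePeerTrackingAsync(
                nodeId: peer.NodeId,
                removeRemoteConfig: false,
                cancellationToken: cancellationToken);
            deprecated.Add(peer.NodeId);
        }

        return deprecated;
    }
}
```

The list of retired node IDs is supplied by the operator here on purpose: deciding that a peer is stale is a judgment call this sketch does not try to automate.
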
## Validation checklist

- `/health` returns `Healthy`, or the expected transient `Degraded` during warm-up.
- `laggingPeers` and `peersWithNoConfirmation` converge toward zero for active peers.
- Maintenance logs no longer report prune skip reasons for retired peers.

## Rollback/mitigation

If the rollout exposes unexpected, persistent degradation:

1. Remove tracking for permanently retired peers.
2. Temporarily raise lag thresholds to reduce alert noise while investigating.
3. Reserve full peer removal for nodes that are confirmed decommissioned.

For a detailed operator procedure, see
[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md).