
# Peer Deprecation & Removal Runbook

Operational workflow for safely deprecating or removing peers in clusters using
peer-confirmed pruning.

## When to use this runbook

- A site is permanently decommissioned.
- A peer has been unreachable long enough to block prune progress.
- A peer is being replaced and should stop gating prune decisions.

## Decision matrix

| Scenario | Action |
|------|-----------|
| Peer is temporarily offline and expected to return soon | Keep tracking; monitor lag and confirmations. |
| Peer should stay configured but must stop gating pruning | `RemovePeerTrackingAsync(nodeId, removeRemoteConfig: false)` |
| Peer is permanently removed from topology | `RemoveRemotePeerAsync(nodeId)` |
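
The matrix can also be expressed as a small dispatch helper, which may be useful in operational tooling. The sketch below is illustrative only: `IPeerManagement`, `PeerScenario`, and `PeerDeprecation` are hypothetical names, and the method signatures are inferred from the calls shown in this runbook rather than taken from the actual API.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the peer management service. Only the two method
// names and the removeRemoteConfig parameter come from this runbook; the exact
// signatures are an assumption.
public interface IPeerManagement
{
    Task RemovePeerTrackingAsync(string nodeId, bool removeRemoteConfig, CancellationToken cancellationToken = default);
    Task RemoveRemotePeerAsync(string nodeId, CancellationToken cancellationToken = default);
}

// Hypothetical names for the three rows of the decision matrix.
public enum PeerScenario
{
    TemporarilyOffline,
    StopGatingKeepConfig,
    PermanentlyRemoved,
}

public static class PeerDeprecation
{
    // Maps a scenario to the action listed in the decision matrix.
    public static Task ApplyAsync(
        IPeerManagement peerManagement,
        string nodeId,
        PeerScenario scenario,
        CancellationToken cancellationToken = default)
    {
        return scenario switch
        {
            // Keep tracking: nothing to change, only monitor.
            PeerScenario.TemporarilyOffline => Task.CompletedTask,
            // Stop gating pruning but keep the remote configuration.
            PeerScenario.StopGatingKeepConfig => peerManagement.RemovePeerTrackingAsync(
                nodeId, removeRemoteConfig: false, cancellationToken: cancellationToken),
            // Remove the peer from the topology entirely.
            PeerScenario.PermanentlyRemoved => peerManagement.RemoveRemotePeerAsync(
                nodeId, cancellationToken),
            _ => throw new ArgumentOutOfRangeException(nameof(scenario)),
        };
    }
}
```

Keeping the "temporarily offline" row as an explicit no-op makes the choice to do nothing a recorded decision rather than an omission.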

## Procedure

1. Confirm what is intended for the peer: a temporary outage or a permanent decommission.
2. Inspect the health payload, in particular:
   - `peersWithNoConfirmation`
   - `laggingPeers`
   - `lastSuccessfulConfirmationUpdateByPeer`
3. To stop the peer from gating pruning while keeping its configuration, run:

```csharp
await peerManagement.RemovePeerTrackingAsync(
    nodeId: "peer-to-deprecate",
    removeRemoteConfig: false,
    cancellationToken);
```

4. To remove the peer permanently instead, run:

```csharp
await peerManagement.RemoveRemotePeerAsync("peer-to-remove", cancellationToken);
```

5. Re-check `/health` and verify the status transition:
   - A `Degraded` or `Unhealthy` status should clear if the removed peer was the cause.
6. Confirm that the maintenance logs no longer report that peer as blocking pruning.
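
The health checks in steps 2 and 5 can be scripted. The following is a minimal sketch using `System.Text.Json`, under the assumption that `peersWithNoConfirmation` and `laggingPeers` are JSON arrays of node ID strings on the object passed in. The actual payload shape and nesting are not specified in this runbook, so adjust the lookup to match the real `/health` response. `HealthPayloadCheck` is a hypothetical helper name.

```csharp
using System;
using System.Linq;
using System.Text.Json;

public static class HealthPayloadCheck
{
    // Returns true if nodeId is listed under the given array property.
    // Assumption: the property, when present, is an array of node ID strings.
    public static bool ListsPeer(JsonElement payload, string property, string nodeId)
    {
        return payload.ValueKind == JsonValueKind.Object
            && payload.TryGetProperty(property, out var list)
            && list.ValueKind == JsonValueKind.Array
            && list.EnumerateArray().Any(e =>
                e.ValueKind == JsonValueKind.String && e.GetString() == nodeId);
    }

    // Returns true while the peer is still reported as unconfirmed or lagging.
    public static bool IsPeerStillReported(string healthJson, string nodeId)
    {
        using var doc = JsonDocument.Parse(healthJson);
        var root = doc.RootElement;
        return ListsPeer(root, "peersWithNoConfirmation", nodeId)
            || ListsPeer(root, "laggingPeers", nodeId);
    }
}
```

Calling `IsPeerStillReported` before and after the change gives a simple comparison: the peer would be expected to be reported before removal and absent afterwards.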

## Post-change verification

- The peer no longer appears among the actively tracked peers.
- `maxLagMs` reflects only the peers that are still active.
- Pruning resumes with a valid effective cutoff, or remains blocked only for a known reason unrelated to peers.

## Emergency path

If pruning is blocked and storage pressure is high:

1. De-track the peer that is clearly retired first, using `RemovePeerTrackingAsync` with `removeRemoteConfig: false`.
2. Validate that pruning resumes.
3. Perform the full peer removal (`RemoveRemotePeerAsync`) only after change-control approval.

## Re-activation path

If a deprecated peer returns and should gate pruning again:

1. Ensure the peer's configuration is enabled and available.
2. Allow sync to re-register the peer and refresh its confirmations.
3. Watch the health payload until the peer no longer appears in `peersWithNoConfirmation`.
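
The final watch step can be automated with a bounded polling loop. The sketch below is a generic helper, not part of the library: `ReactivationWatch` and its parameters are hypothetical. The caller supplies `isPeerUnconfirmed`, a delegate that fetches the health payload and reports whether the peer is still listed in `peersWithNoConfirmation`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ReactivationWatch
{
    // Polls until the supplied check reports that the peer is no longer
    // unconfirmed, or until maxAttempts checks have been made.
    // Returns true if the peer was confirmed within the attempt budget.
    public static async Task<bool> WaitUntilConfirmedAsync(
        Func<CancellationToken, Task<bool>> isPeerUnconfirmed,
        TimeSpan pollInterval,
        int maxAttempts,
        CancellationToken cancellationToken = default)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            if (!await isPeerUnconfirmed(cancellationToken))
            {
                return true;
            }

            // Skip the delay after the final attempt.
            if (attempt < maxAttempts)
            {
                await Task.Delay(pollInterval, cancellationToken);
            }
        }

        return false;
    }
}
```

Bounding the attempts keeps the watch from running indefinitely if the peer never re-registers. A `false` result is a signal to investigate the peer's configuration or connectivity rather than to keep waiting.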