# Plan: Peer-Confirmed Oplog Retention and Peer Sync Health

## Objective

Move from time-only oplog pruning to peer-confirmed pruning so entries are not removed until peers have confirmed them, while also adding peer lifecycle management and health visibility.

## Requested Outcomes

- Oplogs are not cleared until confirmed by peers.
- Each node tracks the latest oplog confirmation per peer.
- New peers are automatically registered when they join.
- Deprecated peers can be explicitly removed from tracking.
- Hosting health checks report peer sync status, not only store availability.

## Current Baseline (Codebase)

- Pruning is retention-time based in `SyncOrchestrator` + `IOplogStore.PruneOplogAsync(...)`.
- Push ACK is binary success/fail (`AckResponse`) and does not persist peer confirmation state.
- Peer discovery exists (`IDiscoveryService`, `UdpDiscoveryService`, `CompositeDiscoveryService`).
- Persistent remote peer config exists (`IPeerConfigurationStore`, `PeerManagementService`).
- Hosting health check only validates oplog access (`CBDDCHealthCheck` in the Hosting project).

## Design Decisions

- Track confirmation as a persisted watermark per `(peerNodeId, sourceNodeId)` using HLC and hash.
- Use existing vector-clock exchange and successful push results to advance confirmations (no mandatory wire protocol break required).
- Treat tracked peers as pruning blockers until explicitly removed.
- Keep peer registration idempotent and safe for repeated discovery events.

## Data Model and Persistence Plan

### 1. Add peer confirmation tracking model

Create a new persisted model (example name: `PeerOplogConfirmation`) with fields:

- `PeerNodeId`
- `SourceNodeId`
- `ConfirmedWall`
- `ConfirmedLogic`
- `ConfirmedHash`
- `LastConfirmedUtc`
- `IsActive`

### 2. Add store abstraction

Add `IPeerOplogConfirmationStore` with operations:

- `EnsurePeerRegisteredAsync(peerNodeId, address, type)`
- `UpdateConfirmationAsync(peerNodeId, sourceNodeId, timestamp, hash)`
- `GetConfirmationsAsync()` and `GetConfirmationsForPeerAsync(peerNodeId)`
- `RemovePeerTrackingAsync(peerNodeId)`
- `GetActiveTrackedPeersAsync()`

### 3. BLite implementation

- Add entity, mapper, and indexed collection to `CBDDCDocumentDbContext`.
- Index strategy:
  - unique `(PeerNodeId, SourceNodeId)`
  - index `IsActive`
  - index `(SourceNodeId, ConfirmedWall, ConfirmedLogic)` for cutoff scans

### 4. Snapshot support

Include peer-confirmation state in snapshot export/import/merge so pruning safety state survives backup/restore.

## Sync and Pruning Behavior Plan

### 5. Auto-register peers when discovered

On each orchestrator loop, before sync attempts:

- collect the merged peer list (discovered + known peers)
- call `EnsurePeerRegisteredAsync(...)` for each peer
- skip the local node

### 6. Advance confirmation watermarks

During a sync session with a peer:

- after vector-clock exchange, advance the watermark for nodes where the remote is already ahead/equal
- after a successful push batch, advance the watermark to the max pushed timestamp/hash per source node
- persist updates atomically per peer when possible

### 7. Gate oplog pruning by peer confirmation

Replace the retention-only prune trigger with a safe cutoff computation (sketched after this list):

- compute the retention cutoff from `OplogRetentionHours` (existing behavior)
- compute the confirmation cutoff as the minimum confirmed point across active tracked peers
- effective cutoff = minimum(retention cutoff, confirmation cutoff)
- prune only to the effective cutoff

If any active tracked peer has no confirmation for relevant source nodes, do not prune those ranges.
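A minimal sketch of this cutoff rule in C#. `SafeCutoffCalculator`, `HlcTimestamp`, and the flattened `PeerOplogConfirmation` record are illustrative stand-ins for the shapes proposed in steps 1-2 (none of them exist in the codebase yet), and expressing the retention window as an HLC wall value in Unix milliseconds is an assumption made only to keep the example self-contained:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public readonly record struct HlcTimestamp(long Wall, long Logic) : IComparable<HlcTimestamp>
{
    public int CompareTo(HlcTimestamp other) =>
        Wall != other.Wall ? Wall.CompareTo(other.Wall) : Logic.CompareTo(other.Logic);

    public static HlcTimestamp Min(HlcTimestamp a, HlcTimestamp b) => a.CompareTo(b) <= 0 ? a : b;
}

// Simplified confirmation record mirroring the step 1 model.
public sealed record PeerOplogConfirmation(
    string PeerNodeId, string SourceNodeId, HlcTimestamp Confirmed, bool IsActive);

public sealed class SafeCutoffCalculator
{
    // Returns the effective prune cutoff per source node, or null when pruning for
    // that source must be skipped because an active tracked peer has not confirmed it.
    public IReadOnlyDictionary<string, HlcTimestamp?> ComputeCutoffs(
        IEnumerable<string> sourceNodeIds,
        IReadOnlyCollection<string> activePeerIds,
        IReadOnlyCollection<PeerOplogConfirmation> confirmations,
        TimeSpan retentionWindow,
        DateTimeOffset nowUtc)
    {
        // Retention cutoff (existing behavior), expressed here as an HLC lower bound.
        var retentionCutoff = new HlcTimestamp(
            nowUtc.Subtract(retentionWindow).ToUnixTimeMilliseconds(), 0);

        var result = new Dictionary<string, HlcTimestamp?>();
        foreach (var sourceId in sourceNodeIds)
        {
            HlcTimestamp? cutoff = retentionCutoff;
            foreach (var peerId in activePeerIds)
            {
                var confirmation = confirmations.FirstOrDefault(c =>
                    c.IsActive && c.PeerNodeId == peerId && c.SourceNodeId == sourceId);

                if (confirmation is null)
                {
                    // An active tracked peer has no confirmation for this source: block pruning.
                    cutoff = null;
                    break;
                }

                // The effective cutoff never moves past the least-confirmed active peer,
                // i.e. min(retention cutoff, confirmation cutoff).
                cutoff = HlcTimestamp.Min(cutoff.Value, confirmation.Confirmed);
            }

            result[sourceId] = cutoff;
        }

        return result;
    }
}
```

A `null` cutoff for a source node is the explicit signal to skip pruning that range, which keeps the "unconfirmed active peer blocks pruning" rule visible at the call site; with no active tracked peers, the calculator degrades to the existing retention-only behavior.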
### 8. Deprecated peer removal path

Provide an explicit management operation to unblock pruning for decommissioned peers:

- add a method in the management service (example: `RemovePeerTrackingAsync(nodeId, removeRemoteConfig = true)`)
- remove from the confirmation tracking store
- optionally remove static peer configuration
- document the operator workflow for node deprecation

## Hosting Health Check Plan

### 9. Extend hosting health check payload

Update the Hosting `CBDDCHealthCheck` to include peer sync status data:

- tracked peer count
- peers with no confirmation
- max lag (ms) between local head and peer confirmation
- lagging peer list (node IDs)
- last successful confirmation update per peer

### 10. Health status policy

- `Healthy`: persistence OK and all active tracked peers within lag threshold
- `Degraded`: persistence OK but one or more peers lagging/unconfirmed
- `Unhealthy`: persistence unavailable, or critical lag breach (configurable)

Add configurable thresholds in hosting options/cluster options (see the classification sketch below).
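To make the status policy concrete, a hedged sketch of a classification helper the hosting check could delegate to. `PeerSyncSummary`, `PeerSyncHealthPolicy`, and the two threshold values are assumptions for illustration; only `HealthCheckResult` is the real type from `Microsoft.Extensions.Diagnostics.HealthChecks`, and the data payload mirrors the fields listed in step 9:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical snapshot of peer sync state, assembled from the confirmation store.
public sealed record PeerSyncSummary(
    bool PersistenceAvailable,
    int TrackedPeerCount,
    IReadOnlyList<string> UnconfirmedPeers,
    IReadOnlyDictionary<string, TimeSpan> LagByPeer);

public sealed class PeerSyncHealthPolicy
{
    private readonly TimeSpan _degradedLagThreshold;
    private readonly TimeSpan _criticalLagThreshold;

    public PeerSyncHealthPolicy(TimeSpan degradedLagThreshold, TimeSpan criticalLagThreshold)
    {
        _degradedLagThreshold = degradedLagThreshold;
        _criticalLagThreshold = criticalLagThreshold;
    }

    public HealthCheckResult Evaluate(PeerSyncSummary summary)
    {
        var laggingPeers = summary.LagByPeer
            .Where(kv => kv.Value > _degradedLagThreshold)
            .Select(kv => kv.Key)
            .ToArray();

        // Payload exposed by the health endpoint (step 9 fields).
        var data = new Dictionary<string, object>
        {
            ["trackedPeerCount"] = summary.TrackedPeerCount,
            ["unconfirmedPeers"] = summary.UnconfirmedPeers,
            ["laggingPeers"] = laggingPeers,
            ["maxLagMs"] = summary.LagByPeer.Count == 0
                ? 0L
                : (long)summary.LagByPeer.Values.Max().TotalMilliseconds
        };

        if (!summary.PersistenceAvailable)
            return HealthCheckResult.Unhealthy("Oplog store unavailable.", data: data);

        if (summary.LagByPeer.Values.Any(lag => lag > _criticalLagThreshold))
            return HealthCheckResult.Unhealthy("Peer sync lag exceeds critical threshold.", data: data);

        if (laggingPeers.Length > 0 || summary.UnconfirmedPeers.Count > 0)
            return HealthCheckResult.Degraded("One or more peers lagging or unconfirmed.", data: data);

        return HealthCheckResult.Healthy("All tracked peers within lag threshold.", data: data);
    }
}
```

Persistence failure or a critical lag breach maps to `Unhealthy`, lagging or unconfirmed peers map to `Degraded`, and everything else is `Healthy`, matching the policy above.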
## Implementation Phases

### Phase 1: Persistence and contracts

- Add model + store interface + BLite implementation + DI wiring.
- Add tests for CRUD, idempotent register, and explicit remove.

### Phase 2: Sync integration

- Register peers from discovery.
- Update confirmations from vector clock + push success.
- Add sync tests validating watermark advancement.

### Phase 3: Safe pruning

- Implement cutoff calculator service.
- Integrate with orchestrator maintenance path.
- Add two-node tests proving no prune before peer confirmation.

### Phase 4: Management and health

- Expose remove-tracking operation in peer management API.
- Extend hosting healthcheck output and status policy.
- Add hosting healthcheck tests for Healthy/Degraded/Unhealthy.

### Phase 5: Docs and rollout

- Update docs for peer lifecycle and pruning semantics.
- Add upgrade notes and operational runbook for peer deprecation.

## Safe Subagent Usage

- Use subagents only for isolated, low-coupling tasks with clear file ownership boundaries.
- Assign each subagent a narrow scope (one component or one test suite at a time).
- Require explicit task contracts for each subagent, including input files/components, expected output, and forbidden operations.
- Prohibit destructive repository actions by subagents (`reset --hard`, force-push, history rewrite, broad file deletion).
- Require subagents to report what changed, why, and which tests were run.
- Do not accept subagent-authored changes directly into final output without primary-agent review.

## Mandatory Verification After Subagent Work

- Enforce a verification gate after every subagent-delivered change before integration.
- Verification gate checklist:
  - independently review the produced diff
  - run targeted unit/integration tests for touched behavior
  - validate impacted acceptance criteria
  - confirm no regressions in pruning safety, peer lifecycle handling, or healthcheck output
- Reject or rework any subagent output that fails verification.
- Only merge/integrate subagent output after verification evidence is documented.

## Test Plan (Minimum)

- Unit:
  - confirmation store upsert/get/remove behavior
  - auto-register is idempotent
  - safe cutoff computation with mixed peer states (a starter test sketch appears at the end of this plan)
  - removing a peer from tracking immediately changes cutoff eligibility
  - healthcheck classification rules
- Integration (two-node focus):
  - Node B offline: Node A does not prune the confirmed-required range
  - Node B catches up: Node A prunes once confirmed
  - New node join auto-registers without manual call
  - Deprecated node removal unblocks pruning

## Risks and Mitigations

- Risk: indefinite growth if a peer never confirms.
  - Mitigation: explicit removal workflow and degraded health visibility.
- Risk: confirmation drift after restore/restart.
  - Mitigation: snapshot persistence of confirmation records.
- Risk: mixed-version cluster behavior.
  - Mitigation: rely on existing vector clock exchange first; keep protocol additions backward compatible if introduced later.

## Acceptance Criteria

- Oplog entries are not pruned while any active tracked peer has not confirmed required ranges.
- Newly discovered peers are automatically present in tracking without operator action.
- Operators can explicitly remove a deprecated peer from tracking, and pruning resumes accordingly.
- Hosting health endpoint exposes peer sync lag/confirmation status and returns degraded/unhealthy when appropriate.
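As a starting point for the "safe cutoff computation with mixed peer states" unit test listed in the test plan, a sketch that reuses the hypothetical `SafeCutoffCalculator`, `HlcTimestamp`, and `PeerOplogConfirmation` shapes from the earlier sketch; xUnit is assumed here and should be swapped for whatever test framework the repository already uses:

```csharp
using System;
using Xunit;

public class SafeCutoffCalculatorTests
{
    [Fact]
    public void UnconfirmedActivePeerBlocksPruningForThatSource()
    {
        var calc = new SafeCutoffCalculator();
        var now = DateTimeOffset.UtcNow;

        var confirmations = new[]
        {
            // Peer B has confirmed source "node-1"; peer C has not confirmed anything yet.
            new PeerOplogConfirmation(
                "peer-b", "node-1", new HlcTimestamp(now.ToUnixTimeMilliseconds(), 0), IsActive: true)
        };

        var cutoffs = calc.ComputeCutoffs(
            sourceNodeIds: new[] { "node-1" },
            activePeerIds: new[] { "peer-b", "peer-c" },
            confirmations: confirmations,
            retentionWindow: TimeSpan.FromHours(24),
            nowUtc: now);

        // Peer C is active but unconfirmed, so no cutoff is produced and nothing is pruned.
        Assert.Null(cutoffs["node-1"]);
    }
}
```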