# Plan: Peer-Confirmed Oplog Retention and Peer Sync Health

## Objective

Move from time-only oplog pruning to peer-confirmed pruning so entries are not removed until peers have confirmed them, while also adding peer lifecycle management and health visibility.

## Requested Outcomes

- Oplogs are not cleared until confirmed by peers.
- Each node tracks the latest oplog confirmation per peer.
- New peers are automatically registered when they join.
- Deprecated peers can be explicitly removed from tracking.
- Hosting health checks report peer sync status, not only store availability.

## Current Baseline (Codebase)

- Pruning is retention-time based in `SyncOrchestrator` + `IOplogStore.PruneOplogAsync(...)`.
- Push ACK is binary success/fail (`AckResponse`) and does not persist peer confirmation state.
- Peer discovery exists (`IDiscoveryService`, `UdpDiscoveryService`, `CompositeDiscoveryService`).
- Persistent remote peer config exists (`IPeerConfigurationStore`, `PeerManagementService`).
- Hosting health check only validates oplog access (`CBDDCHealthCheck` in the Hosting project).

## Design Decisions

- Track confirmation as a persisted watermark per `(peerNodeId, sourceNodeId)` using HLC and hash.
- Use existing vector-clock exchange and successful push results to advance confirmations (no mandatory wire protocol break required).
- Treat tracked peers as pruning blockers until explicitly removed.
- Keep peer registration idempotent and safe for repeated discovery events.

## Data Model and Persistence Plan

### 1. Add peer confirmation tracking model

Create a new persisted model (example name: `PeerOplogConfirmation`) with fields:

- `PeerNodeId`
- `SourceNodeId`
- `ConfirmedWall`
- `ConfirmedLogic`
- `ConfirmedHash`
- `LastConfirmedUtc`
- `IsActive`

### 2. Add store abstraction

Add `IPeerOplogConfirmationStore` with operations:

- `EnsurePeerRegisteredAsync(peerNodeId, address, type)`
- `UpdateConfirmationAsync(peerNodeId, sourceNodeId, timestamp, hash)`
- `GetConfirmationsAsync()` and `GetConfirmationsForPeerAsync(peerNodeId)`
- `RemovePeerTrackingAsync(peerNodeId)`
- `GetActiveTrackedPeersAsync()`

### 3. BLite implementation

- Add entity, mapper, and indexed collection to `CBDDCDocumentDbContext`.
- Index strategy:
  - unique `(PeerNodeId, SourceNodeId)`
  - index `IsActive`
  - index `(SourceNodeId, ConfirmedWall, ConfirmedLogic)` for cutoff scans

### 4. Snapshot support

Include peer-confirmation state in snapshot export/import/merge so pruning safety state survives backup/restore.

## Sync and Pruning Behavior Plan

### 5. Auto-register peers when discovered

On each orchestrator loop, before sync attempts:

- collect the merged peer list (discovered + known peers)
- call `EnsurePeerRegisteredAsync(...)` for each peer
- skip the local node

### 6. Advance confirmation watermarks

During a sync session with a peer:

- after vector-clock exchange, advance the watermark for nodes where the remote is already ahead/equal
- after a successful push batch, advance the watermark to the max pushed timestamp/hash per source node
- persist updates atomically per peer when possible

### 7. Gate oplog pruning by peer confirmation

Replace the retention-only prune trigger with a safe cutoff computation (sketched after this list):

- compute the retention cutoff from `OplogRetentionHours` (existing behavior)
- compute the confirmation cutoff as the minimum confirmed point across active tracked peers
- effective cutoff = minimum(retention cutoff, confirmation cutoff)
- prune only to the effective cutoff

If any active tracked peer has no confirmation for relevant source nodes, do not prune those ranges.
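A minimal sketch of this cutoff rule in C#. `SafeCutoffCalculator`, `HlcTimestamp`, and the flattened `PeerOplogConfirmation` record are illustrative stand-ins for the shapes proposed in steps 1-2 (none of them exist in the codebase yet), and expressing the retention window as an HLC wall value in Unix milliseconds is an assumption made only to keep the example self-contained:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public readonly record struct HlcTimestamp(long Wall, long Logic) : IComparable<HlcTimestamp>
{
    public int CompareTo(HlcTimestamp other) =>
        Wall != other.Wall ? Wall.CompareTo(other.Wall) : Logic.CompareTo(other.Logic);

    public static HlcTimestamp Min(HlcTimestamp a, HlcTimestamp b) => a.CompareTo(b) <= 0 ? a : b;
}

// Simplified confirmation record mirroring the step 1 model.
public sealed record PeerOplogConfirmation(
    string PeerNodeId, string SourceNodeId, HlcTimestamp Confirmed, bool IsActive);

public sealed class SafeCutoffCalculator
{
    // Returns the effective prune cutoff per source node, or null when pruning for
    // that source must be skipped because an active tracked peer has not confirmed it.
    public IReadOnlyDictionary<string, HlcTimestamp?> ComputeCutoffs(
        IEnumerable<string> sourceNodeIds,
        IReadOnlyCollection<string> activePeerIds,
        IReadOnlyCollection<PeerOplogConfirmation> confirmations,
        TimeSpan retentionWindow,
        DateTimeOffset nowUtc)
    {
        // Retention cutoff (existing behavior), expressed here as an HLC lower bound.
        var retentionCutoff = new HlcTimestamp(
            nowUtc.Subtract(retentionWindow).ToUnixTimeMilliseconds(), 0);

        var result = new Dictionary<string, HlcTimestamp?>();
        foreach (var sourceId in sourceNodeIds)
        {
            HlcTimestamp? cutoff = retentionCutoff;
            foreach (var peerId in activePeerIds)
            {
                var confirmation = confirmations.FirstOrDefault(c =>
                    c.IsActive && c.PeerNodeId == peerId && c.SourceNodeId == sourceId);

                if (confirmation is null)
                {
                    // An active tracked peer has no confirmation for this source: block pruning.
                    cutoff = null;
                    break;
                }

                // The effective cutoff never moves past the least-confirmed active peer,
                // i.e. min(retention cutoff, confirmation cutoff).
                cutoff = HlcTimestamp.Min(cutoff.Value, confirmation.Confirmed);
            }

            result[sourceId] = cutoff;
        }

        return result;
    }
}
```

A `null` cutoff for a source node is the explicit signal to skip pruning that range, which keeps the "unconfirmed active peer blocks pruning" rule visible at the call site; with no active tracked peers, the calculator degrades to the existing retention-only behavior.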
### 8. Deprecated peer removal path

Provide an explicit management operation to unblock pruning for decommissioned peers:

- add a method in the management service (example: `RemovePeerTrackingAsync(nodeId, removeRemoteConfig = true)`)
- remove from the confirmation tracking store
- optionally remove static peer configuration
- document the operator workflow for node deprecation

## Hosting Health Check Plan

### 9. Extend hosting health check payload

Update the Hosting `CBDDCHealthCheck` to include peer sync status data:

- tracked peer count
- peers with no confirmation
- max lag (ms) between local head and peer confirmation
- lagging peer list (node IDs)
- last successful confirmation update per peer

### 10. Health status policy

- `Healthy`: persistence OK and all active tracked peers within lag threshold
- `Degraded`: persistence OK but one or more peers lagging/unconfirmed
- `Unhealthy`: persistence unavailable, or critical lag breach (configurable)

Add configurable thresholds in hosting options/cluster options (see the classification sketch below).
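To make the status policy concrete, a hedged sketch of a classification helper the hosting check could delegate to. `PeerSyncSummary`, `PeerSyncHealthPolicy`, and the two threshold values are assumptions for illustration; only `HealthCheckResult` is the real type from `Microsoft.Extensions.Diagnostics.HealthChecks`, and the data payload mirrors the fields listed in step 9:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical snapshot of peer sync state, assembled from the confirmation store.
public sealed record PeerSyncSummary(
    bool PersistenceAvailable,
    int TrackedPeerCount,
    IReadOnlyList<string> UnconfirmedPeers,
    IReadOnlyDictionary<string, TimeSpan> LagByPeer);

public sealed class PeerSyncHealthPolicy
{
    private readonly TimeSpan _degradedLagThreshold;
    private readonly TimeSpan _criticalLagThreshold;

    public PeerSyncHealthPolicy(TimeSpan degradedLagThreshold, TimeSpan criticalLagThreshold)
    {
        _degradedLagThreshold = degradedLagThreshold;
        _criticalLagThreshold = criticalLagThreshold;
    }

    public HealthCheckResult Evaluate(PeerSyncSummary summary)
    {
        var laggingPeers = summary.LagByPeer
            .Where(kv => kv.Value > _degradedLagThreshold)
            .Select(kv => kv.Key)
            .ToArray();

        // Payload exposed by the health endpoint (step 9 fields).
        var data = new Dictionary<string, object>
        {
            ["trackedPeerCount"] = summary.TrackedPeerCount,
            ["unconfirmedPeers"] = summary.UnconfirmedPeers,
            ["laggingPeers"] = laggingPeers,
            ["maxLagMs"] = summary.LagByPeer.Count == 0
                ? 0L
                : (long)summary.LagByPeer.Values.Max().TotalMilliseconds
        };

        if (!summary.PersistenceAvailable)
            return HealthCheckResult.Unhealthy("Oplog store unavailable.", data: data);

        if (summary.LagByPeer.Values.Any(lag => lag > _criticalLagThreshold))
            return HealthCheckResult.Unhealthy("Peer sync lag exceeds critical threshold.", data: data);

        if (laggingPeers.Length > 0 || summary.UnconfirmedPeers.Count > 0)
            return HealthCheckResult.Degraded("One or more peers lagging or unconfirmed.", data: data);

        return HealthCheckResult.Healthy("All tracked peers within lag threshold.", data: data);
    }
}
```

Persistence failure or a critical lag breach maps to `Unhealthy`, lagging or unconfirmed peers map to `Degraded`, and everything else is `Healthy`, matching the policy above.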
## Implementation Phases

### Phase 1: Persistence and contracts

- Add model + store interface + BLite implementation + DI wiring.
- Add tests for CRUD, idempotent register, and explicit remove.

### Phase 2: Sync integration

- Register peers from discovery.
- Update confirmations from vector clock + push success.
- Add sync tests validating watermark advancement.

### Phase 3: Safe pruning

- Implement cutoff calculator service.
- Integrate with orchestrator maintenance path.
- Add two-node tests proving no prune before peer confirmation.

### Phase 4: Management and health

- Expose remove-tracking operation in peer management API.
- Extend hosting healthcheck output and status policy.
- Add hosting healthcheck tests for Healthy/Degraded/Unhealthy.

### Phase 5: Docs and rollout

- Update docs for peer lifecycle and pruning semantics.
- Add upgrade notes and operational runbook for peer deprecation.

## Safe Subagent Usage

- Use subagents only for isolated, low-coupling tasks with clear file ownership boundaries.
- Assign each subagent a narrow scope (one component or one test suite at a time).
- Require explicit task contracts for each subagent, including input files/components, expected output, and forbidden operations.
- Prohibit destructive repository actions by subagents (`reset --hard`, force-push, history rewrite, broad file deletion).
- Require subagents to report what changed, why, and which tests were run.
- Do not accept subagent-authored changes directly into final output without primary-agent review.

## Mandatory Verification After Subagent Work

- Enforce a verification gate after every subagent-delivered change before integration.
- Verification gate checklist:
  - independently review the produced diff
  - run targeted unit/integration tests for touched behavior
  - validate impacted acceptance criteria
  - confirm no regressions in pruning safety, peer lifecycle handling, or healthcheck output
- Reject or rework any subagent output that fails verification.
- Only merge/integrate subagent output after verification evidence is documented.

## Test Plan (Minimum)

- Unit:
  - confirmation store upsert/get/remove behavior
  - auto-register is idempotent
  - safe cutoff computation with mixed peer states (a starter test sketch appears at the end of this plan)
  - removing a peer from tracking immediately changes cutoff eligibility
  - healthcheck classification rules
- Integration (two-node focus):
  - Node B offline: Node A does not prune the confirmed-required range
  - Node B catches up: Node A prunes once confirmed
  - New node join auto-registers without manual call
  - Deprecated node removal unblocks pruning

## Risks and Mitigations

- Risk: indefinite growth if a peer never confirms.
  - Mitigation: explicit removal workflow and degraded health visibility.
- Risk: confirmation drift after restore/restart.
  - Mitigation: snapshot persistence of confirmation records.
- Risk: mixed-version cluster behavior.
  - Mitigation: rely on existing vector clock exchange first; keep protocol additions backward compatible if introduced later.

## Acceptance Criteria

- Oplog entries are not pruned while any active tracked peer has not confirmed required ranges.
- Newly discovered peers are automatically present in tracking without operator action.
- Operators can explicitly remove a deprecated peer from tracking, and pruning resumes accordingly.
- Hosting health endpoint exposes peer sync lag/confirmation status and returns degraded/unhealthy when appropriate.
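As a starting point for the "safe cutoff computation with mixed peer states" unit test listed in the test plan, a sketch that reuses the hypothetical `SafeCutoffCalculator`, `HlcTimestamp`, and `PeerOplogConfirmation` shapes from the earlier sketch; xUnit is assumed here and should be swapped for whatever test framework the repository already uses:

```csharp
using System;
using Xunit;

public class SafeCutoffCalculatorTests
{
    [Fact]
    public void UnconfirmedActivePeerBlocksPruningForThatSource()
    {
        var calc = new SafeCutoffCalculator();
        var now = DateTimeOffset.UtcNow;

        var confirmations = new[]
        {
            // Peer B has confirmed source "node-1"; peer C has not confirmed anything yet.
            new PeerOplogConfirmation(
                "peer-b", "node-1", new HlcTimestamp(now.ToUnixTimeMilliseconds(), 0), IsActive: true)
        };

        var cutoffs = calc.ComputeCutoffs(
            sourceNodeIds: new[] { "node-1" },
            activePeerIds: new[] { "peer-b", "peer-c" },
            confirmations: confirmations,
            retentionWindow: TimeSpan.FromHours(24),
            nowUtc: now);

        // Peer C is active but unconfirmed, so no cutoff is produced and nothing is pruned.
        Assert.Null(cutoffs["node-1"]);
    }
}
```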