Plan: Peer-Confirmed Oplog Retention and Peer Sync Health

Objective

Move from time-only oplog pruning to peer-confirmed pruning, so oplog entries are removed only after peers have confirmed them, and add peer lifecycle management and peer sync health visibility.

Requested Outcomes

  • Oplogs are not cleared until confirmed by peers.
  • Each node tracks latest oplog confirmation per peer.
  • New peers are automatically registered when they join.
  • Deprecated peers can be explicitly removed from tracking.
  • Hosting health checks report peer sync status, not only store availability.

Current Baseline (Codebase)

  • Pruning is retention-time based, implemented in SyncOrchestrator and IOplogStore.PruneOplogAsync(...).
  • Push ACK is binary success/fail (AckResponse) and does not persist peer confirmation state.
  • Peer discovery exists (IDiscoveryService, UdpDiscoveryService, CompositeDiscoveryService).
  • Persistent remote peer config exists (IPeerConfigurationStore, PeerManagementService).
  • Hosting health check only validates oplog access (CBDDCHealthCheck in Hosting project).

Design Decisions

  • Track confirmation as a persisted watermark per (peerNodeId, sourceNodeId) pair, expressed as an HLC timestamp plus entry hash.
  • Use existing vector-clock exchange and successful push results to advance confirmations (no mandatory wire protocol break required).
  • Treat tracked peers as pruning blockers until explicitly removed.
  • Keep peer registration idempotent and safe for repeated discovery events.

Data Model and Persistence Plan

1. Add peer confirmation tracking model

Create a new persisted model (example name: PeerOplogConfirmation) with the following fields (see the sketch after this list):

  • PeerNodeId
  • SourceNodeId
  • ConfirmedWall
  • ConfirmedLogic
  • ConfirmedHash
  • LastConfirmedUtc
  • IsActive
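
A minimal sketch of this model, assuming the HLC timestamp is persisted as separate wall/logic fields; exact types should follow the existing oplog entities:

```csharp
// Sketch only: field types and defaults are assumptions, not the final entity.
public sealed class PeerOplogConfirmation
{
    public string PeerNodeId { get; set; } = "";    // peer that confirmed the entries
    public string SourceNodeId { get; set; } = "";  // node that originated the entries
    public long ConfirmedWall { get; set; }         // HLC wall-clock component of the watermark
    public long ConfirmedLogic { get; set; }        // HLC logical counter of the watermark
    public string ConfirmedHash { get; set; } = ""; // oplog entry hash at the watermark
    public DateTime LastConfirmedUtc { get; set; }  // when the watermark last advanced
    public bool IsActive { get; set; } = true;      // false once the peer is removed from tracking
}
```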

2. Add store abstraction

Add IPeerOplogConfirmationStore with the following operations (see the sketch after this list):

  • EnsurePeerRegisteredAsync(peerNodeId, address, type)
  • UpdateConfirmationAsync(peerNodeId, sourceNodeId, timestamp, hash)
  • GetConfirmationsAsync() and GetConfirmationsForPeerAsync(peerNodeId)
  • RemovePeerTrackingAsync(peerNodeId)
  • GetActiveTrackedPeersAsync()
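
A possible shape for this abstraction; the operation names follow the list above, while parameter and return types are assumptions to reconcile with the existing store contracts:

```csharp
// Lightweight HLC pair used throughout these sketches; substitute the
// codebase's real HLC type if one already exists.
public readonly record struct HlcTimestamp(long Wall, long Logic);

public interface IPeerOplogConfirmationStore
{
    // Idempotent: safe to call on every discovery event for the same peer.
    Task EnsurePeerRegisteredAsync(string peerNodeId, string address, string type, CancellationToken ct = default);

    // Advances the (peerNodeId, sourceNodeId) watermark; should never move it backward.
    Task UpdateConfirmationAsync(string peerNodeId, string sourceNodeId, HlcTimestamp timestamp, string hash, CancellationToken ct = default);

    Task<IReadOnlyList<PeerOplogConfirmation>> GetConfirmationsAsync(CancellationToken ct = default);
    Task<IReadOnlyList<PeerOplogConfirmation>> GetConfirmationsForPeerAsync(string peerNodeId, CancellationToken ct = default);

    // Removes (or marks inactive) a peer so it no longer blocks pruning.
    Task RemovePeerTrackingAsync(string peerNodeId, CancellationToken ct = default);

    Task<IReadOnlyList<string>> GetActiveTrackedPeersAsync(CancellationToken ct = default);
}
```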

3. BLite implementation

  • Add entity, mapper, and indexed collection to CBDDCDocumentDbContext.
  • Index strategy:
      • unique index on (PeerNodeId, SourceNodeId)
      • index on IsActive
      • index on (SourceNodeId, ConfirmedWall, ConfirmedLogic) for cutoff scans

4. Snapshot support

Include peer-confirmation state in snapshot export/import/merge so pruning safety state survives backup/restore.

Sync and Pruning Behavior Plan

5. Auto-register peers when discovered

On each orchestrator loop, before any sync attempt (see the sketch after this list):

  • collect merged peer list (discovered + known peers)
  • call EnsurePeerRegisteredAsync(...) for each peer
  • skip local node
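
A sketch of that pass; the GetPeersAsync helpers and the PeerInfo shape (NodeId/Address/Type) are placeholders for whatever the discovery and configuration services actually expose:

```csharp
// Runs at the top of each orchestrator loop, before any sync attempt.
// _discoveryService, _peerConfigurationStore, _confirmationStore, and
// _localNodeId are assumed injected fields; the service calls are placeholders.
private async Task RegisterKnownPeersAsync(CancellationToken ct)
{
    var discovered = await _discoveryService.GetPeersAsync(ct);        // assumed helper
    var configured = await _peerConfigurationStore.GetPeersAsync(ct);  // assumed helper

    foreach (var peer in discovered.Concat(configured).DistinctBy(p => p.NodeId))
    {
        if (peer.NodeId == _localNodeId)
            continue; // never track the local node

        // Idempotent, so repeated discovery events for the same peer are harmless.
        await _confirmationStore.EnsurePeerRegisteredAsync(peer.NodeId, peer.Address, peer.Type, ct);
    }
}
```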

6. Advance confirmation watermarks

During a sync session with a peer (see the sketch after this list):

  • after the vector-clock exchange, advance the watermark for source nodes where the remote clock is already equal to or ahead of the local one
  • after each successful push batch, advance the watermark to the highest pushed timestamp (and its hash) per source node
  • persist updates atomically per peer when possible
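
For the push path, a minimal sketch, assuming pushed entries expose SourceNodeId, Wall, Logic, and Hash, and that peerNodeId, pushedBatch, and ct come from the surrounding sync-session scope:

```csharp
// After the peer acknowledges a push batch, advance its watermark to the
// newest entry pushed per source node.
foreach (var group in pushedBatch.GroupBy(e => e.SourceNodeId))
{
    var newest = group.OrderBy(e => e.Wall).ThenBy(e => e.Logic).Last();

    await _confirmationStore.UpdateConfirmationAsync(
        peerNodeId,
        group.Key,
        new HlcTimestamp(newest.Wall, newest.Logic),
        newest.Hash,
        ct);
}
```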

7. Gate oplog pruning by peer confirmation

Replace the retention-only prune trigger with a safe cutoff computation (see the sketch after this list):

  • compute retention cutoff from OplogRetentionHours (existing behavior)
  • compute confirmation cutoff as the minimum confirmed point across active tracked peers
  • effective cutoff = minimum(retention cutoff, confirmation cutoff)
  • prune only to effective cutoff

If any active tracked peer has no confirmation for relevant source nodes, do not prune those ranges.
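
A sketch of the cutoff calculator, reusing the PeerOplogConfirmation and HlcTimestamp shapes from the earlier sketches; a null result means the source's range must not be pruned at all:

```csharp
public static class SafeCutoffCalculator
{
    // Effective cutoff = min(retention cutoff, lowest confirmation across
    // active tracked peers). Null: an active peer has not confirmed this source.
    public static HlcTimestamp? ComputeEffectiveCutoff(
        HlcTimestamp retentionCutoff,
        string sourceNodeId,
        IReadOnlyCollection<string> activePeerIds,
        IReadOnlyList<PeerOplogConfirmation> confirmations)
    {
        HlcTimestamp? confirmationCutoff = null;

        foreach (var peerId in activePeerIds)
        {
            var row = confirmations.FirstOrDefault(c =>
                c.IsActive && c.PeerNodeId == peerId && c.SourceNodeId == sourceNodeId);

            if (row is null)
                return null; // unconfirmed peer blocks pruning for this source

            var confirmed = new HlcTimestamp(row.ConfirmedWall, row.ConfirmedLogic);
            if (confirmationCutoff is null || Compare(confirmed, confirmationCutoff.Value) < 0)
                confirmationCutoff = confirmed;
        }

        // No active tracked peers: fall back to retention-only behavior.
        if (confirmationCutoff is null)
            return retentionCutoff;

        return Compare(confirmationCutoff.Value, retentionCutoff) < 0
            ? confirmationCutoff
            : retentionCutoff;
    }

    // HLC ordering: wall clock first, logical counter as tie-breaker.
    private static int Compare(HlcTimestamp a, HlcTimestamp b) =>
        a.Wall != b.Wall ? a.Wall.CompareTo(b.Wall) : a.Logic.CompareTo(b.Logic);
}
```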

8. Deprecated peer removal path

Provide an explicit management operation to unblock pruning for decommissioned peers (see the sketch after this list):

  • add a method to the management service (example: RemovePeerTrackingAsync(nodeId, removeRemoteConfig = true))
  • remove from confirmation tracking store
  • optionally remove static peer configuration
  • document operator workflow for node deprecation
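
A possible management-service method; the RemovePeerAsync call on the configuration store is an assumed name:

```csharp
// Explicit operator action: stop tracking a decommissioned peer so it no
// longer blocks pruning, optionally dropping its static configuration too.
public async Task RemovePeerTrackingAsync(string nodeId, bool removeRemoteConfig = true, CancellationToken ct = default)
{
    await _confirmationStore.RemovePeerTrackingAsync(nodeId, ct);

    if (removeRemoteConfig)
        await _peerConfigurationStore.RemovePeerAsync(nodeId, ct); // assumed method name
}
```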

Hosting Health Check Plan

9. Extend hosting health check payload

Update the Hosting CBDDCHealthCheck to include peer sync status data (see the payload sketch after this list):

  • tracked peer count
  • peers with no confirmation
  • max lag (ms) between local head and peer confirmation
  • lagging peer list (node IDs)
  • last successful confirmation update per peer
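
A sketch of the payload, assuming the standard Microsoft.Extensions.Diagnostics.HealthChecks data dictionary; the key names and the precomputed variables (trackedPeers, unconfirmed, maxLagMs, lagging, lastConfirmedByPeer) are placeholders:

```csharp
// Extra data attached to the health check result; values are assumed to be
// computed from the confirmation store earlier in the check.
var data = new Dictionary<string, object>
{
    ["trackedPeerCount"] = trackedPeers.Count,
    ["unconfirmedPeers"] = unconfirmed,           // peers with no confirmation rows
    ["maxPeerLagMs"] = maxLagMs,                  // local head vs. slowest peer confirmation
    ["laggingPeers"] = lagging,                   // node IDs over the lag threshold
    ["lastConfirmationUtcByPeer"] = lastConfirmedByPeer,
};
```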

10. Health status policy

  • Healthy: persistence OK and all active tracked peers within lag threshold
  • Degraded: persistence OK but one or more peers lagging/unconfirmed
  • Unhealthy: persistence unavailable, or critical lag breach (configurable)

Add configurable thresholds in hosting options/cluster options; a classification sketch follows.
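
In the sketch below, DegradedLagMs and CriticalLagMs are assumed option names, and persistenceOk, maxLagMs, unconfirmed, and data come from the surrounding check:

```csharp
if (!persistenceOk)
    return HealthCheckResult.Unhealthy("oplog store unavailable", data: data);

if (maxLagMs >= options.CriticalLagMs)
    return HealthCheckResult.Unhealthy($"peer lag {maxLagMs} ms breaches critical threshold", data: data);

if (unconfirmed.Count > 0 || maxLagMs >= options.DegradedLagMs)
    return HealthCheckResult.Degraded("one or more peers lagging or unconfirmed", data: data);

return HealthCheckResult.Healthy("all active tracked peers within lag threshold", data: data);
```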

Implementation Phases

Phase 1: Persistence and contracts

  • Add model + store interface + BLite implementation + DI wiring.
  • Add tests for CRUD, idempotent register, and explicit remove.

Phase 2: Sync integration

  • Register peers from discovery.
  • Update confirmations from vector clock + push success.
  • Add sync tests validating watermark advancement.

Phase 3: Safe pruning

  • Implement cutoff calculator service.
  • Integrate with orchestrator maintenance path.
  • Add two-node tests proving no prune before peer confirmation.

Phase 4: Management and health

  • Expose remove-tracking operation in peer management API.
  • Extend hosting healthcheck output and status policy.
  • Add hosting healthcheck tests for Healthy/Degraded/Unhealthy.

Phase 5: Docs and rollout

  • Update docs for peer lifecycle and pruning semantics.
  • Add upgrade notes and operational runbook for peer deprecation.

Safe Subagent Usage

  • Use subagents only for isolated, low-coupling tasks with clear file ownership boundaries.
  • Assign each subagent a narrow scope (one component or one test suite at a time).
  • Require explicit task contracts for each subagent including input files/components, expected output, and forbidden operations.
  • Prohibit destructive repository actions by subagents (reset --hard, force-push, history rewrite, broad file deletion).
  • Require subagents to report what changed, why, and which tests were run.
  • Do not accept subagent-authored changes directly into final output without primary-agent review.

Mandatory Verification After Subagent Work

  • Enforce a verification gate after every subagent-delivered change before integration.
  • Verification gate checklist:
      • independently review the produced diff
      • run targeted unit/integration tests for touched behavior
      • validate impacted acceptance criteria
      • confirm no regressions in pruning safety, peer lifecycle handling, or healthcheck output
  • Reject or rework any subagent output that fails verification.
  • Only merge/integrate subagent output after verification evidence is documented.

Test Plan (Minimum)

  • Unit:
      • confirmation store upsert/get/remove behavior
      • auto-register is idempotent
      • safe cutoff computation with mixed peer states (see the example test after this list)
      • removing a peer from tracking immediately changes cutoff eligibility
      • healthcheck classification rules
  • Integration (two-node focus):
      • Node B offline: Node A does not prune the confirmation-required range
      • Node B catches up: Node A prunes once confirmation advances
      • new node join auto-registers without a manual call
      • deprecated node removal unblocks pruning
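
As an illustration of the mixed-peer-state cutoff test, a minimal xUnit sketch against the SafeCutoffCalculator sketched in step 7:

```csharp
[Fact]
public void Cutoff_is_blocked_while_any_active_peer_is_unconfirmed()
{
    var retention = new HlcTimestamp(Wall: 1_000, Logic: 0);
    var peers = new[] { "peer-b", "peer-c" };

    // peer-b confirmed up to wall 500; peer-c has no confirmation row at all.
    var confirmations = new List<PeerOplogConfirmation>
    {
        new() { PeerNodeId = "peer-b", SourceNodeId = "node-a",
                ConfirmedWall = 500, ConfirmedLogic = 0, IsActive = true },
    };

    var cutoff = SafeCutoffCalculator.ComputeEffectiveCutoff(
        retention, "node-a", peers, confirmations);

    Assert.Null(cutoff); // unconfirmed peer-c blocks pruning entirely
}
```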

Risks and Mitigations

  • Risk: indefinite oplog growth if a peer never confirms.
    Mitigation: explicit removal workflow and degraded health visibility.
  • Risk: confirmation drift after restore/restart.
    Mitigation: snapshot persistence of confirmation records.
  • Risk: mixed-version cluster behavior.
    Mitigation: rely on existing vector-clock exchange first; keep protocol additions backward compatible if introduced later.

Acceptance Criteria

  • Oplog entries are not pruned while any active tracked peer has not confirmed required ranges.
  • Newly discovered peers are automatically present in tracking without operator action.
  • Operators can explicitly remove a deprecated peer from tracking and pruning resumes accordingly.
  • Hosting health endpoint exposes peer sync lag/confirmation status and returns degraded/unhealthy when appropriate.