Initial import of the CBDDC codebase with docs and tests. Add a .NET-focused gitignore to keep generated artifacts out of source control.
# Plan: Peer-Confirmed Oplog Retention and Peer Sync Health

## Objective

Move from time-only oplog pruning to peer-confirmed pruning so entries are not removed until peers have confirmed them, while also adding peer lifecycle management and health visibility.

## Requested Outcomes

- Oplogs are not cleared until confirmed by peers.
- Each node tracks the latest oplog confirmation per peer.
- New peers are automatically registered when they join.
- Deprecated peers can be explicitly removed from tracking.
- Hosting health checks report peer sync status, not only store availability.
## Current Baseline (Codebase)

- Pruning is retention-time based in `SyncOrchestrator` + `IOplogStore.PruneOplogAsync(...)`.
- Push ACK is binary success/fail (`AckResponse`) and does not persist peer confirmation state.
- Peer discovery exists (`IDiscoveryService`, `UdpDiscoveryService`, `CompositeDiscoveryService`).
- Persistent remote peer config exists (`IPeerConfigurationStore`, `PeerManagementService`).
- Hosting health check only validates oplog access (`CBDDCHealthCheck` in the Hosting project).
## Design Decisions

- Track confirmation as a persisted watermark per `(peerNodeId, sourceNodeId)` using HLC and hash.
- Use existing vector-clock exchange and successful push results to advance confirmations (no mandatory wire protocol break required).
- Treat tracked peers as pruning blockers until explicitly removed.
- Keep peer registration idempotent and safe for repeated discovery events.
## Data Model and Persistence Plan

### 1. Add peer confirmation tracking model

Create a new persisted model (example name: `PeerOplogConfirmation`) with the following fields (see the sketch after the list):

- `PeerNodeId`
- `SourceNodeId`
- `ConfirmedWall`
- `ConfirmedLogic`
- `ConfirmedHash`
- `LastConfirmedUtc`
- `IsActive`
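
A minimal sketch of how this model could look in C#; the property names mirror the list above, while the types (and storing the HLC wall/logic parts as separate primitives) are assumptions rather than the final shape:

```csharp
using System;

// Illustrative shape only: property names mirror the field list above, types are assumptions.
public sealed class PeerOplogConfirmation
{
    // Tracked peer that acknowledged the entries.
    public string PeerNodeId { get; set; } = string.Empty;

    // Node whose oplog entries were confirmed (supports multi-source replication).
    public string SourceNodeId { get; set; } = string.Empty;

    // Hybrid logical clock components of the highest confirmed entry,
    // stored as primitives so they can be indexed and compared in range scans.
    public long ConfirmedWall { get; set; }
    public long ConfirmedLogic { get; set; }

    // Hash of the highest confirmed entry, usable as an integrity/tie-break check.
    public string? ConfirmedHash { get; set; }

    // When this watermark last advanced.
    public DateTime LastConfirmedUtc { get; set; }

    // Set to false when the peer is explicitly removed from tracking.
    public bool IsActive { get; set; } = true;
}
```

Keeping wall/logic as separate properties is what makes the `(SourceNodeId, ConfirmedWall, ConfirmedLogic)` index in section 3 usable for cutoff scans.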
### 2. Add store abstraction

Add `IPeerOplogConfirmationStore` with the following operations (see the interface sketch after the list):

- `EnsurePeerRegisteredAsync(peerNodeId, address, type)`
- `UpdateConfirmationAsync(peerNodeId, sourceNodeId, timestamp, hash)`
- `GetConfirmationsAsync()` and `GetConfirmationsForPeerAsync(peerNodeId)`
- `RemovePeerTrackingAsync(peerNodeId)`
- `GetActiveTrackedPeersAsync()`
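
A minimal interface sketch under the same assumptions as the model above (string node IDs, an assumed `HlcTimestamp` value type for the HLC point, and `CancellationToken` plumbing added for illustration):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: parameter and return types are assumptions, not a final contract.
public interface IPeerOplogConfirmationStore
{
    // Idempotent: creates the tracking record if missing, otherwise refreshes address/type.
    Task EnsurePeerRegisteredAsync(string peerNodeId, string address, string type, CancellationToken ct = default);

    // Advances the (peerNodeId, sourceNodeId) watermark; implementations should never move it backwards.
    Task UpdateConfirmationAsync(string peerNodeId, string sourceNodeId, HlcTimestamp timestamp, string hash, CancellationToken ct = default);

    Task<IReadOnlyList<PeerOplogConfirmation>> GetConfirmationsAsync(CancellationToken ct = default);
    Task<IReadOnlyList<PeerOplogConfirmation>> GetConfirmationsForPeerAsync(string peerNodeId, CancellationToken ct = default);

    // Explicit removal path for deprecated peers; unblocks pruning.
    Task RemovePeerTrackingAsync(string peerNodeId, CancellationToken ct = default);

    Task<IReadOnlyList<string>> GetActiveTrackedPeersAsync(CancellationToken ct = default);
}

// Assumed HLC timestamp value type used by the sketches in this plan.
public readonly record struct HlcTimestamp(long Wall, long Logic);
```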
### 3. BLite implementation

- Add entity, mapper, and indexed collection to `CBDDCDocumentDbContext`.
- Index strategy:
  - unique `(PeerNodeId, SourceNodeId)`
  - index `IsActive`
  - index `(SourceNodeId, ConfirmedWall, ConfirmedLogic)` for cutoff scans

### 4. Snapshot support

Include peer-confirmation state in snapshot export/import/merge so pruning-safety state survives backup/restore.

## Sync and Pruning Behavior Plan

### 5. Auto-register peers when discovered

On each orchestrator loop, before sync attempts (see the sketch after the list):

- collect the merged peer list (discovered + known peers)
- call `EnsurePeerRegisteredAsync(...)` for each peer
- skip the local node
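
A minimal sketch of that pass, assuming the orchestrator can hand over a merged peer list and knows its own node ID; the `PeerInfo` record and class name here are illustrative, not existing types:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Assumed shape of a merged discovery/config entry; the real type comes from the discovery services.
public sealed record PeerInfo(string NodeId, string Address, string Type);

public static class PeerRegistrationPass
{
    // Runs at the start of each orchestrator loop, before any sync attempt.
    public static async Task RegisterDiscoveredPeersAsync(
        IEnumerable<PeerInfo> mergedPeers,               // discovered + known peers
        string localNodeId,
        IPeerOplogConfirmationStore confirmations,
        CancellationToken ct)
    {
        foreach (var peer in mergedPeers)
        {
            // Never track the local node itself.
            if (string.Equals(peer.NodeId, localNodeId, StringComparison.Ordinal))
                continue;

            // Idempotent: repeated discovery events for the same peer are harmless.
            await confirmations.EnsurePeerRegisteredAsync(peer.NodeId, peer.Address, peer.Type, ct);
        }
    }
}
```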
### 6. Advance confirmation watermarks

During a sync session with a peer (see the sketch after the list):

- after the vector-clock exchange, advance the watermark for source nodes where the remote is already ahead of or equal to the local clock
- after a successful push batch, advance the watermark to the max pushed timestamp/hash per source node
- persist updates atomically per peer when possible
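
A minimal sketch of the push-side advancement (the vector-clock path would call the same store method); the `OplogEntry` shape shown here is an assumption, standing in for whatever the real oplog entry type exposes:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Assumed per-entry shape; the real oplog entry type already carries source node, HLC, and hash.
public sealed record OplogEntry(string SourceNodeId, HlcTimestamp Timestamp, string Hash);

public static class ConfirmationWatermarks
{
    // After a push batch the peer acknowledged, advance its watermark per source node
    // to the newest entry that was pushed.
    public static async Task AdvanceAfterPushAsync(
        string peerNodeId,
        IReadOnlyList<OplogEntry> acknowledgedBatch,
        IPeerOplogConfirmationStore confirmations,
        CancellationToken ct)
    {
        foreach (var group in acknowledgedBatch.GroupBy(e => e.SourceNodeId))
        {
            var newest = group
                .OrderByDescending(e => e.Timestamp.Wall)
                .ThenByDescending(e => e.Timestamp.Logic)
                .First();

            await confirmations.UpdateConfirmationAsync(
                peerNodeId, group.Key, newest.Timestamp, newest.Hash, ct);
        }
    }
}
```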
### 7. Gate oplog pruning by peer confirmation

Replace the retention-only prune trigger with a safe cutoff computation:

- compute the retention cutoff from `OplogRetentionHours` (existing behavior)
- compute the confirmation cutoff as the minimum confirmed point across active tracked peers
- effective cutoff = minimum(retention cutoff, confirmation cutoff)
- prune only up to the effective cutoff

If any active tracked peer has no confirmation for the relevant source nodes, do not prune those ranges (see the sketch below).
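
A minimal sketch of that calculation per source node, assuming HLC points are compared wall-first then logic, and that an active peer with no confirmation for the source blocks pruning outright; the calculator class name and the no-tracked-peers fallback are assumptions:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PruneCutoffCalculator
{
    // Effective cutoff = min(retention cutoff, confirmation cutoff across active tracked peers).
    // Returns null when pruning for this source node must be skipped entirely.
    public static HlcTimestamp? ComputeEffectiveCutoff(
        string sourceNodeId,
        HlcTimestamp retentionCutoff,                        // derived from OplogRetentionHours
        IReadOnlyList<PeerOplogConfirmation> confirmations)  // persisted confirmation records
    {
        var active = confirmations.Where(c => c.IsActive).ToList();
        if (active.Count == 0)
            return retentionCutoff; // assumption: with no tracked peers, keep retention-only behavior

        HlcTimestamp? confirmationCutoff = null;
        foreach (var peerId in active.Select(c => c.PeerNodeId).Distinct())
        {
            var confirmed = active.FirstOrDefault(
                c => c.PeerNodeId == peerId && c.SourceNodeId == sourceNodeId);
            if (confirmed is null)
                return null; // an active peer has confirmed nothing for this source: do not prune

            var point = new HlcTimestamp(confirmed.ConfirmedWall, confirmed.ConfirmedLogic);
            if (confirmationCutoff is null || Compare(point, confirmationCutoff.Value) < 0)
                confirmationCutoff = point; // keep the minimum confirmed point across peers
        }

        return Compare(retentionCutoff, confirmationCutoff.Value) <= 0
            ? retentionCutoff
            : confirmationCutoff;
    }

    private static int Compare(HlcTimestamp a, HlcTimestamp b)
        => a.Wall != b.Wall ? a.Wall.CompareTo(b.Wall) : a.Logic.CompareTo(b.Logic);
}
```

The orchestrator's maintenance path would then pass the effective cutoff to `IOplogStore.PruneOplogAsync(...)` instead of the raw retention cutoff.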
### 8. Deprecated peer removal path

Provide an explicit management operation to unblock pruning for decommissioned peers (a sketch follows the list):

- add a method to the management service (example: `RemovePeerTrackingAsync(nodeId, removeRemoteConfig = true)`)
- remove the peer from the confirmation tracking store
- optionally remove its static peer configuration
- document the operator workflow for node deprecation
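
A minimal sketch of the operation, assuming it is added next to the existing `PeerManagementService`; the class name below is a placeholder, and the config-removal step is left as a comment because the exact `IPeerConfigurationStore` call is not specified in this plan:

```csharp
using System.Threading.Tasks;

// Illustrative shape of the management-side operation; the real method would live on the
// existing PeerManagementService (the class name here is a placeholder).
public sealed class PeerDeprecationSketch
{
    private readonly IPeerOplogConfirmationStore _confirmations;

    public PeerDeprecationSketch(IPeerOplogConfirmationStore confirmations)
        => _confirmations = confirmations;

    // Removing tracking is what unblocks pruning for a decommissioned node.
    public async Task RemovePeerTrackingAsync(string nodeId, bool removeRemoteConfig = true)
    {
        await _confirmations.RemovePeerTrackingAsync(nodeId);

        if (removeRemoteConfig)
        {
            // Also drop the static peer configuration via the existing IPeerConfigurationStore /
            // PeerManagementService plumbing (the exact call is not specified in this plan).
        }
    }
}
```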
## Hosting Health Check Plan

### 9. Extend hosting health check payload

Update the Hosting `CBDDCHealthCheck` to include peer sync status data (a sketch of the payload shape follows the list):

- tracked peer count
- peers with no confirmation
- max lag (ms) between the local head and peer confirmation
- lagging peer list (node IDs)
- last successful confirmation update per peer
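
A minimal sketch of a payload shape that could back this data, assuming it is surfaced through the health check's data dictionary; the type and property names are illustrative:

```csharp
using System;
using System.Collections.Generic;

// Illustrative payload; the real health check would populate this from the confirmation store.
public sealed class PeerSyncHealthReport
{
    public int TrackedPeerCount { get; init; }

    // Peers that have never confirmed anything (these block pruning entirely).
    public IReadOnlyList<string> PeersWithNoConfirmation { get; init; } = Array.Empty<string>();

    // Worst-case gap between the local oplog head and a peer's confirmed watermark.
    public double MaxLagMilliseconds { get; init; }

    public IReadOnlyList<string> LaggingPeers { get; init; } = Array.Empty<string>();

    // Last time each peer's watermark advanced.
    public IReadOnlyDictionary<string, DateTime> LastConfirmationUtcByPeer { get; init; }
        = new Dictionary<string, DateTime>();
}
```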
### 10. Health status policy

- `Healthy`: persistence OK and all active tracked peers within the lag threshold
- `Degraded`: persistence OK but one or more peers lagging or unconfirmed
- `Unhealthy`: persistence unavailable, or a critical lag breach (configurable)

Add configurable thresholds to the hosting options/cluster options (a classification sketch follows).
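
A minimal classification sketch over the report above, assuming the thresholds come from new hosting/cluster options and that the existing `CBDDCHealthCheck` would call this after its persistence check:

```csharp
using Microsoft.Extensions.Diagnostics.HealthChecks;

public static class PeerSyncHealthPolicy
{
    // Maps persistence state plus peer sync lag onto a health status.
    // Threshold parameters stand in for assumed hosting/cluster options.
    public static HealthCheckResult Classify(
        bool persistenceOk,
        PeerSyncHealthReport report,
        double degradedLagMs,
        double unhealthyLagMs)
    {
        if (!persistenceOk)
            return HealthCheckResult.Unhealthy("Oplog persistence unavailable.");

        if (report.MaxLagMilliseconds >= unhealthyLagMs)
            return HealthCheckResult.Unhealthy("Peer sync lag exceeds the critical threshold.");

        if (report.PeersWithNoConfirmation.Count > 0 || report.MaxLagMilliseconds >= degradedLagMs)
            return HealthCheckResult.Degraded("One or more peers are lagging or unconfirmed.");

        return HealthCheckResult.Healthy("All active tracked peers are within the lag threshold.");
    }
}
```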
## Implementation Phases

### Phase 1: Persistence and contracts

- Add model + store interface + BLite implementation + DI wiring.
- Add tests for CRUD, idempotent register, and explicit remove.

### Phase 2: Sync integration

- Register peers from discovery.
- Update confirmations from vector clock + push success.
- Add sync tests validating watermark advancement.

### Phase 3: Safe pruning

- Implement cutoff calculator service.
- Integrate with orchestrator maintenance path.
- Add two-node tests proving no prune before peer confirmation.

### Phase 4: Management and health

- Expose remove-tracking operation in peer management API.
- Extend hosting healthcheck output and status policy.
- Add hosting healthcheck tests for Healthy/Degraded/Unhealthy.

### Phase 5: Docs and rollout

- Update docs for peer lifecycle and pruning semantics.
- Add upgrade notes and operational runbook for peer deprecation.
## Safe Subagent Usage

- Use subagents only for isolated, low-coupling tasks with clear file ownership boundaries.
- Assign each subagent a narrow scope (one component or one test suite at a time).
- Require explicit task contracts for each subagent, including input files/components, expected output, and forbidden operations.
- Prohibit destructive repository actions by subagents (`reset --hard`, force-push, history rewrite, broad file deletion).
- Require subagents to report what changed, why, and which tests were run.
- Do not accept subagent-authored changes directly into final output without primary-agent review.

## Mandatory Verification After Subagent Work

- Enforce a verification gate after every subagent-delivered change before integration.
- Verification gate checklist:
  - independently review the produced diff
  - run targeted unit/integration tests for touched behavior
  - validate impacted acceptance criteria
  - confirm no regressions in pruning safety, peer lifecycle handling, or healthcheck output
- Reject or rework any subagent output that fails verification.
- Only merge/integrate subagent output after verification evidence is documented.
## Test Plan (Minimum)

- Unit:
  - confirmation store upsert/get/remove behavior
  - auto-register is idempotent
  - safe cutoff computation with mixed peer states
  - removing a peer from tracking immediately changes cutoff eligibility
  - healthcheck classification rules

- Integration (two-node focus):
  - Node B offline: Node A does not prune the range that still requires confirmation
  - Node B catches up: Node A prunes once confirmed
  - A newly joined node auto-registers without a manual call
  - Deprecated node removal unblocks pruning
## Risks and Mitigations

- Risk: indefinite oplog growth if a peer never confirms.
  - Mitigation: explicit removal workflow and degraded health visibility.

- Risk: confirmation drift after restore/restart.
  - Mitigation: snapshot persistence of confirmation records.

- Risk: mixed-version cluster behavior.
  - Mitigation: rely on existing vector-clock exchange first; keep protocol additions backward compatible if introduced later.
## Acceptance Criteria

- Oplog entries are not pruned while any active tracked peer has not confirmed the required ranges.
- Newly discovered peers are automatically present in tracking without operator action.
- Operators can explicitly remove a deprecated peer from tracking, and pruning resumes accordingly.
- The hosting health endpoint exposes peer sync lag/confirmation status and returns degraded/unhealthy when appropriate.