docs: align internal docs to enterprise standards

Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
2026-02-20 13:23:55 -05:00
parent e6d81f6350
commit ce727eb30d
18 changed files with 783 additions and 186 deletions
--- a/docs/features/README.md
+++ b/docs/features/README.md
@@ -0,0 +1,16 @@
+# Major Feature Inventory
+
+This index tracks the major features of CBDDC. Each feature has one canonical document.
+
+## Features
+
+- [Selective Collection Sync](selective-collection-sync.md)
+- [Peer-to-Peer Gossip Sync](peer-to-peer-gossip-sync.md)
+- [Secure Peer Transport](secure-peer-transport.md)
+- [Peer-Confirmed Pruning](peer-confirmed-pruning.md)
+
+## Maintenance Rules
+
+- Keep one file per major feature.
+- Update feature docs when APIs, operations, or security controls change.
+- Cross-link each feature to [Runbook](../runbook.md) and [Security](../security.md).
--- a/docs/features/peer-confirmed-pruning.md
+++ b/docs/features/peer-confirmed-pruning.md
@@ -0,0 +1,68 @@
+# Feature: Peer-Confirmed Pruning
+
+## Purpose and Business Outcome
+
+Prune oplog history safely while preventing data loss for active peers that have not confirmed required streams.
+
+## Scope and Non-Goals
+
+Scope:
+
+- Retention cutoff plus peer confirmation cutoff logic.
+- Operational controls for peer tracking and de-tracking.
+
+Non-goals:
+
+- Automatic removal of retired peers without operator action.
+- Replacement for backup/restore strategy.
+
+## User and System Workflows
+
+1. Maintenance job calculates retention and confirmation cutoffs.
+2. System blocks pruning when required confirmations are missing.
+3. Operator de-tracks retired peers when appropriate.
+4. Pruning resumes once constraints are satisfied.
+
+## Interfaces, APIs, and Events Involved
+
+- Maintenance pruning scheduler
+- Peer tracking operations (`RemovePeerTrackingAsync`, `RemoveRemotePeerAsync`)
+- Health metrics for lag and missing confirmations
+
+## Permissions and Data Handling
+
+- Only approved operators should modify peer tracking state.
+- Pruning decisions must be auditable through logs and incident notes.
+
+## Dependencies and Failure Modes
+
+Dependencies:
+
+- Accurate peer tracking metadata
+- Timely confirmation updates per source stream
+
+Failure modes:
+
+- Pruning blocked indefinitely by stale peer tracking
+- Unsafe pruning if controls are bypassed
+
+## Monitoring, Alerts, and Troubleshooting Pointers
+
+- Monitor peers with missing confirmations and sustained lag.
+- Use [Peer Deprecation and Removal Runbook](../peer-deprecation-removal-runbook.md) for operational actions.
+
+## Rollout and Change Considerations
+
+- Introduce pruning policy changes with explicit maintenance windows.
+- Validate expected cutoff behavior in staging before production rollout.
+
+## Validation and Testability Guidance
+
+- Add tests for blocked prune when confirmations are missing.
+- Add tests for resumed prune after de-tracking retired peers.
+- Smoke test health status transitions around peer lifecycle changes.
+
+## Related Security Controls
+
+- [Security](../security.md)
+- [Runbook](../runbook.md)
--- a/docs/features/peer-to-peer-gossip-sync.md
+++ b/docs/features/peer-to-peer-gossip-sync.md
@@ -0,0 +1,70 @@
+# Feature: Peer-to-Peer Gossip Sync
+
+## Purpose and Business Outcome
+
+Propagate updates across mesh nodes without a central coordinator so local-first applications remain resilient to intermittent connectivity.
+
+## Scope and Non-Goals
+
+Scope:
+
+- Peer discovery and sync orchestration.
+- Push/pull propagation of oplog changes.
+
+Non-goals:
+
+- Strong global consistency guarantees.
+- Public internet exposure without additional network controls.
+
+## User and System Workflows
+
+1. Node starts and discovers peers.
+2. Node exchanges sync metadata with connected peers.
+3. Missing operations are requested and applied.
+4. Mesh converges over repeated gossip rounds.
+
+## Interfaces, APIs, and Events Involved
+
+- Sync orchestrator scheduling
+- TCP peer sync channels
+- Vector clock exchange and reconciliation
+
+## Permissions and Data Handling
+
+- Peers with a valid authentication token can exchange replicated collection data.
+- Cluster membership should be restricted to trusted nodes.
+
+## Dependencies and Failure Modes
+
+Dependencies:
+
+- Network reachability
+- Shared authentication material
+- Healthy persistence layer
+
+Failure modes:
+
+- Peer isolation due to network outage
+- Token mismatch blocking synchronization
+- Sustained lag under high write pressure
+
+## Monitoring, Alerts, and Troubleshooting Pointers
+
+- Monitor `laggingPeers`, `maxLagMs`, and active peer counts.
+- Follow [Runbook](../runbook.md) playbooks for lagging or disconnected peers.
+
+## Rollout and Change Considerations
+
+- Roll out protocol-affecting changes in a controlled window.
+- Confirm backward and forward compatibility in a staging mesh before production rollout.
+
+## Validation and Testability Guidance
+
+- Run multi-node integration tests with controlled partitions.
+- Validate eventual convergence after reconnect.
+- Verify no data loss under repeated reconnect scenarios.
+
+## Related Security Controls
+
+- [Security](../security.md)
+- [Access and Permissions](../access.md)
--- a/docs/features/secure-peer-transport.md
+++ b/docs/features/secure-peer-transport.md
@@ -0,0 +1,67 @@
+# Feature: Secure Peer Transport
+
+## Purpose and Business Outcome
+
+Protect replicated data in transit with authenticated and encrypted peer communication.
+
+## Scope and Non-Goals
+
+Scope:
+
+- Secure handshake and key establishment.
+- Message confidentiality and integrity controls.
+
+Non-goals:
+
+- Data-at-rest encryption.
+- Full identity and certificate lifecycle management.
+
+## User and System Workflows
+
+1. Operator enables secure transport components.
+2. Peers perform handshake and establish session keys.
+3. Replication traffic is encrypted and authenticated.
+4. Health and logs expose secure mode status.
+
+## Interfaces, APIs, and Events Involved
+
+- `IPeerHandshakeService` / secure handshake implementation
+- Network pipeline message encryption and HMAC validation
+- Startup configuration for secure mode
+
+## Permissions and Data Handling
+
+- Secret material (`AuthToken`, key inputs) must be restricted to authorized operators.
+- Logs must avoid plaintext secret disclosure.
+
+## Dependencies and Failure Modes
+
+Dependencies:
+
+- Consistent security mode across peers
+- Valid runtime cryptographic dependencies
+
+Failure modes:
+
+- Secure/plaintext mode mismatch
+- Handshake failure due to key/token mismatch
+
+## Monitoring, Alerts, and Troubleshooting Pointers
+
+- Alert on repeated handshake failures.
+- Use [Runbook](../runbook.md) for incident triage and [Troubleshooting](../troubleshooting.md) for remediation.
+
+## Rollout and Change Considerations
+
+- Enable secure mode in staging first.
+- Roll production nodes in controlled order to avoid mixed-mode partitions.
+
+## Validation and Testability Guidance
+
+- Add tests for secure-to-secure success and mixed-mode rejection.
+- Validate encrypted cluster startup and sync with production-like load.
+
+## Related Security Controls
+
+- [Security](../security.md)
+- [Access and Permissions](../access.md)
--- a/docs/features/selective-collection-sync.md
+++ b/docs/features/selective-collection-sync.md
@@ -0,0 +1,69 @@
+# Feature: Selective Collection Sync
+
+## Purpose and Business Outcome
+
+Allow teams to replicate only selected collections so bandwidth and operational overhead stay aligned to business-critical data.
+
+## Scope and Non-Goals
+
+Scope:
+
+- Register collections for replication using `WatchCollection()`.
+- Replicate changes for registered collections across peers.
+
+Non-goals:
+
+- Automatic replication of all database collections.
+- Schema migration management.
+
+## User and System Workflows
+
+1. Developer registers target collections in the document store.
+2. Local writes trigger CDC events.
+3. Oplog entries propagate through peer sync.
+4. Remote peers apply updates for matching collections.
+
+## Interfaces, APIs, and Events Involved
+
+- `WatchCollection(collectionName, collection, keySelector)`
+- CDC trigger pipeline
+- Oplog append and apply operations
+
+## Permissions and Data Handling
+
+- Access to source collections is controlled by host application permissions.
+- Only approved collections should be registered for sync in sensitive environments.
+
+## Dependencies and Failure Modes
+
+Dependencies:
+
+- Correct collection registration
+- Stable peer connectivity
+- Persistence availability
+
+Failure modes:
+
+- Missed replication due to unregistered collection
+- Delayed propagation during network partition
+
+## Monitoring, Alerts, and Troubleshooting Pointers
+
+- Monitor replication lag and peer confirmation metrics.
+- Use [Runbook](../runbook.md) and [Troubleshooting](../troubleshooting.md) for incident response.
+
+## Rollout and Change Considerations
+
+- Introduce new synced collections behind a staged rollout.
+- Validate downstream consumer compatibility before production enablement.
+
+## Validation and Testability Guidance
+
+- Add integration tests verifying only registered collections replicate.
+- Smoke test by writing to registered and non-registered collections and confirming expected behavior.
+- Validate that no unexpected collection appears on remote peers after deployment.
+
+## Related Security Controls
+
+- [Security](../security.md)
+- [Access and Permissions](../access.md)