docs: align internal docs to enterprise standards

Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
Joseph Doherty
2026-02-20 13:23:55 -05:00
parent e6d81f6350
commit ce727eb30d
18 changed files with 783 additions and 186 deletions

docs/features/README.md

@@ -0,0 +1,16 @@
# Major Feature Inventory
This index tracks CBDDC major functionality. Each feature has one canonical document.
## Features
- [Selective Collection Sync](selective-collection-sync.md)
- [Peer-to-Peer Gossip Sync](peer-to-peer-gossip-sync.md)
- [Secure Peer Transport](secure-peer-transport.md)
- [Peer-Confirmed Pruning](peer-confirmed-pruning.md)
## Maintenance Rules
- Keep one file per major feature.
- Update feature docs when APIs, operations, or security controls change.
- Cross-link each feature to [Runbook](../runbook.md) and [Security](../security.md).


@@ -0,0 +1,68 @@
# Feature: Peer-Confirmed Pruning
## Purpose and Business Outcome
Prune oplog history safely while preventing data loss for active peers that have not confirmed required streams.
## Scope and Non-Goals
Scope:
- Retention cutoff plus peer confirmation cutoff logic.
- Operational controls for peer tracking and de-tracking.
Non-goals:
- Automatic removal of retired peers without operator action.
- Replacement for backup/restore strategy.
## User and System Workflows
1. Maintenance job calculates retention and confirmation cutoffs.
2. System blocks pruning when required confirmations are missing.
3. Operator de-tracks retired peers when appropriate.
4. Pruning resumes once constraints are satisfied.
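The workflow above can be sketched as a small decision function. This is a minimal illustration, not the CBDDC implementation: the names `PruneDecision` and `decide_prune` are hypothetical, timestamps are plain integers, and each tracked peer is reduced to a single confirmation value, whereas the real system tracks confirmations per source stream.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PruneDecision:
    allowed: bool
    cutoff: Optional[int]       # prune entries strictly older than this, if allowed
    blocking_peers: List[str]   # tracked peers with no confirmation yet


def decide_prune(retention_cutoff: int,
                 confirmations: Dict[str, Optional[int]]) -> PruneDecision:
    """Combine the retention cutoff with the peer confirmation cutoff.

    `confirmations` maps each tracked peer id to its latest confirmed
    timestamp, or None when the peer has not confirmed anything (assumed shape).
    """
    missing = sorted(peer for peer, ts in confirmations.items() if ts is None)
    if missing:
        # Step 2: block pruning while required confirmations are missing.
        return PruneDecision(False, None, missing)
    # The slowest confirmed peer bounds how far history may be pruned.
    confirmation_cutoff = min(confirmations.values(), default=retention_cutoff)
    return PruneDecision(True, min(retention_cutoff, confirmation_cutoff), [])
```

In this sketch, de-tracking a retired peer (step 3) corresponds to removing its key from `confirmations`, after which the next call can return an allowed decision.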
## Interfaces, APIs, and Events Involved
- Maintenance pruning scheduler
- Peer tracking operations (`RemovePeerTrackingAsync`, `RemoveRemotePeerAsync`)
- Health metrics for lag and missing confirmations
## Permissions and Data Handling
- Only approved operators should modify peer tracking state.
- Pruning decisions must be auditable through logs and incident notes.
## Dependencies and Failure Modes
Dependencies:
- Accurate peer tracking metadata
- Timely confirmation updates per source stream
Failure modes:
- Pruning blocked indefinitely by stale peer tracking
- Unsafe pruning if controls are bypassed
## Monitoring, Alerts, and Troubleshooting Pointers
- Monitor peers with missing confirmations and sustained lag.
- Use [Peer Deprecation and Removal Runbook](../peer-deprecation-removal-runbook.md) for operational actions.
## Rollout and Change Considerations
- Introduce pruning policy changes with explicit maintenance windows.
- Validate expected cutoff behavior in staging before production rollout.
## Validation and Testability Guidance
- Add tests for blocked prune when confirmations are missing.
- Add tests for resumed prune after de-tracking retired peers.
- Smoke test health status transitions around peer lifecycle changes.
## Related Security Controls
- [Security](../security.md)
- [Runbook](../runbook.md)


@@ -0,0 +1,70 @@
# Feature: Peer-to-Peer Gossip Sync
## Purpose and Business Outcome
Propagate updates across mesh nodes without a central coordinator so local-first applications remain resilient to intermittent connectivity.
## Scope and Non-Goals
Scope:
- Peer discovery and sync orchestration.
- Push/pull propagation of oplog changes.
Non-goals:
- Strong global consistency guarantees.
- Public internet exposure without additional network controls.
## User and System Workflows
1. Node starts and discovers peers.
2. Node exchanges sync metadata with connected peers.
3. Missing operations are requested and applied.
4. Mesh converges over repeated gossip rounds.
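Steps 2 and 3 can be illustrated with a simplified vector clock comparison. This is a sketch under stated assumptions: clocks are modeled as a map from node id to an integer sequence number, and `missing_ranges` and `merge` are hypothetical helper names, not CBDDC APIs.

```python
from typing import Dict, Tuple

VectorClock = Dict[str, int]  # node id -> highest sequence number seen (assumed shape)


def missing_ranges(local: VectorClock, remote: VectorClock) -> Dict[str, Tuple[int, int]]:
    """Return, per node, the (after, up_to) sequence range the local peer lacks."""
    return {
        node: (local.get(node, 0), seq)
        for node, seq in remote.items()
        if seq > local.get(node, 0)
    }


def merge(local: VectorClock, remote: VectorClock) -> VectorClock:
    """Element-wise maximum, taken once the missing operations are applied."""
    return {
        node: max(local.get(node, 0), remote.get(node, 0))
        for node in set(local) | set(remote)
    }
```

After a successful round, `missing_ranges(merge(local, remote), remote)` is empty, which is the per-pair condition that repeated gossip rounds drive the whole mesh toward.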
## Interfaces, APIs, and Events Involved
- Sync orchestrator scheduling
- TCP peer sync channels
- Vector clock exchange and reconciliation
## Permissions and Data Handling
- Peers with a valid authentication token can exchange replicated collection data.
- Cluster membership should be restricted to trusted nodes.
## Dependencies and Failure Modes
Dependencies:
- Network reachability
- Shared authentication material
- Healthy persistence layer
Failure modes:
- Peer isolation due to network outage
- Token mismatch blocking synchronization
- Sustained lag under high write pressure
## Monitoring, Alerts, and Troubleshooting Pointers
- Monitor `laggingPeers`, `maxLagMs`, and active peer counts.
- Follow [Runbook](../runbook.md) playbooks for lagging or disconnected peers.
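As a rough illustration of how the `laggingPeers` and `maxLagMs` metrics could be derived, the sketch below computes them from per-peer confirmation timestamps. The function name `lag_summary`, the input shape, the 30-second threshold, and the choice to report `laggingPeers` as a list of peer ids are all assumptions made for this example. Only the two metric names come from this document.

```python
from typing import Dict, List, Union


def lag_summary(now_ms: int,
                last_confirmed_ms: Dict[str, int],
                threshold_ms: int = 30_000) -> Dict[str, Union[List[str], int]]:
    """Summarize replication lag from each peer's last confirmation time."""
    # Clamp at zero so clock skew never produces a negative lag.
    lags = {peer: max(0, now_ms - ts) for peer, ts in last_confirmed_ms.items()}
    return {
        "laggingPeers": sorted(p for p, lag in lags.items() if lag > threshold_ms),
        "maxLagMs": max(lags.values(), default=0),
    }
```

An alert rule would typically fire only when a peer stays in `laggingPeers` across several consecutive samples, so that short write bursts do not page an operator.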
## Rollout and Change Considerations
- Roll out protocol-affecting changes in a controlled window.
- Confirm backward/forward compatibility in staging mesh before production rollout.
## Validation and Testability Guidance
- Run multi-node integration tests with controlled partitions.
- Validate eventual convergence after reconnect.
- Verify that no data is lost under repeated reconnect scenarios.
## Related Security Controls
- [Security](../security.md)
- [Access and Permissions](../access.md)


@@ -0,0 +1,67 @@
# Feature: Secure Peer Transport
## Purpose and Business Outcome
Protect replicated data in transit with authenticated and encrypted peer communication.
## Scope and Non-Goals
Scope:
- Secure handshake and key establishment.
- Message confidentiality and integrity controls.
Non-goals:
- Data-at-rest encryption.
- Full identity and certificate lifecycle management.
## User and System Workflows
1. Operator enables secure transport components.
2. Peers perform handshake and establish session keys.
3. Replication traffic is encrypted/authenticated.
4. Health and logs expose secure mode status.
## Interfaces, APIs, and Events Involved
- `IPeerHandshakeService` / secure handshake implementation
- Network pipeline message encryption and HMAC validation
- Startup configuration for secure mode
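The HMAC validation step can be sketched with the Python standard library. This shows message authentication only. The hash algorithm (SHA-256), the way the key is obtained, and the function names `sign` and `verify` are assumptions for illustration, and encryption and the handshake itself are omitted.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    """Compute an authentication tag for an outgoing message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check an incoming tag in constant time to avoid timing side channels."""
    return hmac.compare_digest(sign(key, message), tag)
```

A receiver would drop any message for which `verify` returns `False`, which is also how a key or token mismatch between peers would surface at this layer.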
## Permissions and Data Handling
- Secret material (`AuthToken`, key inputs) must be restricted to authorized operators.
- Logs must avoid plaintext secret disclosure.
## Dependencies and Failure Modes
Dependencies:
- Consistent security mode across peers
- Valid runtime cryptographic dependencies
Failure modes:
- Secure/plaintext mode mismatch
- Handshake failure due to key/token mismatch
## Monitoring, Alerts, and Troubleshooting Pointers
- Alert on repeated handshake failures.
- Use [Runbook](../runbook.md) for incident triage and [Troubleshooting](../troubleshooting.md) for remediation.
## Rollout and Change Considerations
- Enable secure mode in staging first.
- Roll production nodes in controlled order to avoid mixed-mode partitions.
## Validation and Testability Guidance
- Add tests for secure-to-secure success and mixed-mode rejection.
- Validate encrypted cluster startup and sync with production-like load.
## Related Security Controls
- [Security](../security.md)
- [Access and Permissions](../access.md)


@@ -0,0 +1,69 @@
# Feature: Selective Collection Sync
## Purpose and Business Outcome
Allow teams to replicate only selected collections so bandwidth and operational overhead stay aligned to business-critical data.
## Scope and Non-Goals
Scope:
- Register collections for replication using `WatchCollection()`.
- Replicate changes for registered collections across peers.
Non-goals:
- Automatic replication of all database collections.
- Schema migration management.
## User and System Workflows
1. Developer registers target collections in the document store.
2. Local writes trigger CDC events.
3. Oplog entries propagate through peer sync.
4. Remote peers apply updates for matching collections.
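Step 4 amounts to filtering incoming oplog entries by the set of registered collections. The sketch below is a language-neutral illustration in Python: `OplogEntry` and `entries_to_apply` are hypothetical names, and the entry fields are an assumed shape rather than the CBDDC oplog format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class OplogEntry:
    collection: str  # name of the collection the change belongs to
    key: str         # document key within that collection
    payload: dict    # change body (assumed shape)


def entries_to_apply(entries: Iterable[OplogEntry],
                     registered: Set[str]) -> List[OplogEntry]:
    """Keep only entries whose collection was registered for sync."""
    return [entry for entry in entries if entry.collection in registered]
```

For example, with `registered = {"orders"}`, an incoming entry for an `audit` collection is skipped, which matches the "missed replication due to unregistered collection" failure mode when the omission is unintended.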
## Interfaces, APIs, and Events Involved
- `WatchCollection(collectionName, collection, keySelector)`
- CDC trigger pipeline
- Oplog append and apply operations
## Permissions and Data Handling
- Access to source collections is controlled by host application permissions.
- Only approved collections should be registered for sync in sensitive environments.
## Dependencies and Failure Modes
Dependencies:
- Correct collection registration
- Stable peer connectivity
- Persistence availability
Failure modes:
- Missed replication due to unregistered collection
- Delayed propagation during network partition
## Monitoring, Alerts, and Troubleshooting Pointers
- Monitor replication lag and peer confirmation metrics.
- Use [Runbook](../runbook.md) and [Troubleshooting](../troubleshooting.md) for incident response.
## Rollout and Change Considerations
- Introduce new synced collections behind a staged rollout.
- Validate downstream consumer compatibility before production enablement.
## Validation and Testability Guidance
- Add integration tests verifying only registered collections replicate.
- Smoke test by writing to registered and non-registered collections and confirming expected behavior.
- Validate no unexpected collection appears in remote peers after deployment.
## Related Security Controls
- [Security](../security.md)
- [Access and Permissions](../access.md)