docs: align internal docs to enterprise standards
All checks were successful
CI / verify (push) Successful in 2m33s
Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
16
docs/features/README.md
Normal file
@@ -0,0 +1,16 @@
# Major Feature Inventory

This index tracks CBDDC major functionality. Each feature has one canonical document.

## Features

- [Selective Collection Sync](selective-collection-sync.md)
- [Peer-to-Peer Gossip Sync](peer-to-peer-gossip-sync.md)
- [Secure Peer Transport](secure-peer-transport.md)
- [Peer-Confirmed Pruning](peer-confirmed-pruning.md)

## Maintenance Rules

- Keep one file per major feature.
- Update feature docs when APIs, operations, or security controls change.
- Cross-link each feature to [Runbook](../runbook.md) and [Security](../security.md).
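
The cross-linking rule lends itself to a mechanical check. The following is a minimal sketch, assuming each feature doc is available as a Markdown string; `REQUIRED_LINKS` and `missing_cross_links` are hypothetical names and not part of any existing CBDDC tooling.

```python
# Hypothetical helper for illustration; not part of the CBDDC tooling.
REQUIRED_LINKS = ("../runbook.md", "../security.md")


def missing_cross_links(markdown_text: str) -> list[str]:
    """Return the required link targets that a feature doc does not link to."""
    return [
        target
        for target in REQUIRED_LINKS
        if f"]({target})" not in markdown_text
    ]


doc = "- [Security](../security.md)\n- [Access and Permissions](../access.md)\n"
print(missing_cross_links(doc))  # ['../runbook.md']
```

A check like this could run in CI against every file under `docs/features/`, failing the build when the returned list is non-empty.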
68
docs/features/peer-confirmed-pruning.md
Normal file
@@ -0,0 +1,68 @@
# Feature: Peer-Confirmed Pruning

## Purpose and Business Outcome

Prune oplog history safely while preventing data loss for active peers that have not confirmed required streams.

## Scope and Non-Goals

Scope:

- Retention cutoff plus peer confirmation cutoff logic.
- Operational controls for peer tracking and de-tracking.

Non-goals:

- Automatic removal of retired peers without operator action.
- Replacement for backup/restore strategy.

## User and System Workflows

1. Maintenance job calculates retention and confirmation cutoffs.
2. System blocks pruning when required confirmations are missing.
3. Operator de-tracks retired peers when appropriate.
4. Pruning resumes once constraints are satisfied.
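
The workflow above can be sketched as a cutoff calculation. This is an illustrative sketch only: it assumes timestamps are integer milliseconds and collapses per-stream confirmations into a single value per peer. The function name and data shape are hypothetical, and the actual CBDDC pruning logic may differ.

```python
from typing import Optional


def effective_prune_cutoff(
    retention_cutoff_ms: int,
    confirmed_up_to_ms: dict[str, Optional[int]],
) -> Optional[int]:
    """Return the timestamp below which oplog entries may be pruned.

    confirmed_up_to_ms maps each tracked peer to the latest timestamp it has
    confirmed, or None if it has confirmed nothing yet (hypothetical shape).
    Returns None when pruning must be blocked.
    """
    confirmations = list(confirmed_up_to_ms.values())
    if any(ts is None for ts in confirmations):
        return None  # Step 2: a required confirmation is missing.
    # Never prune past what the slowest tracked peer has confirmed.
    confirmation_cutoff = min(confirmations, default=retention_cutoff_ms)
    return min(retention_cutoff_ms, confirmation_cutoff)


tracked = {"peer-a": 900, "peer-b": None}
print(effective_prune_cutoff(1_000, tracked))  # None (blocked)

del tracked["peer-b"]  # Step 3: operator de-tracks the retired peer.
print(effective_prune_cutoff(1_000, tracked))  # 900 (pruning resumes)
```

Taking the minimum of the two cutoffs keeps pruning conservative: retention alone never removes entries that a tracked peer still needs.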

## Interfaces, APIs, and Events Involved

- Maintenance pruning scheduler
- Peer tracking operations (`RemovePeerTrackingAsync`, `RemoveRemotePeerAsync`)
- Health metrics for lag and missing confirmations

## Permissions and Data Handling

- Only approved operators should modify peer tracking state.
- Pruning decisions must be auditable through logs and incident notes.

## Dependencies and Failure Modes

Dependencies:

- Accurate peer tracking metadata
- Timely confirmation updates per source stream

Failure modes:

- Pruning blocked indefinitely by stale peer tracking
- Unsafe pruning if controls are bypassed
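
The first failure mode can be surfaced early by flagging tracked peers whose last confirmation is older than a threshold. The sketch below assumes one last-confirmation timestamp per peer in milliseconds; the function name and the threshold are hypothetical. Consistent with the non-goals, flagged peers are candidates for operator review, not automatic removal.

```python
def stale_tracked_peers(
    last_confirmation_ms: dict[str, int],
    now_ms: int,
    stale_after_ms: int,
) -> list[str]:
    """Return tracked peers whose last confirmation is older than stale_after_ms."""
    return sorted(
        peer
        for peer, confirmed_at in last_confirmation_ms.items()
        if now_ms - confirmed_at > stale_after_ms
    )


peers = {"peer-a": 9_500, "peer-b": 1_000}
print(stale_tracked_peers(peers, now_ms=10_000, stale_after_ms=5_000))  # ['peer-b']
```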

## Monitoring, Alerts, and Troubleshooting Pointers

- Monitor peers with missing confirmations and sustained lag.
- Use [Peer Deprecation and Removal Runbook](../peer-deprecation-removal-runbook.md) for operational actions.

## Rollout and Change Considerations

- Introduce pruning policy changes with explicit maintenance windows.
- Validate expected cutoff behavior in staging before production rollout.

## Validation and Testability Guidance

- Add tests for blocked prune when confirmations are missing.
- Add tests for resumed prune after de-tracking retired peers.
- Smoke test health status transitions around peer lifecycle changes.

## Related Security Controls

- [Security](../security.md)
- [Runbook](../runbook.md)
70
docs/features/peer-to-peer-gossip-sync.md
Normal file
@@ -0,0 +1,70 @@
# Feature: Peer-to-Peer Gossip Sync

## Purpose and Business Outcome

Propagate updates across mesh nodes without a central coordinator so local-first applications remain resilient to intermittent connectivity.

## Scope and Non-Goals

Scope:

- Peer discovery and sync orchestration.
- Push/pull propagation of oplog changes.

Non-goals:

- Strong global consistency guarantees.
- Public internet exposure without additional network controls.

## User and System Workflows

1. Node starts and discovers peers.
2. Node exchanges sync metadata with connected peers.
3. Missing operations are requested and applied.
4. Mesh converges over repeated gossip rounds.
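
Steps 2 and 3 can be illustrated with a vector clock comparison. This is a minimal sketch, assuming a vector clock maps each node ID to the highest sequence number seen from that node; the actual CBDDC clock representation and wire format are not described in this document, and all names below are hypothetical.

```python
VectorClock = dict[str, int]


def missing_ranges(local: VectorClock, remote: VectorClock) -> dict[str, tuple[int, int]]:
    """For each node, return the (first, last) sequence numbers the local side lacks."""
    ranges = {}
    for node, remote_seq in remote.items():
        local_seq = local.get(node, 0)
        if remote_seq > local_seq:
            ranges[node] = (local_seq + 1, remote_seq)
    return ranges


def merge(local: VectorClock, remote: VectorClock) -> VectorClock:
    """Element-wise maximum: the clock once all missing operations are applied."""
    return {
        node: max(local.get(node, 0), remote.get(node, 0))
        for node in local.keys() | remote.keys()
    }


local = {"a": 5, "b": 2}
remote = {"a": 3, "b": 4, "c": 1}
print(missing_ranges(local, remote))  # {'b': (3, 4), 'c': (1, 1)}
print(merge(local, remote) == {"a": 5, "b": 4, "c": 1})  # True
```

Because the merge is an element-wise maximum, applying it repeatedly in any order yields the same result, which is the property that lets the mesh converge over repeated gossip rounds.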

## Interfaces, APIs, and Events Involved

- Sync orchestrator scheduling
- TCP peer sync channels
- Vector clock exchange and reconciliation

## Permissions and Data Handling

- Peers with a valid authentication token can exchange replicated collection data.
- Cluster membership should be restricted to trusted nodes.

## Dependencies and Failure Modes

Dependencies:

- Network reachability
- Shared authentication material
- Healthy persistence layer

Failure modes:

- Peer isolation due to network outage
- Token mismatch blocking synchronization
- Sustained lag under high write pressure

## Monitoring, Alerts, and Troubleshooting Pointers

- Monitor `laggingPeers`, `maxLagMs`, and active peer counts.
- Follow [Runbook](../runbook.md) playbooks for lagging or disconnected peers.
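
As a rough illustration of how these metrics relate, the sketch below derives them from per-peer lag values. Only the names `laggingPeers` and `maxLagMs` come from this document; the `activePeers` key, the threshold, the function name, and the choice to report lagging peers as a list rather than a count are assumptions.

```python
def summarize_lag(peer_lag_ms: dict[str, int], lag_threshold_ms: int) -> dict[str, object]:
    """Summarize per-peer replication lag into health-style metrics."""
    lagging = sorted(
        peer for peer, lag in peer_lag_ms.items() if lag > lag_threshold_ms
    )
    return {
        "activePeers": len(peer_lag_ms),  # hypothetical key name
        "laggingPeers": lagging,
        "maxLagMs": max(peer_lag_ms.values(), default=0),
    }


summary = summarize_lag({"peer-a": 120, "peer-b": 4_800}, lag_threshold_ms=1_000)
print(summary)  # {'activePeers': 2, 'laggingPeers': ['peer-b'], 'maxLagMs': 4800}
```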

## Rollout and Change Considerations

- Roll out protocol-affecting changes in a controlled window.
- Confirm backward/forward compatibility in staging mesh before production rollout.

## Validation and Testability Guidance

- Run multi-node integration tests with controlled partitions.
- Validate eventual convergence after reconnect.
- Verify no data loss under repeated reconnect scenarios.

## Related Security Controls

- [Security](../security.md)
- [Access and Permissions](../access.md)
67
docs/features/secure-peer-transport.md
Normal file
@@ -0,0 +1,67 @@
# Feature: Secure Peer Transport

## Purpose and Business Outcome

Protect replicated data in transit with authenticated and encrypted peer communication.

## Scope and Non-Goals

Scope:

- Secure handshake and key establishment.
- Message confidentiality and integrity controls.

Non-goals:

- Data-at-rest encryption.
- Full identity and certificate lifecycle management.

## User and System Workflows

1. Operator enables secure transport components.
2. Peers perform handshake and establish session keys.
3. Replication traffic is encrypted and authenticated.
4. Health and logs expose secure mode status.

## Interfaces, APIs, and Events Involved

- `IPeerHandshakeService` / secure handshake implementation
- Network pipeline message encryption and HMAC validation
- Startup configuration for secure mode
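
The HMAC validation step can be sketched with the Python standard library. This sketch assumes HMAC-SHA256 and an already-established session key; the document does not state which hash function or key derivation CBDDC uses, and the function names are hypothetical. Encryption is omitted to keep the example focused on integrity checking.

```python
import hashlib
import hmac


def sign(session_key: bytes, payload: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the payload (hash choice is an assumption)."""
    return hmac.new(session_key, payload, hashlib.sha256).digest()


def verify(session_key: bytes, payload: bytes, tag: bytes) -> bool:
    """Check the tag using a constant-time comparison."""
    expected = sign(session_key, payload)
    return hmac.compare_digest(expected, tag)


key = b"k" * 32
tag = sign(key, b"oplog-batch")
print(verify(key, b"oplog-batch", tag))       # True
print(verify(key, b"oplog-batch!", tag))      # False: payload was altered
print(verify(b"x" * 32, b"oplog-batch", tag)) # False: key mismatch
```

The last case mirrors the handshake failure mode caused by a key or token mismatch: a peer holding different key material cannot produce or validate tags, so its messages are rejected.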

## Permissions and Data Handling

- Secret material (`AuthToken`, key inputs) must be restricted to authorized operators.
- Logs must avoid plaintext secret disclosure.

## Dependencies and Failure Modes

Dependencies:

- Consistent security mode across peers
- Valid runtime cryptographic dependencies

Failure modes:

- Secure/plaintext mode mismatch
- Handshake failure due to key/token mismatch

## Monitoring, Alerts, and Troubleshooting Pointers

- Alert on repeated handshake failures.
- Use [Runbook](../runbook.md) for incident triage and [Troubleshooting](../troubleshooting.md) for remediation.

## Rollout and Change Considerations

- Enable secure mode in staging first.
- Roll production nodes in controlled order to avoid mixed-mode partitions.

## Validation and Testability Guidance

- Add tests for secure-to-secure success and mixed-mode rejection.
- Validate encrypted cluster startup and sync with production-like load.

## Related Security Controls

- [Security](../security.md)
- [Access and Permissions](../access.md)
69
docs/features/selective-collection-sync.md
Normal file
@@ -0,0 +1,69 @@
# Feature: Selective Collection Sync

## Purpose and Business Outcome

Allow teams to replicate only selected collections so bandwidth and operational overhead stay aligned to business-critical data.

## Scope and Non-Goals

Scope:

- Register collections for replication using `WatchCollection()`.
- Replicate changes for registered collections across peers.

Non-goals:

- Automatic replication of all database collections.
- Schema migration management.

## User and System Workflows

1. Developer registers target collections in the document store.
2. Local writes trigger CDC events.
3. Oplog entries propagate through peer sync.
4. Remote peers apply updates for matching collections.
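
The registration and filtering behavior in steps 1 and 2 can be modeled with a toy store. This is a language-neutral illustration written in Python, not the CBDDC API: the class, its methods, and the oplog tuple shape are hypothetical stand-ins loosely modeled on `WatchCollection(collectionName, collection, keySelector)`.

```python
from typing import Any, Callable


class ToyDocumentStore:
    """Toy stand-in for a document store; not the CBDDC API."""

    def __init__(self) -> None:
        self._key_selectors: dict[str, Callable[[Any], str]] = {}
        self.oplog: list[tuple[str, str, Any]] = []

    def watch_collection(self, name: str, key_selector: Callable[[Any], str]) -> None:
        """Register a collection for replication (step 1)."""
        self._key_selectors[name] = key_selector

    def on_local_write(self, collection: str, document: Any) -> None:
        """Simulate a CDC event for a local write (step 2)."""
        key_selector = self._key_selectors.get(collection)
        if key_selector is None:
            return  # Unregistered: no oplog entry, so nothing replicates.
        self.oplog.append((collection, key_selector(document), document))


store = ToyDocumentStore()
store.watch_collection("orders", lambda doc: doc["id"])
store.on_local_write("orders", {"id": "o-1"})
store.on_local_write("audit", {"id": "a-1"})  # never registered
print(store.oplog)  # [('orders', 'o-1', {'id': 'o-1'})]
```

Filtering at the point where oplog entries are created means an unregistered collection never reaches peer sync at all, which is the behavior the integration tests in this document are meant to confirm.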

## Interfaces, APIs, and Events Involved

- `WatchCollection(collectionName, collection, keySelector)`
- CDC trigger pipeline
- Oplog append and apply operations

## Permissions and Data Handling

- Access to source collections is controlled by host application permissions.
- Only approved collections should be registered for sync in sensitive environments.

## Dependencies and Failure Modes

Dependencies:

- Correct collection registration
- Stable peer connectivity
- Persistence availability

Failure modes:

- Missed replication due to unregistered collection
- Delayed propagation during network partition

## Monitoring, Alerts, and Troubleshooting Pointers

- Monitor replication lag and peer confirmation metrics.
- Use [Runbook](../runbook.md) and [Troubleshooting](../troubleshooting.md) for incident response.

## Rollout and Change Considerations

- Introduce new synced collections behind a staged rollout.
- Validate downstream consumer compatibility before production enablement.

## Validation and Testability Guidance

- Add integration tests verifying only registered collections replicate.
- Smoke test by writing to registered and non-registered collections and confirming expected behavior.
- Validate no unexpected collection appears in remote peers after deployment.

## Related Security Controls

- [Security](../security.md)
- [Access and Permissions](../access.md)