docs: align internal docs to enterprise standards
All checks were successful
CI / verify (push) Successful in 2m33s
Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
164
README.md
@@ -5,7 +5,6 @@
|
||||
**Peer-to-Peer Data Synchronization Middleware for .NET**
|
||||
|
||||
[](https://dotnet.microsoft.com/)
|
||||
[](LICENSE)
|
||||
|
||||
## Status
|
||||

|
||||
@@ -22,9 +21,18 @@ CBDDC is not a database - it's a **sync layer** that plugs into your existing da
|
||||
## Table of Contents
|
||||
|
||||
- [Overview](#overview)
|
||||
- [Ownership and Support](#ownership-and-support)
|
||||
- [Architecture](#architecture)
|
||||
- [Key Features](#key-features)
|
||||
- [Installation](#installation)
|
||||
- [Prerequisites](#prerequisites)
|
||||
- [Configuration and Secrets](#configuration-and-secrets)
|
||||
- [Build, Test, and Quality Gates](#build-test-and-quality-gates)
|
||||
- [Deployment and Rollback](#deployment-and-rollback)
|
||||
- [Operations and Incident Response](#operations-and-incident-response)
|
||||
- [Security and Compliance Posture](#security-and-compliance-posture)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
- [Change Governance](#change-governance)
|
||||
- [Quick Start](#quick-start)
|
||||
- [Integrating with Your Database](#integrating-with-your-database)
|
||||
- [Cloud Deployment](#cloud-deployment)
|
||||
@@ -32,7 +40,6 @@ CBDDC is not a database - it's a **sync layer** that plugs into your existing da
|
||||
- [Use Cases](#use-cases)
|
||||
- [Documentation](#documentation)
|
||||
- [Contributing](#contributing)
|
||||
- [License](#license)
|
||||
|
||||
---
|
||||
|
||||
@@ -50,6 +57,15 @@ Your application continues to read and write to its database as usual. CBDDC wor
|
||||
|
||||
---
|
||||
|
||||
## Ownership and Support
|
||||
|
||||
- **Owning team**: CBDDC Core Maintainers
|
||||
- **Issue reporting**: [GitHub Issues](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues)
|
||||
- **Usage/support questions**: [GitHub Discussions](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/discussions)
|
||||
- **Incident escalation for active deployments**: Follow your local on-call process first, then attach the CBDDC diagnostics described in `docs/runbook.md`
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
@@ -163,6 +179,107 @@ dotnet add package CBDDC.Network
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Before onboarding a new node, verify:
|
||||
|
||||
- .NET SDK/runtime version required by the selected packages (recommended: .NET 8+)
|
||||
- Network access for configured TCP/UDP sync ports
|
||||
- Persistent storage path with write permissions for database and backups
|
||||
- Access to deployment secrets (cluster `AuthToken`, connection strings, environment-specific config)
|
||||
- Ability to run health checks (`/health` in hosted mode)
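
Parts of the checklist above can be scripted as a pre-flight check. The following is a minimal sketch, not a CBDDC tool: the function names and the idea of passing the storage path as an argument are assumptions made for illustration. Port reachability and secret access depend on the environment and are left out.

```bash
# Pre-flight sketch. All function names here are illustrative assumptions,
# not part of CBDDC.

# Prints the major component of a version string such as "8.0.100".
major_version() {
  echo "${1%%.*}"
}

# Succeeds when the installed .NET SDK major version is at least the minimum.
check_dotnet_major() {
  local min="$1" version
  version="$(dotnet --version 2>/dev/null)" || return 1
  [ "$(major_version "$version")" -ge "$min" ]
}

# Succeeds when the directory exists and is writable by the current user.
check_writable_dir() {
  local dir="$1"
  [ -d "$dir" ] && [ -w "$dir" ]
}

# Runs the checks and reports every failure instead of stopping at the first.
preflight() {
  local data_dir="$1" failed=0
  check_dotnet_major 8 || { echo "FAIL: .NET SDK 8+ not found"; failed=1; }
  check_writable_dir "$data_dir" || { echo "FAIL: $data_dir is not writable"; failed=1; }
  return "$failed"
}
```

A node operator might call `preflight /var/lib/cbddc` (the path is a placeholder) before starting the service for the first time.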
|
||||
|
||||
---
|
||||
|
||||
## Configuration and Secrets
|
||||
|
||||
CBDDC runtime behavior should be configured from environment-specific settings, not hardcoded values.
|
||||
|
||||
- Required runtime settings:
|
||||
- `NodeId`
|
||||
- `TcpPort` (and `UdpPort` when discovery is enabled)
|
||||
- `AuthToken` (secret; do not commit)
|
||||
- Persistence connection/database path
|
||||
- Secret handling guidance:
|
||||
- Keep secrets in a secret manager or protected environment variables
|
||||
- Rotate tokens during maintenance windows and validate mesh convergence after rotation
|
||||
- Avoid exposing tokens in logs, screenshots, or troubleshooting output
|
||||
|
||||
See `docs/security.md` and `docs/production-hardening.md` for detailed control guidance.
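
The required settings listed above can be validated at startup without ever echoing the secret. This is a hedged sketch: the environment variable names (`CBDDC_NODE_ID`, `CBDDC_TCP_PORT`, `CBDDC_AUTH_TOKEN`) are hypothetical and chosen only for the example; CBDDC does not define them.

```bash
# Configuration check sketch. The CBDDC_* variable names are assumptions.

# Succeeds when the argument is an integer in the range 1..65535.
is_valid_port() {
  local port="$1" pattern='^[1-9][0-9]{0,4}$'
  [[ "$port" =~ $pattern ]] && [ "$port" -le 65535 ]
}

# Prints a redacted form of a secret: its length only, never its content.
mask_secret() {
  local secret="$1"
  printf '<redacted:%d chars>\n' "${#secret}"
}

# Reports every missing or invalid setting; returns non-zero if any was found.
validate_config() {
  local failed=0
  [ -n "${CBDDC_NODE_ID:-}" ] || { echo "NodeId is missing"; failed=1; }
  is_valid_port "${CBDDC_TCP_PORT:-}" || { echo "TcpPort is missing or invalid"; failed=1; }
  [ -n "${CBDDC_AUTH_TOKEN:-}" ] || { echo "AuthToken is missing"; failed=1; }
  return "$failed"
}
```

Logging `mask_secret "$CBDDC_AUTH_TOKEN"` instead of the token itself keeps troubleshooting output safe to share.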
|
||||
|
||||
---
|
||||
|
||||
## Build, Test, and Quality Gates
|
||||
|
||||
Run these checks before merge or release:
|
||||
|
||||
```bash
dotnet restore
dotnet build
dotnet test
```
|
||||
|
||||
Recommended release gates:
|
||||
|
||||
- Unit/integration tests pass
|
||||
- Health endpoint reports healthy in staging after deployment
|
||||
- No unresolved high-severity issues in deployment or runbook checklists
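
In CI, the checks above are usually run fail-fast, so a broken build never reaches the test step. A minimal sketch of such a wrapper follows; `run_gates` is a hypothetical helper name, not something shipped with CBDDC.

```bash
# Gate runner sketch. run_gates is an illustrative helper, not a CBDDC tool.

# Runs each gate command in order and stops at the first failure.
run_gates() {
  local gate
  for gate in "$@"; do
    echo "==> $gate"
    # Word splitting is intentional: each gate is a simple command line.
    $gate || { echo "Gate failed: $gate"; return 1; }
  done
  echo "All gates passed"
}

# Usage:
#   run_gates "dotnet restore" "dotnet build" "dotnet test"
```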
|
||||
|
||||
---
|
||||
|
||||
## Deployment and Rollback
|
||||
|
||||
Canonical deployment procedures are documented in:
|
||||
|
||||
- `docs/deployment.md` (promotion flow, validation gates, rollback path)
|
||||
- `docs/deployment-lan.md` (LAN deployment details)
|
||||
- `docs/deployment-modes.md` (hosted mode behavior)
|
||||
- `docs/production-hardening.md` (operational hardening controls)
|
||||
|
||||
---
|
||||
|
||||
## Operations and Incident Response
|
||||
|
||||
Use `docs/runbook.md` as the primary operational runbook. It includes:
|
||||
|
||||
- Alert triage sequence
|
||||
- Escalation path and severity mapping
|
||||
- Recovery and verification steps
|
||||
- Post-incident follow-up expectations
|
||||
|
||||
For peer lifecycle incidents, use `docs/peer-deprecation-removal-runbook.md`.
|
||||
|
||||
---
|
||||
|
||||
## Security and Compliance Posture
|
||||
|
||||
- Primary protections: authenticated peer communication and encrypted transport options in CBDDC networking components
|
||||
- Sensitive data handling: classify data by environment and apply least privilege on runtime access
|
||||
- Operational controls: hardening, key rotation, and monitoring requirements are documented in `docs/security.md` and `docs/access.md`
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
Common issue patterns and step-by-step remediation are documented in `docs/troubleshooting.md`.
|
||||
|
||||
High-priority troubleshooting topics:
|
||||
|
||||
- Peer unreachable or lagging confirmation state
|
||||
- Persistence faults and backup restore flow
|
||||
- Authentication mismatch or configuration drift
|
||||
|
||||
---
|
||||
|
||||
## Change Governance
|
||||
|
||||
- Use short-lived branches and pull requests for every change
|
||||
- Require review before merge to protected branches
|
||||
- Run build/test/health validation before release
|
||||
- Record release-impacting changes in `CHANGELOG.md` and link related deployment notes
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
### 1. Define Your Database Context
|
||||
@@ -608,13 +725,14 @@ Console.WriteLine($"Peers: {status.ConnectedPeers}");
|
||||
|
||||
- **[Architecture & Concepts](docs/architecture.md)** - HLC, Gossip, Vector Clocks, Hash Chains
|
||||
- **[Conflict Resolution](docs/conflict-resolution.md)** - LWW vs Recursive Merge
|
||||
- **[Oplog & CDC](docs/oplog-cdc.md)** - How change tracking works
|
||||
- **[Oplog & CDC Design](docs/database-sync-manager-design.md)** - How change tracking works
|
||||
|
||||
### Deployment
|
||||
|
||||
- **[Production Guide](docs/production-hardening.md)** - Configuration, monitoring, best practices
|
||||
- **[Cloud Deployment](docs/cloud-deployment.md)** - ASP.NET Core hosting
|
||||
- **[Deployment Runbook](docs/deployment.md)** - Promotion flow, validation, rollback
|
||||
- **[Deployment Modes](docs/deployment-modes.md)** - Single-cluster deployment strategy
|
||||
- **[Operations Runbook](docs/runbook.md)** - Incident triage and recovery
|
||||
|
||||
### API
|
||||
|
||||
@@ -711,41 +829,3 @@ dotnet run
|
||||
### Code of Conduct
|
||||
|
||||
Be respectful, inclusive, and constructive. We're all here to learn and build great software together.
|
||||
|
||||
---
|
||||
|
||||
## License
|
||||
|
||||
CBDDC is licensed under the **MIT License**.
|
||||
|
||||
```
MIT License

Copyright (c) 2026 MrDevRobot

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software...
```
|
||||
|
||||
See [LICENSE](LICENSE) file for full details.
|
||||
|
||||
---
|
||||
|
||||
## Give it a Star!
|
||||
|
||||
If you find CBDDC useful, please **give it a star** on GitHub! It helps others discover the project and motivates us to keep improving it.
|
||||
|
||||
<div align="center">
|
||||
|
||||
### [Star on GitHub](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
|
||||
|
||||
**Thank you for your support!**
|
||||
|
||||
**Built with care for the .NET community**
|
||||
|
||||
[Report Bug](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) | [Request Feature](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) | [Discussions](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/discussions)
|
||||
|
||||
</div>
|
||||
|
||||
116
docs/README.md
@@ -1,73 +1,45 @@
|
||||
# CBDDC Documentation
|
||||
|
||||
This folder contains the official documentation for CBDDC, published as GitHub Pages.
|
||||
|
||||
## Documentation Structure (v0.9.0)
|
||||
|
||||
### Getting Started
|
||||
- **[Getting Started](getting-started.md)** - Installation, setup, and first steps with CBDDC
|
||||
|
||||
### Core Documentation
|
||||
- **[Architecture](architecture.md)** - Hybrid Logical Clocks, Gossip Protocol, mesh networking
|
||||
- **[API Reference](api-reference.md)** - Complete API documentation with examples
|
||||
- **[Querying](querying.md)** - Data querying patterns and LINQ support
|
||||
|
||||
### Persistence & Storage
|
||||
- **[Persistence Providers](persistence-providers.md)** - SQLite, EF Core, PostgreSQL comparison
|
||||
- **[Deployment Modes](deployment-modes.md)** - Single-cluster deployment strategy
|
||||
|
||||
### Networking & Security
|
||||
- **[Security](security.md)** - Encryption, authentication, secure networking
|
||||
- **[Conflict Resolution](conflict-resolution.md)** - LWW and Recursive Merge strategies
|
||||
- **[Network Telemetry](network-telemetry.md)** - Monitoring and diagnostics
|
||||
- **[Dynamic Reconfiguration](dynamic-reconfiguration.md)** - Runtime configuration changes
|
||||
- **[Remote Peer Configuration](remote-peer-configuration.md)** - Managing remote peers and tracking lifecycle
|
||||
- **[Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)** - Rollout notes and adoption checklist
|
||||
# CBDDC Documentation Index
|
||||
|
||||
### Deployment & Operations
|
||||
- **[Deployment (LAN)](deployment-lan.md)** - Platform-specific deployment guide
|
||||
- **[Production Hardening](production-hardening.md)** - Configuration, monitoring, best practices
|
||||
- **[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md)** - Operational workflow for de-tracking and removal
|
||||
|
||||
## Building the Documentation
|
||||
|
||||
This documentation uses Jekyll with the Cayman theme and is automatically published via GitHub Pages.
|
||||
|
||||
### Local Development
|
||||
|
||||
```bash
# Install Jekyll (requires Ruby)
gem install bundler jekyll

# Serve documentation locally
cd docs
jekyll serve

# Open http://localhost:4000
```
|
||||
|
||||
### Site Configuration
|
||||
|
||||
- **_config.yml** - Jekyll configuration and site metadata
|
||||
- **_layouts/default.html** - Main page layout with navigation and styling
|
||||
- **_includes/nav.html** - Top navigation bar
|
||||
- **_data/navigation.yml** - Sidebar navigation structure per version
|
||||
|
||||
## Version History
|
||||
|
||||
The documentation supports multiple versions:
|
||||
- **v0.9** (current) - Latest stable release
|
||||
- **v0.8** - ASP.NET Core hosting, EF Core, PostgreSQL support
|
||||
- **v0.7** - Brotli compression, Protocol v4
|
||||
- **v0.6** - Secure networking, conflict resolution strategies
|
||||
|
||||
## Contributing to Documentation
|
||||
|
||||
1. **Fix typos or improve clarity** - Submit PRs directly
|
||||
2. **Add new guides** - Create a new .md file and update navigation.yml
|
||||
3. **Test locally** - Run Jekyll locally to preview changes before submitting
|
||||
|
||||
## Links
|
||||
|
||||
- [Main Repository](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
|
||||
- [NuGet Packages](https://www.nuget.org/packages?q=CBDDC)
|
||||
This folder contains operational and engineering documentation for CBDDC.
|
||||
|
||||
## Core Documentation
|
||||
|
||||
- [Architecture](architecture.md)
|
||||
- [Deployment](deployment.md)
|
||||
- [Runbook](runbook.md)
|
||||
- [Security](security.md)
|
||||
- [Access and Permissions](access.md)
|
||||
- [Troubleshooting](troubleshooting.md)
|
||||
|
||||
## Feature Documentation
|
||||
|
||||
- [Feature Inventory](features/README.md)
|
||||
- [Selective Collection Sync](features/selective-collection-sync.md)
|
||||
- [Peer-to-Peer Gossip Sync](features/peer-to-peer-gossip-sync.md)
|
||||
- [Secure Peer Transport](features/secure-peer-transport.md)
|
||||
- [Peer-Confirmed Pruning](features/peer-confirmed-pruning.md)
|
||||
|
||||
## Additional Technical References
|
||||
|
||||
- [Getting Started](getting-started.md)
|
||||
- [API Reference](api-reference.md)
|
||||
- [Conflict Resolution](conflict-resolution.md)
|
||||
- [Persistence Providers](persistence-providers.md)
|
||||
- [Network Telemetry](network-telemetry.md)
|
||||
- [Dynamic Reconfiguration](dynamic-reconfiguration.md)
|
||||
- [Remote Peer Configuration](remote-peer-configuration.md)
|
||||
- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
|
||||
- [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md)
|
||||
- [Database Sync Manager Design](database-sync-manager-design.md)
|
||||
|
||||
## Deployment-Specific Guides
|
||||
|
||||
- [Deployment Modes](deployment-modes.md)
|
||||
- [Deployment (LAN)](deployment-lan.md)
|
||||
- [Production Hardening](production-hardening.md)
|
||||
|
||||
## Documentation Conventions
|
||||
|
||||
- Use lowercase kebab-case for markdown files.
|
||||
- Keep canonical operational docs at the top level under `docs/`.
|
||||
- Store major feature documentation in `docs/features/` and keep `docs/features/README.md` updated.
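
The naming convention above is easy to enforce mechanically. The sketch below is one possible check, assuming `README.md` is the only intended exception (the docs tree itself uses that name for index files); `is_kebab_case_md` is a hypothetical helper.

```bash
# Filename convention sketch. is_kebab_case_md is an illustrative helper.

# Succeeds when a markdown filename is lowercase kebab-case.
# README.md is exempted because the docs tree uses it for index files.
is_kebab_case_md() {
  local name="$1" pattern='^[a-z0-9]+(-[a-z0-9]+)*\.md$'
  [ "$name" = "README.md" ] && return 0
  [[ "$name" =~ $pattern ]]
}

# Usage: list files under docs/ that break the convention.
#   find docs -name '*.md' | while read -r f; do
#     is_kebab_case_md "$(basename "$f")" || echo "rename: $f"
#   done
```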
|
||||
|
||||
44
docs/access.md
Normal file
@@ -0,0 +1,44 @@
|
||||
# Access and Permissions
|
||||
|
||||
This document defines the least-privilege access model for CBDDC environments.
|
||||
|
||||
## Roles
|
||||
|
||||
| Role | Typical Permissions | Approval Required |
|------|---------------------|-------------------|
| Runtime Operator | Read health/logs, restart service, run incident checks | Team lead or on-call manager |
| Deployment Engineer | Deploy approved releases, update runtime configuration | Change approval for production |
| Security Administrator | Manage secrets, rotate tokens, review access | Security approval |
| Maintainer | Modify CBDDC source/docs, merge reviewed changes | Pull request review |
|
||||
|
||||
## Least-Privilege Rules
|
||||
|
||||
- Grant access by role, not by individual preference.
|
||||
- Use environment-specific credentials and scoped service accounts.
|
||||
- Do not share production credentials across environments.
|
||||
- Remove elevated access promptly after the incident or change window closes.
|
||||
|
||||
## Approval Flow
|
||||
|
||||
1. Request access with role, environment, and business reason.
|
||||
2. Approver validates least-privilege scope.
|
||||
3. Access is granted with an expiration date when applicable.
|
||||
4. Grant/revoke events are logged for auditability.
|
||||
|
||||
## Periodic Access Review
|
||||
|
||||
- Review active privileged access at least quarterly.
|
||||
- Remove dormant or unowned accounts immediately.
|
||||
- Validate that emergency access accounts are controlled and monitored.
|
||||
|
||||
## Secret Handling
|
||||
|
||||
- Store `AuthToken`, connection strings, and credentials in approved secret stores.
|
||||
- Never commit secrets to source control.
|
||||
- Rotate secrets after incidents and on a scheduled cadence.
|
||||
|
||||
## Related Documents
|
||||
|
||||
- [Security](security.md)
|
||||
- [Runbook](runbook.md)
|
||||
- [Production Hardening](production-hardening.md)
|
||||
19
docs/adr/0001-peer-mesh-topology.md
Normal file
@@ -0,0 +1,19 @@
|
||||
# ADR 0001: Peer Mesh Topology
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
CBDDC requires local-first synchronization without a central coordinator in LAN-focused deployments.
|
||||
|
||||
## Decision
|
||||
|
||||
Adopt a peer mesh topology with gossip-based synchronization and per-node oplog tracking.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Improves resilience when individual nodes disconnect.
|
||||
- Requires strong operational tooling for lag, peer lifecycle, and pruning controls.
|
||||
- Increases importance of clear runbook and troubleshooting documentation.
|
||||
@@ -246,6 +246,6 @@ A: No. Conflict resolution assumes both documents have compatible schemas.
|
||||
---
|
||||
|
||||
**See Also:**
|
||||
- [Getting Started](getting-started.html)
|
||||
- [Architecture](architecture.html)
|
||||
- [Security](security.html)
|
||||
- [Getting Started](getting-started.md)
|
||||
- [Architecture](architecture.md)
|
||||
- [Security](security.md)
|
||||
|
||||
61
docs/deployment.md
Normal file
@@ -0,0 +1,61 @@
|
||||
# Deployment
|
||||
|
||||
This document defines the canonical CBDDC deployment workflow, validation gates, and rollback procedures.
|
||||
|
||||
## Environments
|
||||
|
||||
- Local development: single node for functional verification.
|
||||
- Staging: multi-node mesh or hosted mode with production-like configuration.
|
||||
- Production: controlled rollout with health checks, monitoring, and rollback readiness.
|
||||
|
||||
## Promotion Workflow
|
||||
|
||||
1. Build and test in CI (`dotnet build`, `dotnet test`).
|
||||
2. Validate configuration and secrets for the target environment.
|
||||
3. Deploy to staging.
|
||||
4. Run smoke checks:
|
||||
- Node starts and joins expected peers.
|
||||
- `/health` reports healthy.
|
||||
- Sync operations replicate across target collections.
|
||||
5. Promote to production using an approved change window.
|
||||
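
The `/health` smoke check in step 4 can be automated with a small retry loop, since a freshly deployed node may need a few seconds before it reports healthy. This is a sketch under assumptions: the helper names and the base URL are placeholders, and only the `/health` path comes from this document.

```bash
# Smoke check sketch. Helper names and the base URL are assumptions.

# Retries a probe command until it succeeds or the attempts run out.
wait_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Probes the hosted-mode health endpoint; -f makes curl fail on HTTP errors.
probe_health() {
  curl -fsS --max-time 5 "$1/health" >/dev/null
}

# Usage (base URL is a placeholder):
#   wait_until 10 3 probe_health "http://localhost:5000" || echo "node is not healthy"
```

Keeping the probe separate from the retry loop lets the same loop wait on other smoke checks, such as a replication test command.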
|
||||
## Validation Gates
|
||||
|
||||
- No failed test suites.
|
||||
- No unresolved critical incidents.
|
||||
- Health check status is healthy, or is in an approved degraded state with a documented mitigation.
|
||||
- Backup/restore path confirmed for the active persistence provider.
|
||||
|
||||
## Rollback Triggers
|
||||
|
||||
Roll back immediately when any of the following occurs:
|
||||
|
||||
- Sync correctness regression (missed or duplicate replication events).
|
||||
- Persistent unhealthy status after a remediation attempt.
|
||||
- Authentication or connectivity failure across required peers.
|
||||
- Data integrity concern reported by validation checks.
|
||||
|
||||
## Rollback Procedure
|
||||
|
||||
1. Stop traffic or disable rollout to newly deployed nodes.
|
||||
2. Revert to the last known-good build.
|
||||
3. Restore previous configuration and secret set.
|
||||
4. If needed, restore persistence snapshot/backup.
|
||||
5. Re-run health and replication smoke checks.
|
||||
6. Record incident details in the runbook timeline.
|
||||
|
||||
## Emergency Changes
|
||||
|
||||
Use emergency deployment only for incident containment:
|
||||
|
||||
1. Capture incident reference and approver.
|
||||
2. Apply a minimally scoped change.
|
||||
3. Run abbreviated smoke checks (startup, health, critical replication path).
|
||||
4. Follow up with standard post-incident review.
|
||||
|
||||
## Related Documents
|
||||
|
||||
- [Deployment Modes](deployment-modes.md)
|
||||
- [Deployment (LAN)](deployment-lan.md)
|
||||
- [Production Hardening](production-hardening.md)
|
||||
- [Runbook](runbook.md)
|
||||
16
docs/features/README.md
Normal file
@@ -0,0 +1,16 @@
|
||||
# Major Feature Inventory
|
||||
|
||||
This index tracks CBDDC major functionality. Each feature has one canonical document.
|
||||
|
||||
## Features
|
||||
|
||||
- [Selective Collection Sync](selective-collection-sync.md)
|
||||
- [Peer-to-Peer Gossip Sync](peer-to-peer-gossip-sync.md)
|
||||
- [Secure Peer Transport](secure-peer-transport.md)
|
||||
- [Peer-Confirmed Pruning](peer-confirmed-pruning.md)
|
||||
|
||||
## Maintenance Rules
|
||||
|
||||
- Keep one file per major feature.
|
||||
- Update feature docs when APIs, operations, or security controls change.
|
||||
- Cross-link each feature to [Runbook](../runbook.md) and [Security](../security.md).
|
||||
68
docs/features/peer-confirmed-pruning.md
Normal file
@@ -0,0 +1,68 @@
|
||||
# Feature: Peer-Confirmed Pruning
|
||||
|
||||
## Purpose and Business Outcome
|
||||
|
||||
Prune oplog history safely while preventing data loss for active peers that have not confirmed required streams.
|
||||
|
||||
## Scope and Non-Goals
|
||||
|
||||
Scope:
|
||||
|
||||
- Retention cutoff plus peer confirmation cutoff logic.
|
||||
- Operational controls for peer tracking and de-tracking.
|
||||
|
||||
Non-goals:
|
||||
|
||||
- Automatic removal of retired peers without operator action.
|
||||
- Replacement for backup/restore strategy.
|
||||
|
||||
## User and System Workflows
|
||||
|
||||
1. Maintenance job calculates retention and confirmation cutoffs.
|
||||
2. System blocks pruning when required confirmations are missing.
|
||||
3. Operator de-tracks retired peers when appropriate.
|
||||
4. Pruning resumes once constraints are satisfied.
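
The interaction between the two cutoffs in the workflow above can be sketched as follows. This is an illustrative model, not the CBDDC implementation: it assumes timestamps are plain integers, that the effective cutoff is the older of the retention cutoff and the oldest peer confirmation, and that `-` marks a tracked peer with no confirmation.

```bash
# Pruning cutoff sketch. The rule modeled here is an assumption about how
# retention and peer confirmation combine, not CBDDC's actual code.

# Prints the effective prune cutoff, or "blocked" when a tracked peer has no
# confirmation. Arguments: retention cutoff, then one confirmed timestamp per
# tracked peer ("-" marks a missing confirmation).
effective_prune_cutoff() {
  local cutoff="$1" confirmed
  shift
  for confirmed in "$@"; do
    if [ "$confirmed" = "-" ]; then
      echo "blocked"
      return 1
    fi
    # Never prune past what the slowest peer has confirmed.
    if [ "$confirmed" -lt "$cutoff" ]; then
      cutoff="$confirmed"
    fi
  done
  echo "$cutoff"
}
```

Under this model, de-tracking a retired peer simply removes its argument from the call, which is why pruning can resume afterwards.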
|
||||
|
||||
## Interfaces, APIs, and Events Involved
|
||||
|
||||
- Maintenance pruning scheduler
|
||||
- Peer tracking operations (`RemovePeerTrackingAsync`, `RemoveRemotePeerAsync`)
|
||||
- Health metrics for lag and missing confirmations
|
||||
|
||||
## Permissions and Data Handling
|
||||
|
||||
- Only approved operators should modify peer tracking state.
|
||||
- Pruning decisions must be auditable through logs and incident notes.
|
||||
|
||||
## Dependencies and Failure Modes
|
||||
|
||||
Dependencies:
|
||||
|
||||
- Accurate peer tracking metadata
|
||||
- Timely confirmation updates per source stream
|
||||
|
||||
Failure modes:
|
||||
|
||||
- Pruning blocked indefinitely by stale peer tracking
|
||||
- Unsafe pruning if controls are bypassed
|
||||
|
||||
## Monitoring, Alerts, and Troubleshooting Pointers
|
||||
|
||||
- Monitor peers with missing confirmations and sustained lag.
|
||||
- Use [Peer Deprecation and Removal Runbook](../peer-deprecation-removal-runbook.md) for operational actions.
|
||||
|
||||
## Rollout and Change Considerations
|
||||
|
||||
- Introduce pruning policy changes with explicit maintenance windows.
|
||||
- Validate expected cutoff behavior in staging before production rollout.
|
||||
|
||||
## Validation and Testability Guidance
|
||||
|
||||
- Add tests for blocked prune when confirmations are missing.
|
||||
- Add tests for resumed prune after de-tracking retired peers.
|
||||
- Smoke test health status transitions around peer lifecycle changes.
|
||||
|
||||
## Related Security Controls
|
||||
|
||||
- [Security](../security.md)
|
||||
- [Runbook](../runbook.md)
|
||||
70
docs/features/peer-to-peer-gossip-sync.md
Normal file
@@ -0,0 +1,70 @@
|
||||
# Feature: Peer-to-Peer Gossip Sync
|
||||
|
||||
## Purpose and Business Outcome
|
||||
|
||||
Propagate updates across mesh nodes without a central coordinator so local-first applications remain resilient to intermittent connectivity.
|
||||
|
||||
## Scope and Non-Goals
|
||||
|
||||
Scope:
|
||||
|
||||
- Peer discovery and sync orchestration.
|
||||
- Push/pull propagation of oplog changes.
|
||||
|
||||
Non-goals:
|
||||
|
||||
- Strong global consistency guarantees.
|
||||
- Public internet exposure without additional network controls.
|
||||
|
||||
## User and System Workflows
|
||||
|
||||
1. Node starts and discovers peers.
|
||||
2. Node exchanges sync metadata with connected peers.
|
||||
3. Missing operations are requested and applied.
|
||||
4. Mesh converges over repeated gossip rounds.
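
Steps 2 and 3 above amount to comparing sync metadata and requesting whatever the local node lacks. The sketch below models that with a deliberately simplified vector clock of per-node integer counters written as `node:counter` pairs; the real CBDDC metadata format is not described here, so treat the representation and the helper names as assumptions.

```bash
# Gossip reconciliation sketch. The "node:counter" clock format and both
# helper names are assumptions made for illustration.

# Prints the counter recorded for a node in a clock such as "a:3 b:5",
# or 0 when the node is absent.
clock_get() {
  local clock="$1" node="$2" entry
  for entry in $clock; do
    if [ "${entry%%:*}" = "$node" ]; then
      echo "${entry#*:}"
      return 0
    fi
  done
  echo 0
}

# Prints one "node from to" line for each node whose operations the local
# side is missing, i.e. where the remote counter is ahead of the local one.
missing_ranges() {
  local local_clock="$1" remote_clock="$2" entry node remote_n local_n
  for entry in $remote_clock; do
    node="${entry%%:*}"
    remote_n="${entry#*:}"
    local_n="$(clock_get "$local_clock" "$node")"
    if [ "$remote_n" -gt "$local_n" ]; then
      echo "$node $((local_n + 1)) $remote_n"
    fi
  done
}
```

For example, `missing_ranges "a:3 b:5" "a:4 b:5 c:2"` reports that operation 4 from node `a` and operations 1 through 2 from node `c` still need to be pulled.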
|
||||
|
||||
## Interfaces, APIs, and Events Involved
|
||||
|
||||
- Sync orchestrator scheduling
|
||||
- TCP peer sync channels
|
||||
- Vector clock exchange and reconciliation
|
||||
|
||||
## Permissions and Data Handling
|
||||
|
||||
- Peers with a valid authentication token can exchange replicated collection data.
|
||||
- Cluster membership should be restricted to trusted nodes.
|
||||
|
||||
## Dependencies and Failure Modes
|
||||
|
||||
Dependencies:
|
||||
|
||||
- Network reachability
|
||||
- Shared authentication material
|
||||
- Healthy persistence layer
|
||||
|
||||
Failure modes:
|
||||
|
||||
- Peer isolation due to network outage
|
||||
- Token mismatch blocking synchronization
|
||||
- Sustained lag under high write pressure
|
||||
|
||||
## Monitoring, Alerts, and Troubleshooting Pointers
|
||||
|
||||
- Monitor `laggingPeers`, `maxLagMs`, and active peer counts.
|
||||
- Follow [Runbook](../runbook.md) playbooks for lagging or disconnected peers.
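
A lag alert can be derived from the metrics named above. The sketch assumes the health payload is a flat JSON object containing integer `laggingPeers` and `maxLagMs` fields; only the field names come from this document, and the payload shape is a guess. A real deployment would more likely parse the payload with a JSON tool such as `jq`.

```bash
# Lag check sketch. The flat JSON payload shape is an assumption.

# Extracts an integer field from a flat JSON object read from stdin.
json_int_field() {
  local field="$1"
  sed -n "s/.*\"$field\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Succeeds when maxLagMs in the payload is at or below the threshold.
lag_within() {
  local payload="$1" threshold_ms="$2" lag
  lag="$(printf '%s\n' "$payload" | json_int_field maxLagMs)"
  [ -n "$lag" ] && [ "$lag" -le "$threshold_ms" ]
}
```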
|
||||
|
||||
## Rollout and Change Considerations
|
||||
|
||||
- Roll out protocol-affecting changes in a controlled window.
|
||||
- Confirm backward/forward compatibility in staging mesh before production rollout.
|
||||
|
||||
## Validation and Testability Guidance
|
||||
|
||||
- Run multi-node integration tests with controlled partitions.
|
||||
- Validate eventual convergence after reconnect.
|
||||
- Verify no data loss under repeated reconnect scenarios.
|
||||
|
||||
## Related Security Controls
|
||||
|
||||
- [Security](../security.md)
|
||||
- [Access and Permissions](../access.md)
|
||||
67
docs/features/secure-peer-transport.md
Normal file
@@ -0,0 +1,67 @@
|
||||
# Feature: Secure Peer Transport
|
||||
|
||||
## Purpose and Business Outcome
|
||||
|
||||
Protect replicated data in transit with authenticated and encrypted peer communication.
|
||||
|
||||
## Scope and Non-Goals
|
||||
|
||||
Scope:
|
||||
|
||||
- Secure handshake and key establishment.
|
||||
- Message confidentiality and integrity controls.
|
||||
|
||||
Non-goals:
|
||||
|
||||
- Data-at-rest encryption.
|
||||
- Full identity and certificate lifecycle management.
|
||||
|
||||
## User and System Workflows
|
||||
|
||||
1. Operator enables secure transport components.
|
||||
2. Peers perform handshake and establish session keys.
|
||||
3. Replication traffic is encrypted and authenticated.
|
||||
4. Health and logs expose secure mode status.
|
||||
|
||||
## Interfaces, APIs, and Events Involved
|
||||
|
||||
- `IPeerHandshakeService` / secure handshake implementation
|
||||
- Network pipeline message encryption and HMAC validation
|
||||
- Startup configuration for secure mode
|
||||
|
||||
## Permissions and Data Handling
|
||||
|
||||
- Secret material (`AuthToken`, key inputs) must be restricted to authorized operators.
|
||||
- Logs must avoid plaintext secret disclosure.
|
||||
|
||||
## Dependencies and Failure Modes
|
||||
|
||||
Dependencies:
|
||||
|
||||
- Consistent security mode across peers
|
||||
- Valid runtime cryptographic dependencies
|
||||
|
||||
Failure modes:
|
||||
|
||||
- Secure/plaintext mode mismatch
|
||||
- Handshake failure due to key/token mismatch
|
||||
|
||||
## Monitoring, Alerts, and Troubleshooting Pointers
|
||||
|
||||
- Alert on repeated handshake failures.
|
||||
- Use [Runbook](../runbook.md) for incident triage and [Troubleshooting](../troubleshooting.md) for remediation.
|
||||
|
||||
## Rollout and Change Considerations
|
||||
|
||||
- Enable secure mode in staging first.
|
||||
- Roll production nodes in controlled order to avoid mixed-mode partitions.
|
||||
|
||||
## Validation and Testability Guidance
|
||||
|
||||
- Add tests for secure-to-secure success and mixed-mode rejection.
|
||||
- Validate encrypted cluster startup and sync with production-like load.
|
||||
|
||||
## Related Security Controls
|
||||
|
||||
- [Security](../security.md)
|
||||
- [Access and Permissions](../access.md)
|
||||
69
docs/features/selective-collection-sync.md
Normal file
@@ -0,0 +1,69 @@
|
||||
# Feature: Selective Collection Sync
|
||||
|
||||
## Purpose and Business Outcome
|
||||
|
||||
Allow teams to replicate only selected collections so bandwidth and operational overhead stay aligned to business-critical data.
|
||||
|
||||
## Scope and Non-Goals
|
||||
|
||||
Scope:
|
||||
|
||||
- Register collections for replication using `WatchCollection()`.
|
||||
- Replicate changes for registered collections across peers.
|
||||
|
||||
Non-goals:
|
||||
|
||||
- Automatic replication of all database collections.
|
||||
- Schema migration management.
|
||||
|
||||
## User and System Workflows
|
||||
|
||||
1. Developer registers target collections in the document store.
|
||||
2. Local writes trigger CDC events.
|
||||
3. Oplog entries propagate through peer sync.
|
||||
4. Remote peers apply updates for matching collections.
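
The effect of registration in the workflow above is a filter: changes to unregistered collections never enter the replication path. The sketch below illustrates that idea on a stream of `collection key` lines. It is a conceptual model only; the line format and helper names are assumptions, and the real registration happens through `WatchCollection()` in the host application.

```bash
# Selective sync sketch. The "collection key" line format and both helper
# names are illustrative assumptions.

# Succeeds when the collection name appears in the registered list.
is_registered() {
  local name="$1" registered
  shift
  for registered in "$@"; do
    [ "$registered" = "$name" ] && return 0
  done
  return 1
}

# Reads "collection key" lines from stdin and prints only those whose
# collection is registered; everything else is skipped.
filter_oplog() {
  local collection key
  while read -r collection key; do
    if is_registered "$collection" "$@"; then
      echo "$collection $key"
    fi
  done
}
```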
|
||||
|
||||
## Interfaces, APIs, and Events Involved
|
||||
|
||||
- `WatchCollection(collectionName, collection, keySelector)`
|
||||
- CDC trigger pipeline
|
||||
- Oplog append and apply operations
|
||||
|
||||
## Permissions and Data Handling
|
||||
|
||||
- Access to source collections is controlled by host application permissions.
|
||||
- Only approved collections should be registered for sync in sensitive environments.
|
||||
|
||||
## Dependencies and Failure Modes
|
||||
|
||||
Dependencies:
|
||||
|
||||
- Correct collection registration
|
||||
- Stable peer connectivity
|
||||
- Persistence availability
|
||||
|
||||
Failure modes:
|
||||
|
||||
- Missed replication due to unregistered collection
|
||||
- Delayed propagation during network partition
|
||||
|
||||
## Monitoring, Alerts, and Troubleshooting Pointers
|
||||
|
||||
- Monitor replication lag and peer confirmation metrics.
|
||||
- Use [Runbook](../runbook.md) and [Troubleshooting](../troubleshooting.md) for incident response.
|
||||
|
||||
## Rollout and Change Considerations
|
||||
|
||||
- Introduce new synced collections behind a staged rollout.
|
||||
- Validate downstream consumer compatibility before production enablement.
|
||||
|
||||
## Validation and Testability Guidance
|
||||
|
||||
- Add integration tests verifying only registered collections replicate.
|
||||
- Smoke test by writing to registered and non-registered collections and confirming expected behavior.
|
||||
- Validate that no unexpected collection appears on remote peers after deployment.
|
||||
|
||||
## Related Security Controls
|
||||
|
||||
- [Security](../security.md)
|
||||
- [Access and Permissions](../access.md)
|
||||
@@ -179,10 +179,10 @@ Choose your strategy:

## Next Steps

- [Architecture Overview](architecture.html) - Understand HLC, Gossip Protocol, and mesh networking
- [Persistence Providers](persistence-providers.html) - Choose the right database for your deployment
- [Deployment Modes](deployment-modes.html) - Single-cluster deployment strategy
- [Security Configuration](security.html) - Encryption and authentication
- [Conflict Resolution Strategies](conflict-resolution.html) - LWW vs Recursive Merge
- [Production Hardening](production-hardening.html) - Best practices and monitoring
- [API Reference](api-reference.html) - Complete API documentation
- [Architecture Overview](architecture.md) - Understand HLC, Gossip Protocol, and mesh networking
- [Persistence Providers](persistence-providers.md) - Choose the right database for your deployment
- [Deployment Modes](deployment-modes.md) - Single-cluster deployment strategy
- [Security Configuration](security.md) - Encryption and authentication
- [Conflict Resolution Strategies](conflict-resolution.md) - LWW vs Recursive Merge
- [Production Hardening](production-hardening.md) - Best practices and monitoring
- [API Reference](api-reference.md) - Complete API documentation
@@ -1,57 +1,44 @@
---
layout: default
---

# CBDDC Documentation

Welcome to the CBDDC documentation for **version 0.9.0**.

## What's New in v0.9.0

Version 0.9.0 brings enhanced ASP.NET Core support, improved EF Core runtime stability, and refined synchronization and persistence layers.

- **ASP.NET Core Enhancements**: Improved sample application with better error handling
- **EF Core Stability**: Fixed runtime issues and improved persistence layer reliability
- **Sync & Persistence**: Enhanced stability across all persistence providers
- **Production Ready**: All features tested and stable for production use

See the [CHANGELOG](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) for complete details.

## Documentation

### Getting Started
* [**Getting Started**](getting-started.html) - Installation, basic setup, and your first CBDDC application

### Core Concepts
* [Architecture](architecture.html) - Understanding HLC, Gossip Protocol, and P2P mesh networking
* [API Reference](api-reference.html) - Complete API documentation and examples
* [Querying](querying.html) - Data querying patterns and LINQ support

### Persistence & Storage
* [Persistence Providers](persistence-providers.html) - SQLite, EF Core, PostgreSQL comparison and configuration
* [Deployment Modes](deployment-modes.html) - Single-cluster deployment strategy

### Networking & Security
* [Security](security.html) - Encryption, authentication, and secure networking
* [Conflict Resolution](conflict-resolution.html) - LWW and Recursive Merge strategies
* [Network Telemetry](network-telemetry.html) - Monitoring and diagnostics
* [Dynamic Reconfiguration](dynamic-reconfiguration.html) - Runtime configuration and leader election
* [Remote Peer Configuration](remote-peer-configuration.html) - Managing remote peers and tracking lifecycle
* [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.html) - Adoption and rollout notes
---
layout: default
---

### Deployment & Operations
* [Deployment (LAN)](deployment-lan.html) - Platform-specific deployment instructions
* [Production Hardening](production-hardening.html) - Configuration, monitoring, and best practices
* [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html) - Operational deprecation/removal workflow

## Previous Versions

- [v0.8.x Documentation](v0.8/getting-started.html)
- [v0.7.x Documentation](v0.7/getting-started.html)
- [v0.6.x Documentation](v0.6/getting-started.html)

## Links

* [GitHub Repository](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
* [NuGet Packages](https://www.nuget.org/packages?q=CBDDC)
* [Changelog](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md)
# CBDDC Documentation

Welcome to the CBDDC documentation site.

## Canonical Operational Docs

- [Architecture](architecture.md)
- [Deployment](deployment.md)
- [Runbook](runbook.md)
- [Security](security.md)
- [Access and Permissions](access.md)
- [Troubleshooting](troubleshooting.md)

## Feature Docs

- [Feature Inventory](features/README.md)
- [Selective Collection Sync](features/selective-collection-sync.md)
- [Peer-to-Peer Gossip Sync](features/peer-to-peer-gossip-sync.md)
- [Secure Peer Transport](features/secure-peer-transport.md)
- [Peer-Confirmed Pruning](features/peer-confirmed-pruning.md)

## Additional References

- [Getting Started](getting-started.md)
- [API Reference](api-reference.md)
- [Persistence Providers](persistence-providers.md)
- [Deployment Modes](deployment-modes.md)
- [Deployment (LAN)](deployment-lan.md)
- [Production Hardening](production-hardening.md)
- [Conflict Resolution](conflict-resolution.md)
- [Network Telemetry](network-telemetry.md)
- [Dynamic Reconfiguration](dynamic-reconfiguration.md)
- [Remote Peer Configuration](remote-peer-configuration.md)
- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
- [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md)
- [Database Sync Manager Design](database-sync-manager-design.md)

## Release History

See the [CHANGELOG](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) for version-level changes.
@@ -104,5 +104,5 @@ Fields stored per peer:
which removes the peer from active prune-gating.
- Re-observed peers can be re-registered and become active for tracking again.
- For rollout steps and production operations, see:
  - [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.html)
  - [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html)
  - [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
  - [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md)

58
docs/runbook.md
Normal file
@@ -0,0 +1,58 @@
# Operations Runbook

This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.

## Ownership and Escalation

- Service owner: CBDDC Core Maintainers.
- First response: local platform/application on-call team for the affected deployment.
- Product issue escalation: open an incident issue in the CBDDC repository with logs and health payload.

## Alert Triage

1. Identify severity based on impact:
   - Sev 1: Data integrity risk, sustained outage, or broad replication failure.
   - Sev 2: Partial sync degradation or prolonged peer lag.
   - Sev 3: Isolated node issue with workaround.
2. Confirm current `cbddc` health check status and payload.
3. Identify affected peers, collections, and first observed time.
4. Apply the relevant recovery play below.

## Core Diagnostics

Capture these artifacts before remediation:

- Health response payload (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`).
- Application logs for sync, persistence, and network components.
- Current runtime configuration (excluding secrets).
- Most recent deployment identifier and change window.

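When capturing the health payload, it can help to pull out just the four fields named above and flag any that are absent. The following is a minimal tooling sketch in Python, not part of CBDDC itself. The field names come from this runbook; the assumption that the payload is a flat JSON object, and the helper name `summarize_health`, are illustrative only.

```python
import json

# Field names are taken from the runbook; their JSON types are not assumed.
HEALTH_FIELDS = (
    "trackedPeerCount",
    "laggingPeers",
    "peersWithNoConfirmation",
    "maxLagMs",
)


def summarize_health(raw: str) -> dict:
    """Extract the runbook's diagnostic fields from a JSON health payload.

    Assumes the fields sit at the top level of the JSON object.
    """
    payload = json.loads(raw)
    summary = {name: payload.get(name) for name in HEALTH_FIELDS}
    # Record absent fields so a partial payload is obvious in the incident notes.
    summary["missingFields"] = [n for n in HEALTH_FIELDS if n not in payload]
    return summary
```

Attaching the resulting summary to the incident issue keeps the escalation report consistent across responders, even if the full payload is also archived.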
## Recovery Plays

### Peer unreachable or lagging

1. Verify network path and auth token consistency.
2. Validate that the peer is still expected in the topology.
3. If the peer is retired, follow [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
4. Recheck health status after remediation.

### Persistence failure

1. Verify storage path and permissions.
2. Run integrity checks.
3. Restore from latest valid backup if corruption is confirmed.
4. Validate replication behavior after restore.

### Configuration drift

1. Compare deployed config to approved baseline.
2. Reapply canonical settings.
3. Restart affected service safely.
4. Verify recovery with health and smoke checks.

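The baseline comparison in the configuration drift play can be sketched as a key-by-key diff that never prints secret values. This is an illustrative Python helper, not a CBDDC tool. It assumes both configurations have already been loaded into flat dictionaries; `AuthToken` is the only key name taken from these docs, and `TcpPort` in the usage lines is hypothetical.

```python
MISSING = "<missing>"
REDACTED = "<redacted>"

# Assumption: keys listed here hold secrets and must never be echoed.
SECRET_KEYS = frozenset({"AuthToken"})


def config_drift(baseline: dict, deployed: dict, secret_keys=SECRET_KEYS) -> dict:
    """Return {key: (expected, actual)} for every key whose value differs."""
    drift = {}
    for key in sorted(set(baseline) | set(deployed)):
        expected = baseline.get(key, MISSING)
        actual = deployed.get(key, MISSING)
        if expected == actual:
            continue
        if key in secret_keys:
            # Report that the secret differs without revealing either value.
            drift[key] = (REDACTED, REDACTED)
        else:
            drift[key] = (expected, actual)
    return drift


# Hypothetical values for illustration only.
baseline = {"TcpPort": 5000, "AuthToken": "expected-token"}
deployed = {"TcpPort": 5001, "AuthToken": "other-token"}
drift = config_drift(baseline, deployed)
```

An empty result means the deployed configuration matches the baseline. Redacting secrets keeps the output safe to paste into an incident issue, in line with the "excluding secrets" rule under Core Diagnostics.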
## Post-Incident Actions

1. Record root cause and timeline.
2. Add follow-up work items (tests, alerts, docs updates).
3. Update affected feature docs and troubleshooting guidance.
4. Confirm rollback and recovery instructions remain accurate.
@@ -140,6 +140,6 @@ A: Security works on all target frameworks (netstandard2.0, net6.0, net8.0) with
---

**See Also:**
- [Getting Started](getting-started.html)
- [Architecture](architecture.html)
- [Production Hardening](production-hardening.html)
- [Getting Started](getting-started.md)
- [Architecture](architecture.md)
- [Production Hardening](production-hardening.md)

86
docs/troubleshooting.md
Normal file
@@ -0,0 +1,86 @@
# Troubleshooting

This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.

## Peer Cannot Connect

Symptoms:

- Node remains disconnected from expected peers.
- Health check reports lagging or unconfirmed peers.

Likely causes:

- Network path blocked (port/firewall mismatch).
- `AuthToken` mismatch.
- Peer configuration drift.

Resolution:

1. Verify TCP/UDP port configuration on both peers.
2. Confirm shared token and node identity settings.
3. Restart peer service and monitor logs.
4. Recheck `cbddc` health payload.

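For the TCP half of the port check, a quick reachability probe from one peer to the other can rule out a blocked network path. The sketch below is generic Python and makes no assumption about CBDDC internals; the host and port you pass in are whatever your deployment configures. UDP is connectionless and cannot be verified this way.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A `False` result from one direction only usually points at a firewall or binding mismatch rather than at the token or node identity settings.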
## Replication Delay or Missing Updates

Symptoms:

- Writes are visible locally but not on remote peers.
- `maxLagMs` grows continuously.

Likely causes:

- Retired peer still tracked and gating pruning.
- High load or transient network instability.
- Invalid collection watch configuration.

Resolution:

1. Confirm affected collections are registered with `WatchCollection()`.
2. Inspect peer confirmation metrics.
3. If needed, de-track retired peers using the runbook.
4. Re-run smoke sync validation after changes.

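One way to make "`maxLagMs` grows continuously" concrete is to sample the value at a fixed interval and check whether every reading exceeds the previous one. The helper below is an illustrative Python sketch; the three-sample minimum and the strict-increase rule are assumptions, not thresholds defined by CBDDC.

```python
def lag_is_growing(samples: list[int], min_samples: int = 3) -> bool:
    """True when each maxLagMs sample is strictly larger than the one before.

    Returns False when fewer than min_samples readings are available,
    since a trend cannot be judged from one or two points.
    """
    if len(samples) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))
```

A transient spike that later recovers will not satisfy this check, which helps separate sustained replication delay from short-lived load or network instability.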
## Persistence Errors

Symptoms:

- Startup or write failures from persistence layer.
- Unhealthy health check due to storage exceptions.

Likely causes:

- File/path permission errors.
- Storage corruption.
- Misconfigured provider settings.

Resolution:

1. Validate storage path and runtime permissions.
2. Run integrity checks.
3. Restore from latest good backup if corruption is detected.
4. Validate read/write and replication after restore.

## Configuration Regressions After Release

Symptoms:

- Behavior changed immediately after deployment.
- Multiple nodes fail with same error pattern.

Likely causes:

- Incorrect environment variables or appsettings values.
- Partial rollout with incompatible settings.

Resolution:

1. Compare deployed configuration to approved baseline.
2. Roll back to last known-good release if production impact is high.
3. Redeploy with corrected configuration.
4. Document root cause and preventive controls.

## Escalation

If a Sev 1/Sev 2 condition cannot be resolved quickly, follow [Runbook](runbook.md) escalation and incident procedures.
@@ -66,4 +66,4 @@ If rollout exposes unexpected persistent degradation:
3. Keep full peer removal for nodes that are confirmed decommissioned.

For a detailed operator procedure, see
[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html).
[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md).