docs: align internal docs to enterprise standards
All checks were successful
CI / verify (push) Successful in 2m33s
Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
164
README.md
@@ -5,7 +5,6 @@
|
|||||||
**Peer-to-Peer Data Synchronization Middleware for .NET**
|
**Peer-to-Peer Data Synchronization Middleware for .NET**
|
||||||
|
|
||||||
[](https://dotnet.microsoft.com/)
|
[](https://dotnet.microsoft.com/)
|
||||||
[](LICENSE)
|
|
||||||
|
|
||||||
## Status
|
## Status
|
||||||

|

|
||||||
@@ -22,9 +21,18 @@ CBDDC is not a database - it's a **sync layer** that plugs into your existing da
|
|||||||
## Table of Contents
|
## Table of Contents
|
||||||
|
|
||||||
- [Overview](#overview)
|
- [Overview](#overview)
|
||||||
|
- [Ownership and Support](#ownership-and-support)
|
||||||
- [Architecture](#architecture)
|
- [Architecture](#architecture)
|
||||||
- [Key Features](#key-features)
|
- [Key Features](#key-features)
|
||||||
- [Installation](#installation)
|
- [Installation](#installation)
|
||||||
|
- [Prerequisites](#prerequisites)
|
||||||
|
- [Configuration and Secrets](#configuration-and-secrets)
|
||||||
|
- [Build, Test, and Quality Gates](#build-test-and-quality-gates)
|
||||||
|
- [Deployment and Rollback](#deployment-and-rollback)
|
||||||
|
- [Operations and Incident Response](#operations-and-incident-response)
|
||||||
|
- [Security and Compliance Posture](#security-and-compliance-posture)
|
||||||
|
- [Troubleshooting](#troubleshooting)
|
||||||
|
- [Change Governance](#change-governance)
|
||||||
- [Quick Start](#quick-start)
|
- [Quick Start](#quick-start)
|
||||||
- [Integrating with Your Database](#integrating-with-your-database)
|
- [Integrating with Your Database](#integrating-with-your-database)
|
||||||
- [Cloud Deployment](#cloud-deployment)
|
- [Cloud Deployment](#cloud-deployment)
|
||||||
@@ -32,7 +40,6 @@ CBDDC is not a database - it's a **sync layer** that plugs into your existing da
|
|||||||
- [Use Cases](#use-cases)
|
- [Use Cases](#use-cases)
|
||||||
- [Documentation](#documentation)
|
- [Documentation](#documentation)
|
||||||
- [Contributing](#contributing)
|
- [Contributing](#contributing)
|
||||||
- [License](#license)
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -50,6 +57,15 @@ Your application continues to read and write to its database as usual. CBDDC wor
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Ownership and Support
|
||||||
|
|
||||||
|
- **Owning team**: CBDDC Core Maintainers
|
||||||
|
- **Issue reporting**: [GitHub Issues](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues)
|
||||||
|
- **Usage/support questions**: [GitHub Discussions](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/discussions)
|
||||||
|
- **Incident escalation for active deployments**: Follow your local on-call process first, then attach CBDDC diagnostics from `docs/runbook.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Architecture
|
## Architecture
|
||||||
|
|
||||||
```
|
```
|
||||||
@@ -163,6 +179,107 @@ dotnet add package CBDDC.Network
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
Before onboarding a new node, verify:
|
||||||
|
|
||||||
|
- .NET SDK/runtime required by selected packages (recommended: .NET 8+)
|
||||||
|
- Network access for configured TCP/UDP sync ports
|
||||||
|
- Persistent storage path with write permissions for database and backups
|
||||||
|
- Access to deployment secrets (cluster `AuthToken`, connection strings, environment-specific config)
|
||||||
|
- Ability to run health checks (`/health` in hosted mode)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Configuration and Secrets
|
||||||
|
|
||||||
|
CBDDC runtime behavior should be configured from environment-specific settings, not hardcoded values.
|
||||||
|
|
||||||
|
- Required runtime settings:
|
||||||
|
- `NodeId`
|
||||||
|
- `TcpPort` (and `UdpPort` when discovery is enabled)
|
||||||
|
- `AuthToken` (secret; do not commit)
|
||||||
|
- Persistence connection/database path
|
||||||
|
- Secret handling guidance:
|
||||||
|
- Keep secrets in a secret manager or protected environment variables
|
||||||
|
- Rotate tokens during maintenance windows and validate mesh convergence after rotation
|
||||||
|
- Avoid exposing tokens in logs, screenshots, or troubleshooting output
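One way to enforce the guidance above is a pre-start check that confirms the required settings are present in the environment without ever printing their values. The sketch below is a minimal illustration, not part of CBDDC: the variable names (`CBDDC_NODE_ID`, `CBDDC_TCP_PORT`, `CBDDC_AUTH_TOKEN`, `CBDDC_DB_PATH`) and both helper functions are hypothetical, so map them to whatever your host application actually reads.

```bash
#!/usr/bin/env bash
# Hypothetical pre-start check. The variable names below are illustrative
# assumptions, not settings defined by CBDDC itself.

# Prints the name of every required variable that is unset or empty.
# Only names are printed, never values, so secrets stay out of logs.
missing_settings() (
  for name in CBDDC_NODE_ID CBDDC_TCP_PORT CBDDC_AUTH_TOKEN CBDDC_DB_PATH; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
)

# Returns 0 when every required setting is present, 1 otherwise.
check_settings() (
  missing=$(missing_settings)
  if [ -n "$missing" ]; then
    printf 'Missing required settings:\n%s\n' "$missing" >&2
    exit 1
  fi
)
```

A deployment script could call `check_settings` before starting the node and abort on a non-zero status. Because the function bodies run in subshells, they leave no helper variables behind in the calling shell.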
|
||||||
|
|
||||||
|
See `docs/security.md` and `docs/production-hardening.md` for detailed control guidance.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Build, Test, and Quality Gates
|
||||||
|
|
||||||
|
Run these checks before merge or release:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
dotnet restore
|
||||||
|
dotnet build
|
||||||
|
dotnet test
|
||||||
|
```
|
||||||
|
|
||||||
|
Recommended release gates:
|
||||||
|
|
||||||
|
- Unit/integration tests pass
|
||||||
|
- Health endpoint reports healthy in staging after deployment
|
||||||
|
- No unresolved high-severity issues in deployment or runbook checklists
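The `dotnet` commands above can be chained so that the first failing gate stops the sequence and is named in the output. This is a minimal sketch under that assumption; `run_gate` is a hypothetical helper, not a CBDDC or .NET tool.

```bash
#!/usr/bin/env bash
# Minimal gate runner sketch: run one named gate and report the result.
# run_gate is a hypothetical helper, not part of any CBDDC tooling.

# usage: run_gate <name> <command...>
run_gate() {
  gate_name=$1
  shift
  if "$@"; then
    printf 'PASS %s\n' "$gate_name"
  else
    printf 'FAIL %s\n' "$gate_name" >&2
    return 1
  fi
}

# Example wiring for the commands listed in this section (left commented so
# that sourcing this file does not start a build):
# run_gate restore dotnet restore &&
#   run_gate build dotnet build &&
#   run_gate test dotnet test
```

Chaining with `&&` means a later gate never runs after an earlier one fails, which keeps CI logs short and makes the failing step obvious.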
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Deployment and Rollback
|
||||||
|
|
||||||
|
Canonical deployment procedures are documented in:
|
||||||
|
|
||||||
|
- `docs/deployment.md` (promotion flow, validation gates, rollback path)
|
||||||
|
- `docs/deployment-lan.md` (LAN deployment details)
|
||||||
|
- `docs/deployment-modes.md` (hosted mode behavior)
|
||||||
|
- `docs/production-hardening.md` (operational hardening controls)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Operations and Incident Response
|
||||||
|
|
||||||
|
Use `docs/runbook.md` as the primary operational runbook. It includes:
|
||||||
|
|
||||||
|
- Alert triage sequence
|
||||||
|
- Escalation path and severity mapping
|
||||||
|
- Recovery and verification steps
|
||||||
|
- Post-incident follow-up expectations
|
||||||
|
|
||||||
|
For peer lifecycle incidents, use `docs/peer-deprecation-removal-runbook.md`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Security and Compliance Posture
|
||||||
|
|
||||||
|
- Primary protections: authenticated peer communication and encrypted transport options in CBDDC networking components
|
||||||
|
- Sensitive data handling: classify data by environment and apply least privilege on runtime access
|
||||||
|
- Operational controls: hardening, key rotation, and monitoring requirements are documented in `docs/security.md` and `docs/access.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
Common issue patterns and step-by-step remediation are documented in `docs/troubleshooting.md`.
|
||||||
|
|
||||||
|
High-priority troubleshooting topics:
|
||||||
|
|
||||||
|
- Peer unreachable or lagging confirmation state
|
||||||
|
- Persistence faults and backup restore flow
|
||||||
|
- Authentication mismatch or configuration drift
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Change Governance
|
||||||
|
|
||||||
|
- Use short-lived branches and pull requests for every change
|
||||||
|
- Require review before merge to protected branches
|
||||||
|
- Run build/test/health validation before release
|
||||||
|
- Record release-impacting changes in `CHANGELOG.md` and link related deployment notes
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Quick Start
|
## Quick Start
|
||||||
|
|
||||||
### 1. Define Your Database Context
|
### 1. Define Your Database Context
|
||||||
@@ -608,13 +725,14 @@ Console.WriteLine($"Peers: {status.ConnectedPeers}");
|
|||||||
|
|
||||||
- **[Architecture & Concepts](docs/architecture.md)** - HLC, Gossip, Vector Clocks, Hash Chains
|
- **[Architecture & Concepts](docs/architecture.md)** - HLC, Gossip, Vector Clocks, Hash Chains
|
||||||
- **[Conflict Resolution](docs/conflict-resolution.md)** - LWW vs Recursive Merge
|
- **[Conflict Resolution](docs/conflict-resolution.md)** - LWW vs Recursive Merge
|
||||||
- **[Oplog & CDC](docs/oplog-cdc.md)** - How change tracking works
|
- **[Oplog & CDC Design](docs/database-sync-manager-design.md)** - How change tracking works
|
||||||
|
|
||||||
### Deployment
|
### Deployment
|
||||||
|
|
||||||
- **[Production Guide](docs/production-hardening.md)** - Configuration, monitoring, best practices
|
- **[Production Guide](docs/production-hardening.md)** - Configuration, monitoring, best practices
|
||||||
- **[Cloud Deployment](docs/cloud-deployment.md)** - ASP.NET Core hosting
|
- **[Deployment Runbook](docs/deployment.md)** - Promotion flow, validation, rollback
|
||||||
- **[Deployment Modes](docs/deployment-modes.md)** - Single-cluster deployment strategy
|
- **[Deployment Modes](docs/deployment-modes.md)** - Single-cluster deployment strategy
|
||||||
|
- **[Operations Runbook](docs/runbook.md)** - Incident triage and recovery
|
||||||
|
|
||||||
### API
|
### API
|
||||||
|
|
||||||
@@ -711,41 +829,3 @@ dotnet run
|
|||||||
### Code of Conduct
|
### Code of Conduct
|
||||||
|
|
||||||
Be respectful, inclusive, and constructive. We're all here to learn and build great software together.
|
Be respectful, inclusive, and constructive. We're all here to learn and build great software together.
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## License
|
|
||||||
|
|
||||||
CBDDC is licensed under the **MIT License**.
|
|
||||||
|
|
||||||
```
|
|
||||||
MIT License
|
|
||||||
|
|
||||||
Copyright (c) 2026 MrDevRobot
|
|
||||||
|
|
||||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
||||||
of this software and associated documentation files (the "Software"), to deal
|
|
||||||
in the Software without restriction, including without limitation the rights
|
|
||||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
||||||
copies of the Software...
|
|
||||||
```
|
|
||||||
|
|
||||||
See [LICENSE](LICENSE) file for full details.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Give it a Star!
|
|
||||||
|
|
||||||
If you find CBDDC useful, please **give it a star** on GitHub! It helps others discover the project and motivates us to keep improving it.
|
|
||||||
|
|
||||||
<div align="center">
|
|
||||||
|
|
||||||
### [Star on GitHub](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
|
|
||||||
|
|
||||||
**Thank you for your support!**
|
|
||||||
|
|
||||||
**Built with care for the .NET community**
|
|
||||||
|
|
||||||
[Report Bug](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) | [Request Feature](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) | [Discussions](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/discussions)
|
|
||||||
|
|
||||||
</div>
|
|
||||||
|
|||||||
116
docs/README.md
@@ -1,73 +1,45 @@
|
|||||||
# CBDDC Documentation
|
# CBDDC Documentation Index
|
||||||
|
|
||||||
This folder contains the official documentation for CBDDC, published as GitHub Pages.
|
|
||||||
|
|
||||||
## Documentation Structure (v0.9.0)
|
|
||||||
|
|
||||||
### Getting Started
|
|
||||||
- **[Getting Started](getting-started.md)** - Installation, setup, and first steps with CBDDC
|
|
||||||
|
|
||||||
### Core Documentation
|
|
||||||
- **[Architecture](architecture.md)** - Hybrid Logical Clocks, Gossip Protocol, mesh networking
|
|
||||||
- **[API Reference](api-reference.md)** - Complete API documentation with examples
|
|
||||||
- **[Querying](querying.md)** - Data querying patterns and LINQ support
|
|
||||||
|
|
||||||
### Persistence & Storage
|
|
||||||
- **[Persistence Providers](persistence-providers.md)** - SQLite, EF Core, PostgreSQL comparison
|
|
||||||
- **[Deployment Modes](deployment-modes.md)** - Single-cluster deployment strategy
|
|
||||||
|
|
||||||
### Networking & Security
|
|
||||||
- **[Security](security.md)** - Encryption, authentication, secure networking
|
|
||||||
- **[Conflict Resolution](conflict-resolution.md)** - LWW and Recursive Merge strategies
|
|
||||||
- **[Network Telemetry](network-telemetry.md)** - Monitoring and diagnostics
|
|
||||||
- **[Dynamic Reconfiguration](dynamic-reconfiguration.md)** - Runtime configuration changes
|
|
||||||
- **[Remote Peer Configuration](remote-peer-configuration.md)** - Managing remote peers and tracking lifecycle
|
|
||||||
- **[Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)** - Rollout notes and adoption checklist
|
|
||||||
|
|
||||||
### Deployment & Operations
|
This folder contains operational and engineering documentation for CBDDC.
|
||||||
- **[Deployment (LAN)](deployment-lan.md)** - Platform-specific deployment guide
|
|
||||||
- **[Production Hardening](production-hardening.md)** - Configuration, monitoring, best practices
|
## Core Documentation
|
||||||
- **[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md)** - Operational workflow for de-tracking and removal
|
|
||||||
|
- [Architecture](architecture.md)
|
||||||
## Building the Documentation
|
- [Deployment](deployment.md)
|
||||||
|
- [Runbook](runbook.md)
|
||||||
This documentation uses Jekyll with the Cayman theme and is automatically published via GitHub Pages.
|
- [Security](security.md)
|
||||||
|
- [Access and Permissions](access.md)
|
||||||
### Local Development
|
- [Troubleshooting](troubleshooting.md)
|
||||||
|
|
||||||
```bash
|
## Feature Documentation
|
||||||
# Install Jekyll (requires Ruby)
|
|
||||||
gem install bundler jekyll
|
- [Feature Inventory](features/README.md)
|
||||||
|
- [Selective Collection Sync](features/selective-collection-sync.md)
|
||||||
# Serve documentation locally
|
- [Peer-to-Peer Gossip Sync](features/peer-to-peer-gossip-sync.md)
|
||||||
cd docs
|
- [Secure Peer Transport](features/secure-peer-transport.md)
|
||||||
jekyll serve
|
- [Peer-Confirmed Pruning](features/peer-confirmed-pruning.md)
|
||||||
|
|
||||||
# Open http://localhost:4000
|
## Additional Technical References
|
||||||
```
|
|
||||||
|
- [Getting Started](getting-started.md)
|
||||||
### Site Configuration
|
- [API Reference](api-reference.md)
|
||||||
|
- [Conflict Resolution](conflict-resolution.md)
|
||||||
- **_config.yml** - Jekyll configuration and site metadata
|
- [Persistence Providers](persistence-providers.md)
|
||||||
- **_layouts/default.html** - Main page layout with navigation and styling
|
- [Network Telemetry](network-telemetry.md)
|
||||||
- **_includes/nav.html** - Top navigation bar
|
- [Dynamic Reconfiguration](dynamic-reconfiguration.md)
|
||||||
- **_data/navigation.yml** - Sidebar navigation structure per version
|
- [Remote Peer Configuration](remote-peer-configuration.md)
|
||||||
|
- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
|
||||||
## Version History
|
- [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md)
|
||||||
|
- [Database Sync Manager Design](database-sync-manager-design.md)
|
||||||
The documentation supports multiple versions:
|
|
||||||
- **v0.9** (current) - Latest stable release
|
## Deployment-Specific Guides
|
||||||
- **v0.8** - ASP.NET Core hosting, EF Core, PostgreSQL support
|
|
||||||
- **v0.7** - Brotli compression, Protocol v4
|
- [Deployment Modes](deployment-modes.md)
|
||||||
- **v0.6** - Secure networking, conflict resolution strategies
|
- [Deployment (LAN)](deployment-lan.md)
|
||||||
|
- [Production Hardening](production-hardening.md)
|
||||||
## Contributing to Documentation
|
|
||||||
|
## Documentation Conventions
|
||||||
1. **Fix typos or improve clarity** - Submit PRs directly
|
|
||||||
2. **Add new guides** - Create a new .md file and update navigation.yml
|
- Use lowercase kebab-case for markdown files.
|
||||||
3. **Test locally** - Run Jekyll locally to preview changes before submitting
|
- Keep canonical operational docs at the top level under `docs/`.
|
||||||
|
- Store major feature documentation in `docs/features/` and keep `docs/features/README.md` updated.
|
||||||
## Links
|
|
||||||
|
|
||||||
- [Main Repository](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
|
|
||||||
- [NuGet Packages](https://www.nuget.org/packages?q=CBDDC)
|
|
||||||
|
|||||||
44
docs/access.md
Normal file
@@ -0,0 +1,44 @@
# Access and Permissions

This document defines the least-privilege access model for CBDDC environments.

## Roles

| Role | Typical Permissions | Approval Required |
|------|---------------------|-------------------|
| Runtime Operator | Read health/logs, restart service, run incident checks | Team lead or on-call manager |
| Deployment Engineer | Deploy approved releases, update runtime configuration | Change approval for production |
| Security Administrator | Manage secrets, rotate tokens, review access | Security approval |
| Maintainer | Modify CBDDC source/docs, merge reviewed changes | Pull request review |

## Least-Privilege Rules

- Grant access by role, not by individual preference.
- Use environment-specific credentials and scoped service accounts.
- Do not share production credentials across environments.
- Remove elevated access promptly after the incident or change window closes.

## Approval Flow

1. Request access with role, environment, and business reason.
2. Approver validates least-privilege scope.
3. Access is granted with an expiration date when applicable.
4. Grant/revoke events are logged for auditability.

## Periodic Access Review

- Review active privileged access at least quarterly.
- Remove dormant or unowned accounts immediately.
- Validate that emergency access accounts are controlled and monitored.

## Secret Handling

- Store `AuthToken`, connection strings, and credentials in approved secret stores.
- Never commit secrets to source control.
- Rotate secrets after incidents and on a scheduled cadence.

## Related Documents

- [Security](security.md)
- [Runbook](runbook.md)
- [Production Hardening](production-hardening.md)
19
docs/adr/0001-peer-mesh-topology.md
Normal file
@@ -0,0 +1,19 @@
# ADR 0001: Peer Mesh Topology

## Status

Accepted

## Context

CBDDC requires local-first synchronization without a central coordinator in LAN-focused deployments.

## Decision

Adopt a peer mesh topology with gossip-based synchronization and per-node oplog tracking.

## Consequences

- Improves resilience when individual nodes disconnect.
- Requires strong operational tooling for lag, peer lifecycle, and pruning controls.
- Increases the importance of clear runbook and troubleshooting documentation.
@@ -246,6 +246,6 @@ A: No. Conflict resolution assumes both documents have compatible schemas.
|
|||||||
---
|
---
|
||||||
|
|
||||||
**See Also:**
|
**See Also:**
|
||||||
- [Getting Started](getting-started.html)
|
- [Getting Started](getting-started.md)
|
||||||
- [Architecture](architecture.html)
|
- [Architecture](architecture.md)
|
||||||
- [Security](security.html)
|
- [Security](security.md)
|
||||||
|
|||||||
61
docs/deployment.md
Normal file
@@ -0,0 +1,61 @@
|
|||||||
|
# Deployment
|
||||||
|
|
||||||
|
This document defines the canonical CBDDC deployment workflow, validation gates, and rollback procedures.
|
||||||
|
|
||||||
|
## Environments
|
||||||
|
|
||||||
|
- Local development: single node for functional verification.
|
||||||
|
- Staging: multi-node mesh or hosted mode with production-like configuration.
|
||||||
|
- Production: controlled rollout with health checks, monitoring, and rollback readiness.
|
||||||
|
|
||||||
|
## Promotion Workflow
|
||||||
|
|
||||||
|
1. Build and test in CI (`dotnet build`, `dotnet test`).
|
||||||
|
2. Validate configuration and secrets for the target environment.
|
||||||
|
3. Deploy to staging.
|
||||||
|
4. Run smoke checks:
|
||||||
|
- Node starts and joins expected peers.
|
||||||
|
- `/health` reports healthy.
|
||||||
|
- Sync operations replicate across target collections.
|
||||||
|
5. Promote to production during an approved change window.
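The `/health` smoke check in step 4 is often scripted as a bounded retry, because a node may need a few seconds after startup before it reports healthy. The sketch below shows one way to do that; the helper name `wait_until_ok`, the attempt counts, and the example host and port are assumptions for illustration, not values defined by CBDDC.

```bash
#!/usr/bin/env bash
# Sketch of a smoke-check helper: retry a probe command until it succeeds or
# the attempt budget runs out. The helper name and the example URL are
# assumptions for illustration only.

# usage: wait_until_ok <attempts> <delay_seconds> <command...>
wait_until_ok() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      exit 0
    fi
    # Sleep only between attempts, not after the final failure.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  exit 1
)

# Example with a hypothetical host and port:
# wait_until_ok 10 3 curl -fsS http://localhost:5000/health
```

Passing the probe as a command keeps the helper reusable: the same function can wrap a `curl` call against `/health` or a script that verifies a test document has replicated.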
|
||||||
|
|
||||||
|
## Validation Gates
|
||||||
|
|
||||||
|
- No failed test suites.
|
||||||
|
- No unresolved critical incidents.
|
||||||
|
- Health check reports healthy, or an approved degraded state with mitigation in place.
|
||||||
|
- Backup/restore path confirmed for the active persistence provider.
|
||||||
|
|
||||||
|
## Rollback Triggers
|
||||||
|
|
||||||
|
Rollback immediately when any of the following occurs:
|
||||||
|
|
||||||
|
- Sync correctness regression (missed or duplicate replication events).
|
||||||
|
- Persistent unhealthy status after remediation attempt.
|
||||||
|
- Authentication or connectivity failure across required peers.
|
||||||
|
- Data integrity concern reported by validation checks.
|
||||||
|
|
||||||
|
## Rollback Procedure
|
||||||
|
|
||||||
|
1. Stop traffic or disable rollout to newly deployed nodes.
|
||||||
|
2. Revert to the last known-good build.
|
||||||
|
3. Restore previous configuration and secret set.
|
||||||
|
4. If needed, restore persistence snapshot/backup.
|
||||||
|
5. Re-run health and replication smoke checks.
|
||||||
|
6. Record incident details in the runbook timeline.
|
||||||
|
|
||||||
|
## Emergency Changes
|
||||||
|
|
||||||
|
Use emergency deployment only for incident containment:
|
||||||
|
|
||||||
|
1. Capture incident reference and approver.
|
||||||
|
2. Apply a minimally scoped change.
|
||||||
|
3. Run abbreviated smoke checks (startup, health, critical replication path).
|
||||||
|
4. Follow up with standard post-incident review.
|
||||||
|
|
||||||
|
## Related Documents
|
||||||
|
|
||||||
|
- [Deployment Modes](deployment-modes.md)
|
||||||
|
- [Deployment (LAN)](deployment-lan.md)
|
||||||
|
- [Production Hardening](production-hardening.md)
|
||||||
|
- [Runbook](runbook.md)
|
||||||
16
docs/features/README.md
Normal file
@@ -0,0 +1,16 @@
# Major Feature Inventory

This index tracks CBDDC major functionality. Each feature has one canonical document.

## Features

- [Selective Collection Sync](selective-collection-sync.md)
- [Peer-to-Peer Gossip Sync](peer-to-peer-gossip-sync.md)
- [Secure Peer Transport](secure-peer-transport.md)
- [Peer-Confirmed Pruning](peer-confirmed-pruning.md)

## Maintenance Rules

- Keep one file per major feature.
- Update feature docs when APIs, operations, or security controls change.
- Cross-link each feature to [Runbook](../runbook.md) and [Security](../security.md).
|
||||||
68
docs/features/peer-confirmed-pruning.md
Normal file
@@ -0,0 +1,68 @@
|
|||||||
|
# Feature: Peer-Confirmed Pruning
|
||||||
|
|
||||||
|
## Purpose and Business Outcome
|
||||||
|
|
||||||
|
Prune oplog history safely while preventing data loss for active peers that have not confirmed required streams.
|
||||||
|
|
||||||
|
## Scope and Non-Goals
|
||||||
|
|
||||||
|
Scope:
|
||||||
|
|
||||||
|
- Retention cutoff plus peer confirmation cutoff logic.
|
||||||
|
- Operational controls for peer tracking and de-tracking.
|
||||||
|
|
||||||
|
Non-goals:
|
||||||
|
|
||||||
|
- Automatic removal of retired peers without operator action.
|
||||||
|
- Replacement for backup/restore strategy.
|
||||||
|
|
||||||
|
## User and System Workflows
|
||||||
|
|
||||||
|
1. Maintenance job calculates retention and confirmation cutoffs.
|
||||||
|
2. System blocks pruning when required confirmations are missing.
|
||||||
|
3. Operator de-tracks retired peers when appropriate.
|
||||||
|
4. Pruning resumes once constraints are satisfied.
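The workflow above can be illustrated with a small calculation. The actual CBDDC cutoff logic is not shown in this document, so the sketch below rests on an assumption: the safe prune point is the older of the retention cutoff and the oldest confirmation among tracked peers, and a tracked peer with no confirmation blocks pruning entirely. The function name and the `-` placeholder are hypothetical.

```bash
#!/usr/bin/env bash
# Illustrative cutoff calculation only. This assumes the safe prune point is
# the older of the retention cutoff and the oldest confirmation among tracked
# peers. It is not the CBDDC implementation.

# usage: effective_cutoff <retention_cutoff> <peer_confirmation...>
# Each confirmation is an integer timestamp, or "-" for a tracked peer that
# has not confirmed yet. Prints the cutoff, or returns 1 when pruning is blocked.
effective_cutoff() (
  cutoff=$1
  shift
  for confirmed in "$@"; do
    if [ "$confirmed" = "-" ]; then
      echo "blocked: a tracked peer has no confirmation" >&2
      exit 1
    fi
    if [ "$confirmed" -lt "$cutoff" ]; then
      cutoff=$confirmed
    fi
  done
  echo "$cutoff"
)
```

Under this model, `effective_cutoff 1000 1200 900` yields `900` because one peer has only confirmed up to that point, while `effective_cutoff 1000 1200 -` is blocked. De-tracking a retired peer (step 3) corresponds to dropping its entry from the argument list, which is why pruning can then resume (step 4).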
|
||||||
|
|
||||||
|
## Interfaces, APIs, and Events Involved
|
||||||
|
|
||||||
|
- Maintenance pruning scheduler
|
||||||
|
- Peer tracking operations (`RemovePeerTrackingAsync`, `RemoveRemotePeerAsync`)
|
||||||
|
- Health metrics for lag and missing confirmations
|
||||||
|
|
||||||
|
## Permissions and Data Handling
|
||||||
|
|
||||||
|
- Only approved operators should modify peer tracking state.
|
||||||
|
- Pruning decisions must be auditable through logs and incident notes.
|
||||||
|
|
||||||
|
## Dependencies and Failure Modes
|
||||||
|
|
||||||
|
Dependencies:
|
||||||
|
|
||||||
|
- Accurate peer tracking metadata
|
||||||
|
- Timely confirmation updates per source stream
|
||||||
|
|
||||||
|
Failure modes:
|
||||||
|
|
||||||
|
- Pruning blocked indefinitely by stale peer tracking
|
||||||
|
- Unsafe pruning if controls are bypassed
|
||||||
|
|
||||||
|
## Monitoring, Alerts, and Troubleshooting Pointers
|
||||||
|
|
||||||
|
- Monitor peers with missing confirmations and sustained lag.
|
||||||
|
- Use [Peer Deprecation and Removal Runbook](../peer-deprecation-removal-runbook.md) for operational actions.
|
||||||
|
|
||||||
|
## Rollout and Change Considerations
|
||||||
|
|
||||||
|
- Introduce pruning policy changes with explicit maintenance windows.
|
||||||
|
- Validate expected cutoff behavior in staging before production rollout.
|
||||||
|
|
||||||
|
## Validation and Testability Guidance
|
||||||
|
|
||||||
|
- Add tests for blocked prune when confirmations are missing.
|
||||||
|
- Add tests for resumed prune after de-tracking retired peers.
|
||||||
|
- Smoke test health status transitions around peer lifecycle changes.
|
||||||
|
|
||||||
|
## Related Security Controls
|
||||||
|
|
||||||
|
- [Security](../security.md)
|
||||||
|
- [Runbook](../runbook.md)
|
||||||
70
docs/features/peer-to-peer-gossip-sync.md
Normal file
@@ -0,0 +1,70 @@
|
|||||||
|
# Feature: Peer-to-Peer Gossip Sync
|
||||||
|
|
||||||
|
## Purpose and Business Outcome
|
||||||
|
|
||||||
|
Propagate updates across mesh nodes without a central coordinator so local-first applications remain resilient to intermittent connectivity.
|
||||||
|
|
||||||
|
## Scope and Non-Goals
|
||||||
|
|
||||||
|
Scope:
|
||||||
|
|
||||||
|
- Peer discovery and sync orchestration.
|
||||||
|
- Push/pull propagation of oplog changes.
|
||||||
|
|
||||||
|
Non-goals:
|
||||||
|
|
||||||
|
- Strong global consistency guarantees.
|
||||||
|
- Public internet exposure without additional network controls.
|
||||||
|
|
||||||
|
## User and System Workflows
|
||||||
|
|
||||||
|
1. Node starts and discovers peers.
|
||||||
|
2. Node exchanges sync metadata with connected peers.
|
||||||
|
3. Missing operations are requested and applied.
|
||||||
|
4. Mesh converges over repeated gossip rounds.
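Steps 2 and 3 amount to comparing version vectors and requesting whatever the local node lacks. The sketch below shows that comparison in its simplest form, assuming each clock is a list of `node=counter` pairs with plain integer counters. CBDDC's real metadata (for example its HLC timestamps) is richer than this, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Simplified version-vector comparison. Clocks are written as space-separated
# "node=counter" pairs. This format and these helpers are assumptions for
# illustration, not CBDDC's wire format.

# usage: clock_get "<clock>" <node>   prints the counter for node, or 0
clock_get() (
  for entry in $1; do
    if [ "${entry%%=*}" = "$2" ]; then
      echo "${entry#*=}"
      exit 0
    fi
  done
  echo 0
)

# usage: missing_ops "<local clock>" "<remote clock>"
# Prints one "node first..last" line per range the local node should request.
missing_ops() (
  for entry in $2; do
    node=${entry%%=*}
    remote=${entry#*=}
    local_seq=$(clock_get "$1" "$node")
    if [ "$remote" -gt "$local_seq" ]; then
      echo "$node $((local_seq + 1))..$remote"
    fi
  done
)
```

For a local clock of `a=5 b=3` and a remote clock of `a=5 b=7 c=2`, `missing_ops` reports `b 4..7` and `c 1..2`. Running the same comparison in both directions on every gossip round is what lets the mesh converge (step 4).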
|
||||||
|
|
||||||
|
## Interfaces, APIs, and Events Involved
|
||||||
|
|
||||||
|
- Sync orchestrator scheduling
|
||||||
|
- TCP peer sync channels
|
||||||
|
- Vector clock exchange and reconciliation
|
||||||
|
|
||||||
|
## Permissions and Data Handling
|
||||||
|
|
||||||
|
- Peers with a valid authentication token can exchange replicated collection data.
|
||||||
|
- Cluster membership should be restricted to trusted nodes.
|
||||||
|
|
||||||
|
## Dependencies and Failure Modes
|
||||||
|
|
||||||
|
Dependencies:
|
||||||
|
|
||||||
|
- Network reachability
|
||||||
|
- Shared authentication material
|
||||||
|
- Healthy persistence layer
|
||||||
|
|
||||||
|
Failure modes:
|
||||||
|
|
||||||
|
- Peer isolation due to network outage
|
||||||
|
- Token mismatch blocking synchronization
|
||||||
|
- Sustained lag under high write pressure
|
||||||
|
|
||||||
|
## Monitoring, Alerts, and Troubleshooting Pointers
|
||||||
|
|
||||||
|
- Monitor `laggingPeers`, `maxLagMs`, and active peer counts.
|
||||||
|
- Follow [Runbook](../runbook.md) playbooks for lagging or disconnected peers.
|
||||||
|
|
||||||
|
## Rollout and Change Considerations
|
||||||
|
|
||||||
|
- Roll out protocol-affecting changes in a controlled window.
|
||||||
|
- Confirm backward/forward compatibility in staging mesh before production rollout.
|
||||||
|
|
||||||
|
## Validation and Testability Guidance
|
||||||
|
|
||||||
|
- Run multi-node integration tests with controlled partitions.
|
||||||
|
- Validate eventual convergence after reconnect.
|
||||||
|
- Verify no data loss under repeated reconnect scenarios.
|
||||||
|
|
||||||
|
## Related Security Controls
|
||||||
|
|
||||||
|
- [Security](../security.md)
|
||||||
|
- [Access and Permissions](../access.md)
|
||||||
67
docs/features/secure-peer-transport.md
Normal file
@@ -0,0 +1,67 @@
|
|||||||
|
# Feature: Secure Peer Transport
|
||||||
|
|
||||||
|
## Purpose and Business Outcome
|
||||||
|
|
||||||
|
Protect replicated data in transit with authenticated and encrypted peer communication.
|
||||||
|
|
||||||
|
## Scope and Non-Goals
|
||||||
|
|
||||||
|
Scope:
|
||||||
|
|
||||||
|
- Secure handshake and key establishment.
|
||||||
|
- Message confidentiality and integrity controls.
|
||||||
|
|
||||||
|
Non-goals:
|
||||||
|
|
||||||
|
- Data-at-rest encryption.
|
||||||
|
- Full identity and certificate lifecycle management.
|
||||||
|
|
||||||
|
## User and System Workflows
|
||||||
|
|
||||||
|
1. Operator enables secure transport components.
|
||||||
|
2. Peers perform handshake and establish session keys.
|
||||||
|
3. Replication traffic is encrypted/authenticated.
|
||||||
|
4. Health and logs expose secure mode status.
|
||||||
|
|
||||||
|
## Interfaces, APIs, and Events Involved
|
||||||
|
|
||||||
|
- `IPeerHandshakeService` / secure handshake implementation
|
||||||
|
- Network pipeline message encryption and HMAC validation
|
||||||
|
- Startup configuration for secure mode
|
||||||
|
|
||||||
|
## Permissions and Data Handling
|
||||||
|
|
||||||
|
- Secret material (`AuthToken`, key inputs) must be restricted to authorized operators.
|
||||||
|
- Logs must avoid plaintext secret disclosure.
|
||||||
|
|
||||||
|
## Dependencies and Failure Modes
|
||||||
|
|
||||||
|
Dependencies:
|
||||||
|
|
||||||
|
- Consistent security mode across peers
|
||||||
|
- Valid runtime cryptographic dependencies
|
||||||
|
|
||||||
|
Failure modes:
|
||||||
|
|
||||||
|
- Secure/plaintext mode mismatch
|
||||||
|
- Handshake failure due to key/token mismatch
|
||||||
|
|
||||||
|
## Monitoring, Alerts, and Troubleshooting Pointers
|
||||||
|
|
||||||
|
- Alert on repeated handshake failures.
|
||||||
|
- Use [Runbook](../runbook.md) for incident triage and [Troubleshooting](../troubleshooting.md) for remediation.
|
||||||
|
|
||||||
|
## Rollout and Change Considerations
|
||||||
|
|
||||||
|
- Enable secure mode in staging first.
|
||||||
|
- Roll production nodes in controlled order to avoid mixed-mode partitions.
|
||||||
|
|
||||||
|
## Validation and Testability Guidance
|
||||||
|
|
||||||
|
- Add tests for secure-to-secure success and mixed-mode rejection.
|
||||||
|
- Validate encrypted cluster startup and sync with production-like load.
|
||||||
|
|
||||||
|
## Related Security Controls
|
||||||
|
|
||||||
|
- [Security](../security.md)
|
||||||
|
- [Access and Permissions](../access.md)
|
||||||
69
docs/features/selective-collection-sync.md
Normal file
@@ -0,0 +1,69 @@
# Feature: Selective Collection Sync

## Purpose and Business Outcome

Allow teams to replicate only selected collections so bandwidth and operational overhead stay aligned to business-critical data.

## Scope and Non-Goals

Scope:

- Register collections for replication using `WatchCollection()`.
- Replicate changes for registered collections across peers.

Non-goals:

- Automatic replication of all database collections.
- Schema migration management.

## User and System Workflows

1. Developer registers target collections in the document store.
2. Local writes trigger CDC events.
3. Oplog entries propagate through peer sync.
4. Remote peers apply updates for matching collections.
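
The four workflow steps can be modeled with a small toy peer. This is a conceptual sketch in Python, not the CBDDC API: the `Peer` class and its methods are hypothetical, the toy `watch_collection` takes only a name (the real `WatchCollection` takes a collection and key selector as well), and the CDC trigger is reduced to a direct check inside `write`.

```python
class Peer:
    """Toy peer: only collections registered via watch_collection are synced."""

    def __init__(self):
        self.watched = set()
        self.data = {}   # collection name -> {key: document}
        self.oplog = []  # (collection, key, document) entries produced locally

    def watch_collection(self, name: str) -> None:
        """Step 1: register a collection for replication."""
        self.watched.add(name)
        self.data.setdefault(name, {})

    def write(self, collection: str, key: str, doc: dict) -> None:
        """Step 2: a local write emits an oplog entry only for watched collections."""
        self.data.setdefault(collection, {})[key] = doc
        if collection in self.watched:
            self.oplog.append((collection, key, doc))

    def apply_remote(self, entries) -> None:
        """Steps 3-4: apply entries from another peer, skipping unwatched collections."""
        for collection, key, doc in entries:
            if collection in self.watched:
                self.data[collection][key] = doc
```

Writing to an unregistered collection on one peer leaves the other peer untouched, which is the behavior the smoke test later in this document checks for.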

## Interfaces, APIs, and Events Involved

- `WatchCollection(collectionName, collection, keySelector)`
- CDC trigger pipeline
- Oplog append and apply operations

## Permissions and Data Handling

- Access to source collections is controlled by host application permissions.
- Only approved collections should be registered for sync in sensitive environments.

## Dependencies and Failure Modes

Dependencies:

- Correct collection registration
- Stable peer connectivity
- Persistence availability

Failure modes:

- Missed replication due to unregistered collection
- Delayed propagation during network partition

## Monitoring, Alerts, and Troubleshooting Pointers

- Monitor replication lag and peer confirmation metrics.
- Use [Runbook](../runbook.md) and [Troubleshooting](../troubleshooting.md) for incident response.

## Rollout and Change Considerations

- Introduce new synced collections behind staged rollout.
- Validate downstream consumer compatibility before production enablement.

## Validation and Testability Guidance

- Add integration tests verifying only registered collections replicate.
- Smoke test by writing to registered and non-registered collections and confirming expected behavior.
- Validate no unexpected collection appears in remote peers after deployment.

## Related Security Controls

- [Security](../security.md)
- [Access and Permissions](../access.md)
@@ -179,10 +179,10 @@ Choose your strategy:

 ## Next Steps

-- [Architecture Overview](architecture.html) - Understand HLC, Gossip Protocol, and mesh networking
-- [Persistence Providers](persistence-providers.html) - Choose the right database for your deployment
-- [Deployment Modes](deployment-modes.html) - Single-cluster deployment strategy
-- [Security Configuration](security.html) - Encryption and authentication
-- [Conflict Resolution Strategies](conflict-resolution.html) - LWW vs Recursive Merge
-- [Production Hardening](production-hardening.html) - Best practices and monitoring
-- [API Reference](api-reference.html) - Complete API documentation
+- [Architecture Overview](architecture.md) - Understand HLC, Gossip Protocol, and mesh networking
+- [Persistence Providers](persistence-providers.md) - Choose the right database for your deployment
+- [Deployment Modes](deployment-modes.md) - Single-cluster deployment strategy
+- [Security Configuration](security.md) - Encryption and authentication
+- [Conflict Resolution Strategies](conflict-resolution.md) - LWW vs Recursive Merge
+- [Production Hardening](production-hardening.md) - Best practices and monitoring
+- [API Reference](api-reference.md) - Complete API documentation
@@ -1,57 +1,44 @@
 ---
 layout: default
 ---

-# CBDDC Documentation
-
-Welcome to the CBDDC documentation for **version 0.9.0**.
-
-## What's New in v0.9.0
-
-Version 0.9.0 brings enhanced ASP.NET Core support, improved EF Core runtime stability, and refined synchronization and persistence layers.
-
-- **ASP.NET Core Enhancements**: Improved sample application with better error handling
-- **EF Core Stability**: Fixed runtime issues and improved persistence layer reliability
-- **Sync & Persistence**: Enhanced stability across all persistence providers
-- **Production Ready**: All features tested and stable for production use
-
-See the [CHANGELOG](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) for complete details.
-
-## Documentation
-
-### Getting Started
-* [**Getting Started**](getting-started.html) - Installation, basic setup, and your first CBDDC application
-
-### Core Concepts
-* [Architecture](architecture.html) - Understanding HLC, Gossip Protocol, and P2P mesh networking
-* [API Reference](api-reference.html) - Complete API documentation and examples
-* [Querying](querying.html) - Data querying patterns and LINQ support
-
-### Persistence & Storage
-* [Persistence Providers](persistence-providers.html) - SQLite, EF Core, PostgreSQL comparison and configuration
-* [Deployment Modes](deployment-modes.html) - Single-cluster deployment strategy
-
-### Networking & Security
-* [Security](security.html) - Encryption, authentication, and secure networking
-* [Conflict Resolution](conflict-resolution.html) - LWW and Recursive Merge strategies
-* [Network Telemetry](network-telemetry.html) - Monitoring and diagnostics
-* [Dynamic Reconfiguration](dynamic-reconfiguration.html) - Runtime configuration and leader election
-* [Remote Peer Configuration](remote-peer-configuration.html) - Managing remote peers and tracking lifecycle
-* [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.html) - Adoption and rollout notes
-
-### Deployment & Operations
-* [Deployment (LAN)](deployment-lan.html) - Platform-specific deployment instructions
-* [Production Hardening](production-hardening.html) - Configuration, monitoring, and best practices
-* [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html) - Operational deprecation/removal workflow
-
-## Previous Versions
-
-- [v0.8.x Documentation](v0.8/getting-started.html)
-- [v0.7.x Documentation](v0.7/getting-started.html)
-- [v0.6.x Documentation](v0.6/getting-started.html)
-
-## Links
-
-* [GitHub Repository](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
-* [NuGet Packages](https://www.nuget.org/packages?q=CBDDC)
-* [Changelog](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md)
+# CBDDC Documentation
+
+Welcome to the CBDDC documentation site.
+
+## Canonical Operational Docs
+
+- [Architecture](architecture.md)
+- [Deployment](deployment.md)
+- [Runbook](runbook.md)
+- [Security](security.md)
+- [Access and Permissions](access.md)
+- [Troubleshooting](troubleshooting.md)
+
+## Feature Docs
+
+- [Feature Inventory](features/README.md)
+- [Selective Collection Sync](features/selective-collection-sync.md)
+- [Peer-to-Peer Gossip Sync](features/peer-to-peer-gossip-sync.md)
+- [Secure Peer Transport](features/secure-peer-transport.md)
+- [Peer-Confirmed Pruning](features/peer-confirmed-pruning.md)
+
+## Additional References
+
+- [Getting Started](getting-started.md)
+- [API Reference](api-reference.md)
+- [Persistence Providers](persistence-providers.md)
+- [Deployment Modes](deployment-modes.md)
+- [Deployment (LAN)](deployment-lan.md)
+- [Production Hardening](production-hardening.md)
+- [Conflict Resolution](conflict-resolution.md)
+- [Network Telemetry](network-telemetry.md)
+- [Dynamic Reconfiguration](dynamic-reconfiguration.md)
+- [Remote Peer Configuration](remote-peer-configuration.md)
+- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
+- [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md)
+- [Database Sync Manager Design](database-sync-manager-design.md)
+
+## Release History
+
+See the [CHANGELOG](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) for version-level changes.
@@ -104,5 +104,5 @@ Fields stored per peer:
 which removes the peer from active prune-gating.
 - Re-observed peers can be re-registered and become active for tracking again.
 - For rollout steps and production operations, see:
-- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.html)
-- [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html)
+- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
+- [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md)
58
docs/runbook.md
Normal file
@@ -0,0 +1,58 @@
# Operations Runbook

This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.

## Ownership and Escalation

- Service owner: CBDDC Core Maintainers.
- First response: local platform/application on-call team for the affected deployment.
- Product issue escalation: open an incident issue in the CBDDC repository with logs and health payload.

## Alert Triage

1. Identify severity based on impact:
   - Sev 1: Data integrity risk, sustained outage, or broad replication failure.
   - Sev 2: Partial sync degradation or prolonged peer lag.
   - Sev 3: Isolated node issue with workaround.
2. Confirm current `cbddc` health check status and payload.
3. Identify affected peers, collections, and first observed time.
4. Apply the relevant recovery play below.

## Core Diagnostics

Capture these artifacts before remediation:

- Health response payload (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`).
- Application logs for sync, persistence, and network components.
- Current runtime configuration (excluding secrets).
- Most recent deployment identifier and change window.
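
Capturing runtime configuration "excluding secrets" is easy to get wrong by hand, so a small redaction pass before attaching the snapshot to an incident can help. The sketch below is an assumption-laden illustration in Python: the `redact` helper and the marker list are hypothetical, and only the `AuthToken` name comes from the CBDDC docs. It deliberately over-redacts, masking any setting whose name contains one of the markers.

```python
REDACTED = "***"
SECRET_MARKERS = ("token", "secret", "password", "key")  # assumed naming patterns


def redact(config):
    """Return a copy of a nested config with secret-looking values masked."""
    if isinstance(config, dict):
        return {
            name: REDACTED
            if any(marker in name.lower() for marker in SECRET_MARKERS)
            else redact(value)
            for name, value in config.items()
        }
    if isinstance(config, list):
        return [redact(item) for item in config]
    return config
```

The function returns a new structure and leaves the input unchanged, so the live configuration object is never modified during diagnostics.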

## Recovery Plays

### Peer unreachable or lagging

1. Verify network path and auth token consistency.
2. Validate peer is still expected in topology.
3. If peer is retired, follow [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
4. Recheck health status after remediation.

### Persistence failure

1. Verify storage path and permissions.
2. Run integrity checks.
3. Restore from latest valid backup if corruption is confirmed.
4. Validate replication behavior after restore.

### Configuration drift

1. Compare deployed config to approved baseline.
2. Reapply canonical settings.
3. Restart affected service safely.
4. Verify recovery with health and smoke checks.
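
Step 1 of the configuration drift play, comparing deployed settings to the approved baseline, can be sketched as a dictionary diff. This is a minimal illustration in Python under stated assumptions: `config_drift` is a hypothetical helper, it handles flat settings only, and the setting names used in any example call are invented.

```python
def config_drift(baseline: dict, deployed: dict) -> dict:
    """Map each drifted setting to its (baseline, deployed) pair.

    A setting missing on one side is reported with None for that side.
    """
    drift = {}
    for name in sorted(baseline.keys() | deployed.keys()):
        expected, actual = baseline.get(name), deployed.get(name)
        if expected != actual:
            drift[name] = (expected, actual)
    return drift
```

An empty result means the deployed configuration matches the baseline. Because a missing setting is reported as `None`, this sketch cannot distinguish "absent" from "explicitly null"; a stricter comparison would need a dedicated sentinel.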

## Post-Incident Actions

1. Record root cause and timeline.
2. Add follow-up work items (tests, alerts, docs updates).
3. Update affected feature docs and troubleshooting guidance.
4. Confirm rollback and recovery instructions remain accurate.
@@ -140,6 +140,6 @@ A: Security works on all target frameworks (netstandard2.0, net6.0, net8.0) with
 ---

 **See Also:**
-- [Getting Started](getting-started.html)
-- [Architecture](architecture.html)
-- [Production Hardening](production-hardening.html)
+- [Getting Started](getting-started.md)
+- [Architecture](architecture.md)
+- [Production Hardening](production-hardening.md)
86
docs/troubleshooting.md
Normal file
@@ -0,0 +1,86 @@
# Troubleshooting

This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.

## Peer Cannot Connect

Symptoms:

- Node remains disconnected from expected peers.
- Health check reports lagging or unconfirmed peers.

Likely causes:

- Network path blocked (port/firewall mismatch).
- `AuthToken` mismatch.
- Peer configuration drift.

Resolution:

1. Verify TCP/UDP port configuration on both peers.
2. Confirm shared token and node identity settings.
3. Restart peer service and monitor logs.
4. Recheck `cbddc` health payload.

## Replication Delay or Missing Updates

Symptoms:

- Writes are visible locally but not on remote peers.
- `maxLagMs` grows continuously.

Likely causes:

- Retired peer still tracked and gating pruning.
- High load or transient network instability.
- Invalid collection watch configuration.

Resolution:

1. Confirm affected collections are registered with `WatchCollection()`.
2. Inspect peer confirmation metrics.
3. If needed, de-track retired peers using the runbook.
4. Re-run smoke sync validation after changes.
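
The symptom "`maxLagMs` grows continuously" can be checked mechanically from a series of health payload readings. The sketch below is one possible heuristic, not a CBDDC feature: `lag_is_growing` is a hypothetical helper, and requiring three strictly increasing samples is an assumed threshold that should be tuned to the polling interval in use.

```python
def lag_is_growing(samples, min_samples: int = 3) -> bool:
    """True if the last `min_samples` maxLagMs readings strictly increase."""
    if len(samples) < min_samples:
        return False
    recent = samples[-min_samples:]
    return all(earlier < later for earlier, later in zip(recent, recent[1:]))
```

A single spike followed by recovery does not trip the check, which helps separate transient network instability from a peer that has stopped confirming.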

## Persistence Errors

Symptoms:

- Startup or write failures from persistence layer.
- Unhealthy health check due to storage exceptions.

Likely causes:

- File/path permission errors.
- Storage corruption.
- Misconfigured provider settings.

Resolution:

1. Validate storage path and runtime permissions.
2. Run integrity checks.
3. Restore from latest good backup if corruption is detected.
4. Validate read/write and replication after restore.

## Configuration Regressions After Release

Symptoms:

- Behavior changed immediately after deployment.
- Multiple nodes fail with same error pattern.

Likely causes:

- Incorrect environment variables or appsettings values.
- Partial rollout with incompatible settings.

Resolution:

1. Compare deployed configuration to approved baseline.
2. Roll back to last known-good release if production impact is high.
3. Redeploy with corrected configuration.
4. Document root cause and preventive controls.

## Escalation

If a Sev 1/Sev 2 condition cannot be resolved quickly, follow [Runbook](runbook.md) escalation and incident procedures.
@@ -66,4 +66,4 @@ If rollout exposes unexpected persistent degradation:
 3. Keep full peer removal for nodes that are confirmed decommissioned.

 For a detailed operator procedure, see
-[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html).
+[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md).