docs: align internal docs to enterprise standards

Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
Joseph Doherty
2026-02-20 13:23:55 -05:00
parent e6d81f6350
commit ce727eb30d
18 changed files with 783 additions and 186 deletions

164
README.md

@@ -5,7 +5,6 @@
**Peer-to-Peer Data Synchronization Middleware for .NET**
[![.NET Version](https://img.shields.io/badge/.NET-8.0%20%7C%2010.0-purple)](https://dotnet.microsoft.com/)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
## Status
![Version](https://img.shields.io/badge/version-1.0.0-blue.svg)
@@ -22,9 +21,18 @@ CBDDC is not a database - it's a **sync layer** that plugs into your existing da
## Table of Contents
- [Overview](#overview)
- [Ownership and Support](#ownership-and-support)
- [Architecture](#architecture)
- [Key Features](#key-features)
- [Installation](#installation)
- [Prerequisites](#prerequisites)
- [Configuration and Secrets](#configuration-and-secrets)
- [Build, Test, and Quality Gates](#build-test-and-quality-gates)
- [Deployment and Rollback](#deployment-and-rollback)
- [Operations and Incident Response](#operations-and-incident-response)
- [Security and Compliance Posture](#security-and-compliance-posture)
- [Troubleshooting](#troubleshooting)
- [Change Governance](#change-governance)
- [Quick Start](#quick-start)
- [Integrating with Your Database](#integrating-with-your-database)
- [Cloud Deployment](#cloud-deployment)
@@ -32,7 +40,6 @@ CBDDC is not a database - it's a **sync layer** that plugs into your existing da
- [Use Cases](#use-cases)
- [Documentation](#documentation)
- [Contributing](#contributing)
- [License](#license)
---
@@ -50,6 +57,15 @@ Your application continues to read and write to its database as usual. CBDDC wor
---
## Ownership and Support
- **Owning team**: CBDDC Core Maintainers
- **Issue reporting**: [GitHub Issues](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues)
- **Usage/support questions**: [GitHub Discussions](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/discussions)
- **Incident escalation for active deployments**: Follow your local on-call process first, then attach the CBDDC diagnostics listed in `docs/runbook.md`
---
## Architecture
```
@@ -163,6 +179,107 @@ dotnet add package CBDDC.Network
---
## Prerequisites
Before onboarding a new node, verify:
- .NET SDK/runtime required by selected packages (recommended: .NET 8+)
- Network access for configured TCP/UDP sync ports
- Persistent storage path with write permissions for database and backups
- Access to deployment secrets (cluster `AuthToken`, connection strings, environment-specific config)
- Ability to run health checks (`/health` in hosted mode)
---
## Configuration and Secrets
CBDDC runtime behavior should be configured from environment-specific settings, not hardcoded values.
- Required runtime settings:
- `NodeId`
- `TcpPort` (and `UdpPort` when discovery is enabled)
- `AuthToken` (secret; do not commit)
- Persistence connection/database path
- Secret handling guidance:
- Keep secrets in a secret manager or protected environment variables
- Rotate tokens during maintenance windows and validate mesh convergence after rotation
- Avoid exposing tokens in logs, screenshots, or troubleshooting output
See `docs/security.md` and `docs/production-hardening.md` for detailed control guidance.
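
As a rough illustration of the guidance above, a startup wrapper might confirm that the required settings are present without ever echoing their values. This is a sketch only: the environment variable names (`CBDDC_NODE_ID`, `CBDDC_TCP_PORT`, `CBDDC_AUTH_TOKEN`) and the function name `check_required_settings` are hypothetical and are not defined by CBDDC; map them to whatever your host application actually reads.

```bash
# Hypothetical variable names; CBDDC itself does not define them.
REQUIRED_SETTINGS="CBDDC_NODE_ID CBDDC_TCP_PORT CBDDC_AUTH_TOKEN"

# Reports the names (never the values) of missing settings on stderr.
# Returns 0 when every required variable is set and non-empty, 1 otherwise.
check_required_settings() {
  local missing=0 name value
  for name in $REQUIRED_SETTINGS; do
    # Look up the variable named by $name without printing its value.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required setting: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_required_settings || exit 1` before starting the node fails fast on an incomplete environment while keeping the token out of logs.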
---
## Build, Test, and Quality Gates
Run these checks before merge or release:
```bash
dotnet restore
dotnet build
dotnet test
```
Recommended release gates:
- Unit/integration tests pass
- Health endpoint reports healthy in staging after deployment
- No unresolved high-severity issues in deployment or runbook checklists
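
The checks above could be wrapped in a small gate runner that stops at the first failure. This is a sketch, not part of the CBDDC tooling: the function name `run_gates` is hypothetical, and each gate is assumed to be a simple command string with no arguments that need quoting.

```bash
# Runs each gate in order and stops at the first one that fails.
# Each argument is a simple command string, e.g. "dotnet test".
run_gates() {
  local step
  for step in "$@"; do
    echo "==> $step"
    # Unquoted on purpose so "dotnet test" splits into command and argument.
    $step || { echo "gate failed: $step" >&2; return 1; }
  done
}
```

For the commands listed in this section, the call would be `run_gates "dotnet restore" "dotnet build" "dotnet test"`.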
---
## Deployment and Rollback
Canonical deployment procedures are documented in:
- `docs/deployment.md` (promotion flow, validation gates, rollback path)
- `docs/deployment-lan.md` (LAN deployment details)
- `docs/deployment-modes.md` (hosted mode behavior)
- `docs/production-hardening.md` (operational hardening controls)
---
## Operations and Incident Response
Use `docs/runbook.md` as the primary operational runbook. It includes:
- Alert triage sequence
- Escalation path and severity mapping
- Recovery and verification steps
- Post-incident follow-up expectations
For peer lifecycle incidents, use `docs/peer-deprecation-removal-runbook.md`.
---
## Security and Compliance Posture
- Primary protections: authenticated peer communication and encrypted transport options in CBDDC networking components
- Sensitive data handling: classify data by environment and apply least privilege on runtime access
- Operational controls: hardening, key rotation, and monitoring requirements are documented in `docs/security.md` and `docs/access.md`
---
## Troubleshooting
Common issue patterns and step-by-step remediation are documented in `docs/troubleshooting.md`.
High-priority troubleshooting topics:
- Peer unreachable or lagging confirmation state
- Persistence faults and backup restore flow
- Authentication mismatch or configuration drift
---
## Change Governance
- Use short-lived branches and pull requests for every change
- Require review before merge to protected branches
- Run build/test/health validation before release
- Record release-impacting changes in `CHANGELOG.md` and link related deployment notes
---
## Quick Start
### 1. Define Your Database Context
@@ -608,13 +725,14 @@ Console.WriteLine($"Peers: {status.ConnectedPeers}");
- **[Architecture & Concepts](docs/architecture.md)** - HLC, Gossip, Vector Clocks, Hash Chains
- **[Conflict Resolution](docs/conflict-resolution.md)** - LWW vs Recursive Merge
- **[Oplog & CDC](docs/oplog-cdc.md)** - How change tracking works
- **[Oplog & CDC Design](docs/database-sync-manager-design.md)** - How change tracking works
### Deployment
- **[Production Guide](docs/production-hardening.md)** - Configuration, monitoring, best practices
- **[Cloud Deployment](docs/cloud-deployment.md)** - ASP.NET Core hosting
- **[Deployment Runbook](docs/deployment.md)** - Promotion flow, validation, rollback
- **[Deployment Modes](docs/deployment-modes.md)** - Single-cluster deployment strategy
- **[Operations Runbook](docs/runbook.md)** - Incident triage and recovery
### API
@@ -711,41 +829,3 @@ dotnet run
### Code of Conduct
Be respectful, inclusive, and constructive. We're all here to learn and build great software together.
---
## License
CBDDC is licensed under the **MIT License**.
```
MIT License
Copyright (c) 2026 MrDevRobot
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software...
```
See [LICENSE](LICENSE) file for full details.
---
## Give it a Star!
If you find CBDDC useful, please **give it a star** on GitHub! It helps others discover the project and motivates us to keep improving it.
<div align="center">
### [Star on GitHub](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
**Thank you for your support!**
**Built with care for the .NET community**
[Report Bug](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) | [Request Feature](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) | [Discussions](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/discussions)
</div>


@@ -1,73 +1,45 @@
# CBDDC Documentation
This folder contains the official documentation for CBDDC, published as GitHub Pages.
## Documentation Structure (v0.9.0)
### Getting Started
- **[Getting Started](getting-started.md)** - Installation, setup, and first steps with CBDDC
### Core Documentation
- **[Architecture](architecture.md)** - Hybrid Logical Clocks, Gossip Protocol, mesh networking
- **[API Reference](api-reference.md)** - Complete API documentation with examples
- **[Querying](querying.md)** - Data querying patterns and LINQ support
### Persistence & Storage
- **[Persistence Providers](persistence-providers.md)** - SQLite, EF Core, PostgreSQL comparison
- **[Deployment Modes](deployment-modes.md)** - Single-cluster deployment strategy
### Networking & Security
- **[Security](security.md)** - Encryption, authentication, secure networking
- **[Conflict Resolution](conflict-resolution.md)** - LWW and Recursive Merge strategies
- **[Network Telemetry](network-telemetry.md)** - Monitoring and diagnostics
- **[Dynamic Reconfiguration](dynamic-reconfiguration.md)** - Runtime configuration changes
- **[Remote Peer Configuration](remote-peer-configuration.md)** - Managing remote peers and tracking lifecycle
- **[Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)** - Rollout notes and adoption checklist
# CBDDC Documentation Index
### Deployment & Operations
- **[Deployment (LAN)](deployment-lan.md)** - Platform-specific deployment guide
- **[Production Hardening](production-hardening.md)** - Configuration, monitoring, best practices
- **[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md)** - Operational workflow for de-tracking and removal
## Building the Documentation
This documentation uses Jekyll with the Cayman theme and is automatically published via GitHub Pages.
### Local Development
```bash
# Install Jekyll (requires Ruby)
gem install bundler jekyll
# Serve documentation locally
cd docs
jekyll serve
# Open http://localhost:4000
```
### Site Configuration
- **_config.yml** - Jekyll configuration and site metadata
- **_layouts/default.html** - Main page layout with navigation and styling
- **_includes/nav.html** - Top navigation bar
- **_data/navigation.yml** - Sidebar navigation structure per version
## Version History
The documentation supports multiple versions:
- **v0.9** (current) - Latest stable release
- **v0.8** - ASP.NET Core hosting, EF Core, PostgreSQL support
- **v0.7** - Brotli compression, Protocol v4
- **v0.6** - Secure networking, conflict resolution strategies
## Contributing to Documentation
1. **Fix typos or improve clarity** - Submit PRs directly
2. **Add new guides** - Create a new .md file and update navigation.yml
3. **Test locally** - Run Jekyll locally to preview changes before submitting
## Links
- [Main Repository](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
- [NuGet Packages](https://www.nuget.org/packages?q=CBDDC)
This folder contains operational and engineering documentation for CBDDC.
## Core Documentation
- [Architecture](architecture.md)
- [Deployment](deployment.md)
- [Runbook](runbook.md)
- [Security](security.md)
- [Access and Permissions](access.md)
- [Troubleshooting](troubleshooting.md)
## Feature Documentation
- [Feature Inventory](features/README.md)
- [Selective Collection Sync](features/selective-collection-sync.md)
- [Peer-to-Peer Gossip Sync](features/peer-to-peer-gossip-sync.md)
- [Secure Peer Transport](features/secure-peer-transport.md)
- [Peer-Confirmed Pruning](features/peer-confirmed-pruning.md)
## Additional Technical References
- [Getting Started](getting-started.md)
- [API Reference](api-reference.md)
- [Conflict Resolution](conflict-resolution.md)
- [Persistence Providers](persistence-providers.md)
- [Network Telemetry](network-telemetry.md)
- [Dynamic Reconfiguration](dynamic-reconfiguration.md)
- [Remote Peer Configuration](remote-peer-configuration.md)
- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
- [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md)
- [Database Sync Manager Design](database-sync-manager-design.md)
## Deployment-Specific Guides
- [Deployment Modes](deployment-modes.md)
- [Deployment (LAN)](deployment-lan.md)
- [Production Hardening](production-hardening.md)
## Documentation Conventions
- Use lowercase kebab-case for markdown files.
- Keep canonical operational docs at the top level under `docs/`.
- Store major feature documentation in `docs/features/` and keep `docs/features/README.md` updated.

44
docs/access.md Normal file

@@ -0,0 +1,44 @@
# Access and Permissions
This document defines the least-privilege access model for CBDDC environments.
## Roles
| Role | Typical Permissions | Approval Required |
|------|---------------------|-------------------|
| Runtime Operator | Read health/logs, restart service, run incident checks | Team lead or on-call manager |
| Deployment Engineer | Deploy approved releases, update runtime configuration | Change approval for production |
| Security Administrator | Manage secrets, rotate tokens, review access | Security approval |
| Maintainer | Modify CBDDC source/docs, merge reviewed changes | Pull request review |
## Least-Privilege Rules
- Grant access by role, not by individual preference.
- Use environment-specific credentials and scoped service accounts.
- Do not share production credentials across environments.
- Remove elevated access promptly after the incident or change window closes.
## Approval Flow
1. Request access with role, environment, and business reason.
2. Approver validates least-privilege scope.
3. Access is granted with an expiration date when applicable.
4. Grant/revoke events are logged for auditability.
## Periodic Access Review
- Review active privileged access at least quarterly.
- Remove dormant or unowned accounts immediately.
- Validate that emergency access accounts are controlled and monitored.
## Secret Handling
- Store `AuthToken`, connection strings, and credentials in approved secret stores.
- Never commit secrets to source control.
- Rotate secrets after incidents and on a scheduled cadence.
## Related Documents
- [Security](security.md)
- [Runbook](runbook.md)
- [Production Hardening](production-hardening.md)


@@ -0,0 +1,19 @@
# ADR 0001: Peer Mesh Topology
## Status
Accepted
## Context
CBDDC requires local-first synchronization without a central coordinator in LAN-focused deployments.
## Decision
Adopt a peer mesh topology with gossip-based synchronization and per-node oplog tracking.
## Consequences
- Improves resilience when individual nodes disconnect.
- Requires strong operational tooling for lag, peer lifecycle, and pruning controls.
- Increases importance of clear runbook and troubleshooting documentation.


@@ -246,6 +246,6 @@ A: No. Conflict resolution assumes both documents have compatible schemas.
---
**See Also:**
- [Getting Started](getting-started.html)
- [Architecture](architecture.html)
- [Security](security.html)
- [Getting Started](getting-started.md)
- [Architecture](architecture.md)
- [Security](security.md)

61
docs/deployment.md Normal file

@@ -0,0 +1,61 @@
# Deployment
This document defines the canonical CBDDC deployment workflow, validation gates, and rollback procedures.
## Environments
- Local development: single node for functional verification.
- Staging: multi-node mesh or hosted mode with production-like configuration.
- Production: controlled rollout with health checks, monitoring, and rollback readiness.
## Promotion Workflow
1. Build and test in CI (`dotnet build`, `dotnet test`).
2. Validate configuration and secrets for the target environment.
3. Deploy to staging.
4. Run smoke checks:
- Node starts and joins expected peers.
- `/health` reports healthy.
- Sync operations replicate across target collections.
5. Promote to production during an approved change window.
## Validation Gates
- No failed test suites.
- No unresolved critical incidents.
- Health check status is healthy, or is an approved degraded state with documented mitigation.
- Backup/restore path confirmed for the active persistence provider.
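
The health gate above can be expressed as a small decision function. This is a sketch under stated assumptions: the overall status is taken to be one of `Healthy`, `Degraded`, or `Unhealthy` (the status names used by ASP.NET Core health checks), extracting that string from the `/health` response is left to the deployment, and the function name `health_gate_passes` is hypothetical.

```bash
# $1: overall health status string (assumed: Healthy, Degraded, or Unhealthy)
# $2: "yes" if a degraded state has been approved with mitigation (default: no)
# Returns 0 when the gate passes, non-zero otherwise.
health_gate_passes() {
  case "$1" in
    Healthy) return 0 ;;
    Degraded) [ "${2:-no}" = "yes" ] ;;
    *) return 1 ;;  # Unhealthy, empty, or unrecognized status fails the gate
  esac
}
```

Treating an empty or unrecognized status as a failure keeps the gate closed when the health endpoint cannot be reached.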
## Rollback Triggers
Rollback immediately when any of the following occurs:
- Sync correctness regression (missed or duplicate replication events).
- Persistent unhealthy status after remediation attempt.
- Authentication or connectivity failure across required peers.
- Data integrity concern reported by validation checks.
## Rollback Procedure
1. Stop traffic or disable rollout to newly deployed nodes.
2. Revert to the last known-good build.
3. Restore previous configuration and secret set.
4. If needed, restore persistence snapshot/backup.
5. Re-run health and replication smoke checks.
6. Record incident details in the runbook timeline.
## Emergency Changes
Use emergency deployment only for incident containment:
1. Capture incident reference and approver.
2. Apply the minimal scoped change.
3. Run abbreviated smoke checks (startup, health, critical replication path).
4. Follow up with standard post-incident review.
## Related Documents
- [Deployment Modes](deployment-modes.md)
- [Deployment (LAN)](deployment-lan.md)
- [Production Hardening](production-hardening.md)
- [Runbook](runbook.md)

16
docs/features/README.md Normal file

@@ -0,0 +1,16 @@
# Major Feature Inventory
This index tracks CBDDC's major features. Each feature has one canonical document.
## Features
- [Selective Collection Sync](selective-collection-sync.md)
- [Peer-to-Peer Gossip Sync](peer-to-peer-gossip-sync.md)
- [Secure Peer Transport](secure-peer-transport.md)
- [Peer-Confirmed Pruning](peer-confirmed-pruning.md)
## Maintenance Rules
- Keep one file per major feature.
- Update feature docs when APIs, operations, or security controls change.
- Cross-link each feature to [Runbook](../runbook.md) and [Security](../security.md).


@@ -0,0 +1,68 @@
# Feature: Peer-Confirmed Pruning
## Purpose and Business Outcome
Prune oplog history safely while preventing data loss for active peers that have not confirmed required streams.
## Scope and Non-Goals
Scope:
- Retention cutoff plus peer confirmation cutoff logic.
- Operational controls for peer tracking and de-tracking.
Non-goals:
- Automatic removal of retired peers without operator action.
- Replacement for backup/restore strategy.
## User and System Workflows
1. Maintenance job calculates retention and confirmation cutoffs.
2. System blocks pruning when required confirmations are missing.
3. Operator de-tracks retired peers when appropriate.
4. Pruning resumes once constraints are satisfied.
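
The gating rule in this workflow can be sketched as follows. This is an illustration, not CBDDC's actual implementation: it assumes cutoffs are comparable integer timestamps, that the effective prune cutoff is the older of the retention cutoff and the oldest confirmation among tracked peers, and that a tracked peer with no confirmation is passed as the literal `none`. The function name `effective_prune_cutoff` is hypothetical.

```bash
# $1: retention cutoff (integer timestamp)
# remaining args: latest confirmed timestamp per tracked peer,
#                 or "none" for a peer that has not confirmed yet
# Prints the effective cutoff, or "blocked" (with status 1) if any confirmation is missing.
effective_prune_cutoff() {
  local cutoff="$1" confirmed
  shift
  for confirmed in "$@"; do
    if [ "$confirmed" = "none" ]; then
      echo "blocked"
      return 1
    fi
    # Never prune past what the slowest tracked peer has confirmed.
    if [ "$confirmed" -lt "$cutoff" ]; then
      cutoff="$confirmed"
    fi
  done
  echo "$cutoff"
}
```

De-tracking a retired peer corresponds to dropping its argument from the call, which is why pruning can resume afterwards.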
## Interfaces, APIs, and Events Involved
- Maintenance pruning scheduler
- Peer tracking operations (`RemovePeerTrackingAsync`, `RemoveRemotePeerAsync`)
- Health metrics for lag and missing confirmations
## Permissions and Data Handling
- Only approved operators should modify peer tracking state.
- Pruning decisions must be auditable through logs and incident notes.
## Dependencies and Failure Modes
Dependencies:
- Accurate peer tracking metadata
- Timely confirmation updates per source stream
Failure modes:
- Pruning blocked indefinitely by stale peer tracking
- Unsafe pruning if controls are bypassed
## Monitoring, Alerts, and Troubleshooting Pointers
- Monitor peers with missing confirmations and sustained lag.
- Use [Peer Deprecation and Removal Runbook](../peer-deprecation-removal-runbook.md) for operational actions.
## Rollout and Change Considerations
- Introduce pruning policy changes with explicit maintenance windows.
- Validate expected cutoff behavior in staging before production rollout.
## Validation and Testability Guidance
- Add tests for blocked prune when confirmations are missing.
- Add tests for resumed prune after de-tracking retired peers.
- Smoke test health status transitions around peer lifecycle changes.
## Related Security Controls
- [Security](../security.md)
- [Runbook](../runbook.md)


@@ -0,0 +1,70 @@
# Feature: Peer-to-Peer Gossip Sync
## Purpose and Business Outcome
Propagate updates across mesh nodes without a central coordinator so local-first applications remain resilient to intermittent connectivity.
## Scope and Non-Goals
Scope:
- Peer discovery and sync orchestration.
- Push/pull propagation of oplog changes.
Non-goals:
- Strong global consistency guarantees.
- Public internet exposure without additional network controls.
## User and System Workflows
1. Node starts and discovers peers.
2. Node exchanges sync metadata with connected peers.
3. Missing operations are requested and applied.
4. Mesh converges over repeated gossip rounds.
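
Steps 2 and 3 can be illustrated with a simplified comparison of sync metadata. This sketch assumes each clock is a space-separated list of `node=sequence` pairs with plain integer counters; the real CBDDC metadata format is not described here, and the function names `clock_get` and `missing_ranges` are hypothetical.

```bash
# Looks up a node's counter in a clock written as "nodeA=3 nodeB=5".
# Prints 0 when the node is absent.
clock_get() {
  local entry
  for entry in $1; do
    if [ "${entry%%=*}" = "$2" ]; then
      echo "${entry#*=}"
      return 0
    fi
  done
  echo 0
}

# $1: local clock, $2: remote clock.
# Prints one "node from-to" line for each node where the remote peer is ahead.
missing_ranges() {
  local local_clock="$1" remote_clock="$2" entry node remote_seq local_seq
  for entry in $remote_clock; do
    node="${entry%%=*}"
    remote_seq="${entry#*=}"
    local_seq="$(clock_get "$local_clock" "$node")"
    if [ "$remote_seq" -gt "$local_seq" ]; then
      echo "$node $((local_seq + 1))-$remote_seq"
    fi
  done
}
```

Each printed range is what the local node would request from that peer; repeating the exchange in both directions on every round is what lets the mesh converge.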
## Interfaces, APIs, and Events Involved
- Sync orchestrator scheduling
- TCP peer sync channels
- Vector clock exchange and reconciliation
## Permissions and Data Handling
- Peers with a valid authentication token can exchange replicated collection data.
- Cluster membership should be restricted to trusted nodes.
## Dependencies and Failure Modes
Dependencies:
- Network reachability
- Shared authentication material
- Healthy persistence layer
Failure modes:
- Peer isolation due to network outage
- Token mismatch blocking synchronization
- Sustained lag under high write pressure
## Monitoring, Alerts, and Troubleshooting Pointers
- Monitor `laggingPeers`, `maxLagMs`, and active peer counts.
- Follow [Runbook](../runbook.md) playbooks for lagging or disconnected peers.
## Rollout and Change Considerations
- Roll out protocol-affecting changes in a controlled window.
- Confirm backward/forward compatibility in staging mesh before production rollout.
## Validation and Testability Guidance
- Run multi-node integration tests with controlled partitions.
- Validate eventual convergence after reconnect.
- Verify no data loss under repeated reconnect scenarios.
## Related Security Controls
- [Security](../security.md)
- [Access and Permissions](../access.md)


@@ -0,0 +1,67 @@
# Feature: Secure Peer Transport
## Purpose and Business Outcome
Protect replicated data in transit with authenticated and encrypted peer communication.
## Scope and Non-Goals
Scope:
- Secure handshake and key establishment.
- Message confidentiality and integrity controls.
Non-goals:
- Data-at-rest encryption.
- Full identity and certificate lifecycle management.
## User and System Workflows
1. Operator enables secure transport components.
2. Peers perform handshake and establish session keys.
3. Replication traffic is encrypted/authenticated.
4. Health and logs expose secure mode status.
## Interfaces, APIs, and Events Involved
- `IPeerHandshakeService` / secure handshake implementation
- Network pipeline message encryption and HMAC validation
- Startup configuration for secure mode
## Permissions and Data Handling
- Secret material (`AuthToken`, key inputs) must be restricted to authorized operators.
- Logs must avoid plaintext secret disclosure.
## Dependencies and Failure Modes
Dependencies:
- Consistent security mode across peers
- Valid runtime cryptographic dependencies
Failure modes:
- Secure/plaintext mode mismatch
- Handshake failure due to key/token mismatch
## Monitoring, Alerts, and Troubleshooting Pointers
- Alert on repeated handshake failures.
- Use [Runbook](../runbook.md) for incident triage and [Troubleshooting](../troubleshooting.md) for remediation.
## Rollout and Change Considerations
- Enable secure mode in staging first.
- Roll production nodes in controlled order to avoid mixed-mode partitions.
## Validation and Testability Guidance
- Add tests for secure-to-secure success and mixed-mode rejection.
- Validate encrypted cluster startup and sync with production-like load.
## Related Security Controls
- [Security](../security.md)
- [Access and Permissions](../access.md)


@@ -0,0 +1,69 @@
# Feature: Selective Collection Sync
## Purpose and Business Outcome
Allow teams to replicate only selected collections so bandwidth and operational overhead stay aligned to business-critical data.
## Scope and Non-Goals
Scope:
- Register collections for replication using `WatchCollection()`.
- Replicate changes for registered collections across peers.
Non-goals:
- Automatic replication of all database collections.
- Schema migration management.
## User and System Workflows
1. Developer registers target collections in the document store.
2. Local writes trigger CDC events.
3. Oplog entries propagate through peer sync.
4. Remote peers apply updates for matching collections.
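
The effect of registration on what gets applied can be sketched as a simple filter. This is a conceptual illustration only, not how CBDDC stores or transports its oplog: it assumes entries arrive as `collection key` lines and that the registered collections are given as a space-separated list. The function names `is_registered` and `filter_oplog` are hypothetical.

```bash
# $1: space-separated names of collections registered for sync
# $2: collection name to check
is_registered() {
  local registered="$1" collection="$2" name
  for name in $registered; do
    if [ "$name" = "$collection" ]; then
      return 0
    fi
  done
  return 1
}

# Reads "collection key" lines on stdin and prints only those
# whose collection appears in the registered list passed as $1.
filter_oplog() {
  local collection rest
  while read -r collection rest; do
    if is_registered "$1" "$collection"; then
      echo "$collection $rest"
    fi
  done
}
```

With `orders` and `users` registered, an entry for an unregistered `audit` collection is dropped, which matches the "missed replication due to unregistered collection" failure mode listed below.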
## Interfaces, APIs, and Events Involved
- `WatchCollection(collectionName, collection, keySelector)`
- CDC trigger pipeline
- Oplog append and apply operations
## Permissions and Data Handling
- Access to source collections is controlled by host application permissions.
- Only approved collections should be registered for sync in sensitive environments.
## Dependencies and Failure Modes
Dependencies:
- Correct collection registration
- Stable peer connectivity
- Persistence availability
Failure modes:
- Missed replication due to unregistered collection
- Delayed propagation during network partition
## Monitoring, Alerts, and Troubleshooting Pointers
- Monitor replication lag and peer confirmation metrics.
- Use [Runbook](../runbook.md) and [Troubleshooting](../troubleshooting.md) for incident response.
## Rollout and Change Considerations
- Introduce new synced collections behind a staged rollout.
- Validate downstream consumer compatibility before production enablement.
## Validation and Testability Guidance
- Add integration tests verifying only registered collections replicate.
- Smoke test by writing to registered and non-registered collections and confirming expected behavior.
- Validate that no unexpected collections appear on remote peers after deployment.
## Related Security Controls
- [Security](../security.md)
- [Access and Permissions](../access.md)


@@ -179,10 +179,10 @@ Choose your strategy:
## Next Steps
- [Architecture Overview](architecture.html) - Understand HLC, Gossip Protocol, and mesh networking
- [Persistence Providers](persistence-providers.html) - Choose the right database for your deployment
- [Deployment Modes](deployment-modes.html) - Single-cluster deployment strategy
- [Security Configuration](security.html) - Encryption and authentication
- [Conflict Resolution Strategies](conflict-resolution.html) - LWW vs Recursive Merge
- [Production Hardening](production-hardening.html) - Best practices and monitoring
- [API Reference](api-reference.html) - Complete API documentation
- [Architecture Overview](architecture.md) - Understand HLC, Gossip Protocol, and mesh networking
- [Persistence Providers](persistence-providers.md) - Choose the right database for your deployment
- [Deployment Modes](deployment-modes.md) - Single-cluster deployment strategy
- [Security Configuration](security.md) - Encryption and authentication
- [Conflict Resolution Strategies](conflict-resolution.md) - LWW vs Recursive Merge
- [Production Hardening](production-hardening.md) - Best practices and monitoring
- [API Reference](api-reference.md) - Complete API documentation


@@ -1,57 +1,44 @@
---
layout: default
---
# CBDDC Documentation
Welcome to the CBDDC documentation for **version 0.9.0**.
## What's New in v0.9.0
Version 0.9.0 brings enhanced ASP.NET Core support, improved EF Core runtime stability, and refined synchronization and persistence layers.
- **ASP.NET Core Enhancements**: Improved sample application with better error handling
- **EF Core Stability**: Fixed runtime issues and improved persistence layer reliability
- **Sync & Persistence**: Enhanced stability across all persistence providers
- **Production Ready**: All features tested and stable for production use
See the [CHANGELOG](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) for complete details.
## Documentation
### Getting Started
* [**Getting Started**](getting-started.html) - Installation, basic setup, and your first CBDDC application
### Core Concepts
* [Architecture](architecture.html) - Understanding HLC, Gossip Protocol, and P2P mesh networking
* [API Reference](api-reference.html) - Complete API documentation and examples
* [Querying](querying.html) - Data querying patterns and LINQ support
### Persistence & Storage
* [Persistence Providers](persistence-providers.html) - SQLite, EF Core, PostgreSQL comparison and configuration
* [Deployment Modes](deployment-modes.html) - Single-cluster deployment strategy
### Networking & Security
* [Security](security.html) - Encryption, authentication, and secure networking
* [Conflict Resolution](conflict-resolution.html) - LWW and Recursive Merge strategies
* [Network Telemetry](network-telemetry.html) - Monitoring and diagnostics
* [Dynamic Reconfiguration](dynamic-reconfiguration.html) - Runtime configuration and leader election
* [Remote Peer Configuration](remote-peer-configuration.html) - Managing remote peers and tracking lifecycle
* [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.html) - Adoption and rollout notes
---
layout: default
---
### Deployment & Operations
* [Deployment (LAN)](deployment-lan.html) - Platform-specific deployment instructions
* [Production Hardening](production-hardening.html) - Configuration, monitoring, and best practices
* [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html) - Operational deprecation/removal workflow
## Previous Versions
- [v0.8.x Documentation](v0.8/getting-started.html)
- [v0.7.x Documentation](v0.7/getting-started.html)
- [v0.6.x Documentation](v0.6/getting-started.html)
## Links
* [GitHub Repository](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
* [NuGet Packages](https://www.nuget.org/packages?q=CBDDC)
* [Changelog](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md)
# CBDDC Documentation
Welcome to the CBDDC documentation site.
## Canonical Operational Docs
- [Architecture](architecture.md)
- [Deployment](deployment.md)
- [Runbook](runbook.md)
- [Security](security.md)
- [Access and Permissions](access.md)
- [Troubleshooting](troubleshooting.md)
## Feature Docs
- [Feature Inventory](features/README.md)
- [Selective Collection Sync](features/selective-collection-sync.md)
- [Peer-to-Peer Gossip Sync](features/peer-to-peer-gossip-sync.md)
- [Secure Peer Transport](features/secure-peer-transport.md)
- [Peer-Confirmed Pruning](features/peer-confirmed-pruning.md)
## Additional References
- [Getting Started](getting-started.md)
- [API Reference](api-reference.md)
- [Persistence Providers](persistence-providers.md)
- [Deployment Modes](deployment-modes.md)
- [Deployment (LAN)](deployment-lan.md)
- [Production Hardening](production-hardening.md)
- [Conflict Resolution](conflict-resolution.md)
- [Network Telemetry](network-telemetry.md)
- [Dynamic Reconfiguration](dynamic-reconfiguration.md)
- [Remote Peer Configuration](remote-peer-configuration.md)
- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
- [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md)
- [Database Sync Manager Design](database-sync-manager-design.md)
## Release History
See the [CHANGELOG](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) for version-level changes.


@@ -104,5 +104,5 @@ Fields stored per peer:
which removes the peer from active prune-gating.
- Re-observed peers can be re-registered and become active for tracking again.
- For rollout steps and production operations, see:
- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.html)
- [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html)
- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
- [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md)

58
docs/runbook.md Normal file

@@ -0,0 +1,58 @@
# Operations Runbook
This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.
## Ownership and Escalation
- Service owner: CBDDC Core Maintainers.
- First response: local platform/application on-call team for the affected deployment.
- Product issue escalation: open an incident issue in the CBDDC repository with logs and health payload.
## Alert Triage
1. Identify severity based on impact:
- Sev 1: Data integrity risk, sustained outage, or broad replication failure.
- Sev 2: Partial sync degradation or prolonged peer lag.
- Sev 3: Isolated node issue with workaround.
2. Confirm current `cbddc` health check status and payload.
3. Identify affected peers, collections, and first observed time.
4. Apply the relevant recovery play below.
## Core Diagnostics
Capture these artifacts before remediation:
- Health response payload (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`).
- Application logs for sync, persistence, and network components.
- Current runtime configuration (excluding secrets).
- Most recent deployment identifier and change window.
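
The health fields listed above can be pulled out of a saved payload before remediation starts. The following Python sketch is illustrative only: it assumes the payload is a JSON object with those fields at the top level, which this runbook does not confirm, and `summarize_health` is a hypothetical helper rather than part of CBDDC.

```python
import json

# Field names come from this runbook; their position and JSON types are assumptions.
FIELDS = ("trackedPeerCount", "laggingPeers", "peersWithNoConfirmation", "maxLagMs")


def summarize_health(payload_text):
    """Return the diagnostic fields found in a health payload and list any that are missing."""
    payload = json.loads(payload_text)
    present = {name: payload[name] for name in FIELDS if name in payload}
    missing = [name for name in FIELDS if name not in payload]
    return {"fields": present, "missing": missing}


if __name__ == "__main__":
    # Hypothetical sample payload, for illustration only.
    sample = '{"trackedPeerCount": 3, "laggingPeers": ["node-b"], "maxLagMs": 1250}'
    print(json.dumps(summarize_health(sample), indent=2))
```

Recording the `missing` list alongside the captured values makes it obvious when a payload was truncated or came from a build that reports different fields.
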
## Recovery Plays
### Peer unreachable or lagging
1. Verify the network path and confirm that `AuthToken` matches on both peers.
2. Validate peer is still expected in topology.
3. If peer is retired, follow [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
4. Recheck health status after remediation.
### Persistence failure
1. Verify storage path and permissions.
2. Run integrity checks.
3. Restore from latest valid backup if corruption is confirmed.
4. Validate replication behavior after restore.
### Configuration drift
1. Compare deployed config to approved baseline.
2. Reapply canonical settings.
3. Restart affected service safely.
4. Verify recovery with health and smoke checks.
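
Comparing a deployed configuration to the approved baseline can be scripted. The sketch below assumes both configurations have already been loaded into nested dictionaries (for example from JSON settings files). The key names in the usage example, the `SECRET_MARKERS` naming convention, and the helper names are all hypothetical. Values under secret-looking keys are redacted so the drift report can be attached to an incident.

```python
# Assumed convention: any setting whose path contains one of these words is a secret.
SECRET_MARKERS = ("token", "secret", "password", "key")
ABSENT = "<absent>"
REDACTED = "<redacted>"


def flatten(config, prefix=""):
    """Flatten nested dictionaries into dotted-path keys."""
    flat = {}
    for name, value in config.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def find_drift(baseline, deployed):
    """Return {path: (baseline_value, deployed_value)} for every setting that differs."""
    base, live = flatten(baseline), flatten(deployed)
    drift = {}
    for path in sorted(set(base) | set(live)):
        expected = base.get(path, ABSENT)
        actual = live.get(path, ABSENT)
        if expected == actual:
            continue
        if any(marker in path.lower() for marker in SECRET_MARKERS):
            # Flag the drift without leaking either value.
            expected = actual = REDACTED
        drift[path] = (expected, actual)
    return drift


if __name__ == "__main__":
    baseline = {"Network": {"TcpPort": 5000, "AuthToken": "aaa"}}
    deployed = {"Network": {"TcpPort": 5001, "AuthToken": "bbb"}}
    for path, (expected, actual) in find_drift(baseline, deployed).items():
        print(f"{path}: baseline={expected} deployed={actual}")
```

An empty result means the deployed settings match the baseline for every key present in either file.
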
## Post-Incident Actions
1. Record root cause and timeline.
2. Add follow-up work items (tests, alerts, docs updates).
3. Update affected feature docs and troubleshooting guidance.
4. Confirm rollback and recovery instructions remain accurate.


@@ -140,6 +140,6 @@ A: Security works on all target frameworks (netstandard2.0, net6.0, net8.0) with
---
**See Also:**
- [Getting Started](getting-started.html)
- [Architecture](architecture.html)
- [Production Hardening](production-hardening.html)
- [Getting Started](getting-started.md)
- [Architecture](architecture.md)
- [Production Hardening](production-hardening.md)

docs/troubleshooting.md (new file, 86 lines)

@@ -0,0 +1,86 @@
# Troubleshooting
This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.
## Peer Cannot Connect
Symptoms:
- Node remains disconnected from expected peers.
- Health check reports lagging or unconfirmed peers.
Likely causes:
- Network path blocked (port/firewall mismatch).
- `AuthToken` mismatch.
- Peer configuration drift.
Resolution:
1. Verify TCP/UDP port configuration on both peers.
2. Confirm that the shared `AuthToken` and node identity settings match on both peers.
3. Restart peer service and monitor logs.
4. Recheck `cbddc` health payload.
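
For the port check in step 1, a quick TCP reachability probe can rule out a blocked path before digging into tokens or configuration. This is a minimal sketch using Python's standard `socket` module. `tcp_port_open` is a hypothetical helper, and the host and port to probe must come from your own deployment settings. It checks TCP only; a UDP port cannot be verified with a simple connect.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run the probe from the node that cannot connect, targeting the remote peer's configured TCP port. A `False` result points at the network path (port or firewall) rather than at `AuthToken` or peer configuration.
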
## Replication Delay or Missing Updates
Symptoms:
- Writes are visible locally but not on remote peers.
- `maxLagMs` grows continuously.
Likely causes:
- Retired peer still tracked and gating pruning.
- High load or transient network instability.
- Invalid collection watch configuration.
Resolution:
1. Confirm affected collections are registered with `WatchCollection()`.
2. Inspect peer confirmation metrics.
3. If needed, de-track retired peers using the [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
4. Re-run smoke sync validation after changes.
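
To tell a sustained lag increase from a transient spike, sample `maxLagMs` from the health payload at a fixed interval and check the trend. The sketch below treats "grows continuously" as "each sample is strictly larger than the previous one", which is an assumption made for illustration; `lag_is_growing` and the default of three samples are likewise hypothetical.

```python
def lag_is_growing(samples, min_samples=3):
    """Return True if every maxLagMs sample is strictly larger than the one before it."""
    if len(samples) < min_samples:
        # Too few observations to call it a trend.
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


if __name__ == "__main__":
    print(lag_is_growing([120, 480, 1900]))  # steadily increasing samples
    print(lag_is_growing([120, 480, 300]))   # lag recovered on the last sample
```

A strictly increasing series is a reason to inspect peer confirmation metrics; a series that dips suggests load or transient network instability instead.
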
## Persistence Errors
Symptoms:
- Startup or write failures from persistence layer.
- Unhealthy health check due to storage exceptions.
Likely causes:
- File/path permission errors.
- Storage corruption.
- Misconfigured provider settings.
Resolution:
1. Validate storage path and runtime permissions.
2. Run integrity checks.
3. Restore from latest good backup if corruption is detected.
4. Validate read/write and replication after restore.
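
Step 1 can be checked with a short script run under the same account as the service. This sketch uses only Python's standard `os` module; `check_storage_path` is a hypothetical helper, and the path to check is whatever your persistence provider is configured to use. It verifies existence and access rights only and does not detect storage corruption.

```python
import os


def check_storage_path(path):
    """Return a list of problems found with the storage path (empty if none)."""
    if not os.path.exists(path):
        return [f"{path} does not exist"]
    # Directories also need execute (traverse) permission.
    needed = os.R_OK | os.W_OK
    if os.path.isdir(path):
        needed |= os.X_OK
    if not os.access(path, needed):
        return [f"{path} is not readable and writable by the current user"]
    return []
```

An empty list means the current user can reach and write to the path, so the next thing to examine is provider configuration or storage integrity.
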
## Configuration Regressions After Release
Symptoms:
- Behavior changed immediately after deployment.
- Multiple nodes fail with same error pattern.
Likely causes:
- Incorrect environment variables or appsettings values.
- Partial rollout with incompatible settings.
Resolution:
1. Compare deployed configuration to approved baseline.
2. Roll back to last known-good release if production impact is high.
3. Redeploy with corrected configuration.
4. Document root cause and preventive controls.
## Escalation
If a Sev 1 or Sev 2 condition cannot be resolved with the steps above, follow the escalation and incident procedures in the [Operations Runbook](runbook.md).


@@ -66,4 +66,4 @@ If rollout exposes unexpected persistent degradation:
3. Keep full peer removal for nodes that are confirmed decommissioned.
For a detailed operator procedure, see
[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html).
[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md).