diff --git a/README.md b/README.md index 34a2601..0c246f1 100755 --- a/README.md +++ b/README.md @@ -5,7 +5,6 @@ **Peer-to-Peer Data Synchronization Middleware for .NET** [![.NET Version](https://img.shields.io/badge/.NET-8.0%20%7C%2010.0-purple)](https://dotnet.microsoft.com/) -[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE) ## Status ![Version](https://img.shields.io/badge/version-1.0.0-blue.svg) @@ -22,9 +21,18 @@ CBDDC is not a database - it's a **sync layer** that plugs into your existing da ## Table of Contents - [Overview](#overview) +- [Ownership and Support](#ownership-and-support) - [Architecture](#architecture) - [Key Features](#key-features) - [Installation](#installation) +- [Prerequisites](#prerequisites) +- [Configuration and Secrets](#configuration-and-secrets) +- [Build, Test, and Quality Gates](#build-test-and-quality-gates) +- [Deployment and Rollback](#deployment-and-rollback) +- [Operations and Incident Response](#operations-and-incident-response) +- [Security and Compliance Posture](#security-and-compliance-posture) +- [Troubleshooting](#troubleshooting) +- [Change Governance](#change-governance) - [Quick Start](#quick-start) - [Integrating with Your Database](#integrating-with-your-database) - [Cloud Deployment](#cloud-deployment) @@ -32,7 +40,6 @@ CBDDC is not a database - it's a **sync layer** that plugs into your existing da - [Use Cases](#use-cases) - [Documentation](#documentation) - [Contributing](#contributing) -- [License](#license) --- @@ -50,6 +57,15 @@ Your application continues to read and write to its database as usual. 
CBDDC wor --- +## Ownership and Support + +- **Owning team**: CBDDC Core Maintainers +- **Issue reporting**: [GitHub Issues](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) +- **Usage/support questions**: [GitHub Discussions](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/discussions) +- **Incident escalation for active deployments**: Follow your local on-call process first, then attach CBDDC diagnostics from `docs/runbook.md` + +--- + ## Architecture ``` @@ -163,6 +179,107 @@ dotnet add package CBDDC.Network --- +## Prerequisites + +Before onboarding a new node, verify: + +- .NET SDK/runtime required by selected packages (recommended: .NET 8+) +- Network access for configured TCP/UDP sync ports +- Persistent storage path with write permissions for database and backups +- Access to deployment secrets (cluster `AuthToken`, connection strings, environment-specific config) +- Ability to run health checks (`/health` in hosted mode) + +--- + +## Configuration and Secrets + +CBDDC runtime behavior should be configured from environment-specific settings, not hardcoded values. + +- Required runtime settings: + - `NodeId` + - `TcpPort` (and `UdpPort` when discovery is enabled) + - `AuthToken` (secret; do not commit) + - Persistence connection/database path +- Secret handling guidance: + - Keep secrets in a secret manager or protected environment variables + - Rotate tokens during maintenance windows and validate mesh convergence after rotation + - Avoid exposing tokens in logs, screenshots, or troubleshooting output + +See `docs/security.md` and `docs/production-hardening.md` for detailed control guidance. 
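A preflight check in the node's startup script is one way to enforce the required settings listed above. The sketch below assumes the settings arrive as environment variables. The `CBDDC_*` variable names and the `check_required_settings` helper are hypothetical and are not part of CBDDC. It prints only the names of missing settings, never their values, so tokens stay out of logs.

```bash
# Hypothetical preflight helper (not a CBDDC API).
# Returns non-zero if any of the named environment variables is unset or empty.
check_required_settings() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name; only the name is ever printed.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required setting: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call with assumed variable names:
#   check_required_settings CBDDC_NODE_ID CBDDC_TCP_PORT CBDDC_AUTH_TOKEN CBDDC_DB_PATH \
#     || { echo "refusing to start node" >&2; exit 1; }
```

Failing fast before the node joins the mesh may help surface configuration drift earlier than a failed peer handshake would.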
+ +--- + +## Build, Test, and Quality Gates + +Run these checks before merge or release: + +```bash +dotnet restore +dotnet build +dotnet test +``` + +Recommended release gates: + +- Unit/integration tests pass +- Health endpoint reports healthy in staging after deployment +- No unresolved high-severity issues in deployment or runbook checklists + +--- + +## Deployment and Rollback + +Canonical deployment procedures are documented in: + +- `docs/deployment.md` (promotion flow, validation gates, rollback path) +- `docs/deployment-lan.md` (LAN deployment details) +- `docs/deployment-modes.md` (hosted mode behavior) +- `docs/production-hardening.md` (operational hardening controls) + +--- + +## Operations and Incident Response + +Use `docs/runbook.md` as the primary operational runbook. It includes: + +- Alert triage sequence +- Escalation path and severity mapping +- Recovery and verification steps +- Post-incident follow-up expectations + +For peer lifecycle incidents, use `docs/peer-deprecation-removal-runbook.md`. + +--- + +## Security and Compliance Posture + +- Primary protections: authenticated peer communication and encrypted transport options in CBDDC networking components +- Sensitive data handling: classify data by environment and apply least privilege on runtime access +- Operational controls: hardening, key rotation, and monitoring requirements are documented in `docs/security.md` and `docs/access.md` + +--- + +## Troubleshooting + +Common issue patterns and step-by-step remediation are documented in `docs/troubleshooting.md`. 
+ +High-priority troubleshooting topics: + +- Peer unreachable or lagging confirmation state +- Persistence faults and backup restore flow +- Authentication mismatch or configuration drift + +--- + +## Change Governance + +- Use short-lived branches and pull requests for every change +- Require review before merge to protected branches +- Run build/test/health validation before release +- Record release-impacting changes in `CHANGELOG.md` and link related deployment notes + +--- + ## Quick Start ### 1. Define Your Database Context @@ -608,13 +725,14 @@ Console.WriteLine($"Peers: {status.ConnectedPeers}"); - **[Architecture & Concepts](docs/architecture.md)** - HLC, Gossip, Vector Clocks, Hash Chains - **[Conflict Resolution](docs/conflict-resolution.md)** - LWW vs Recursive Merge -- **[Oplog & CDC](docs/oplog-cdc.md)** - How change tracking works +- **[Oplog & CDC Design](docs/database-sync-manager-design.md)** - How change tracking works ### Deployment - **[Production Guide](docs/production-hardening.md)** - Configuration, monitoring, best practices -- **[Cloud Deployment](docs/cloud-deployment.md)** - ASP.NET Core hosting +- **[Deployment Runbook](docs/deployment.md)** - Promotion flow, validation, rollback - **[Deployment Modes](docs/deployment-modes.md)** - Single-cluster deployment strategy +- **[Operations Runbook](docs/runbook.md)** - Incident triage and recovery ### API @@ -711,41 +829,3 @@ dotnet run ### Code of Conduct Be respectful, inclusive, and constructive. We're all here to learn and build great software together. - ---- - -## License - -CBDDC is licensed under the **MIT License**. 
- -``` -MIT License - -Copyright (c) 2026 MrDevRobot - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software... -``` - -See [LICENSE](LICENSE) file for full details. - ---- - -## Give it a Star! - -If you find CBDDC useful, please **give it a star** on GitHub! It helps others discover the project and motivates us to keep improving it. - -
- -### [Star on GitHub](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net) - -**Thank you for your support!** - -**Built with care for the .NET community** - -[Report Bug](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) | [Request Feature](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) | [Discussions](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/discussions) - -
diff --git a/docs/README.md b/docs/README.md index 7d2c457..9fa2cf3 100755 --- a/docs/README.md +++ b/docs/README.md @@ -1,73 +1,45 @@ -# CBDDC Documentation - -This folder contains the official documentation for CBDDC, published as GitHub Pages. - -## Documentation Structure (v0.9.0) - -### Getting Started -- **[Getting Started](getting-started.md)** - Installation, setup, and first steps with CBDDC - -### Core Documentation -- **[Architecture](architecture.md)** - Hybrid Logical Clocks, Gossip Protocol, mesh networking -- **[API Reference](api-reference.md)** - Complete API documentation with examples -- **[Querying](querying.md)** - Data querying patterns and LINQ support - -### Persistence & Storage -- **[Persistence Providers](persistence-providers.md)** - SQLite, EF Core, PostgreSQL comparison -- **[Deployment Modes](deployment-modes.md)** - Single-cluster deployment strategy - -### Networking & Security -- **[Security](security.md)** - Encryption, authentication, secure networking -- **[Conflict Resolution](conflict-resolution.md)** - LWW and Recursive Merge strategies -- **[Network Telemetry](network-telemetry.md)** - Monitoring and diagnostics -- **[Dynamic Reconfiguration](dynamic-reconfiguration.md)** - Runtime configuration changes -- **[Remote Peer Configuration](remote-peer-configuration.md)** - Managing remote peers and tracking lifecycle -- **[Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)** - Rollout notes and adoption checklist +# CBDDC Documentation Index -### Deployment & Operations -- **[Deployment (LAN)](deployment-lan.md)** - Platform-specific deployment guide -- **[Production Hardening](production-hardening.md)** - Configuration, monitoring, best practices -- **[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md)** - Operational workflow for de-tracking and removal - -## Building the Documentation - -This documentation uses Jekyll with the Cayman theme and is automatically published via GitHub Pages. 
- -### Local Development - -```bash -# Install Jekyll (requires Ruby) -gem install bundler jekyll - -# Serve documentation locally -cd docs -jekyll serve - -# Open http://localhost:4000 -``` - -### Site Configuration - -- **_config.yml** - Jekyll configuration and site metadata -- **_layouts/default.html** - Main page layout with navigation and styling -- **_includes/nav.html** - Top navigation bar -- **_data/navigation.yml** - Sidebar navigation structure per version - -## Version History - -The documentation supports multiple versions: -- **v0.9** (current) - Latest stable release -- **v0.8** - ASP.NET Core hosting, EF Core, PostgreSQL support -- **v0.7** - Brotli compression, Protocol v4 -- **v0.6** - Secure networking, conflict resolution strategies - -## Contributing to Documentation - -1. **Fix typos or improve clarity** - Submit PRs directly -2. **Add new guides** - Create a new .md file and update navigation.yml -3. **Test locally** - Run Jekyll locally to preview changes before submitting - -## Links - -- [Main Repository](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net) -- [NuGet Packages](https://www.nuget.org/packages?q=CBDDC) +This folder contains operational and engineering documentation for CBDDC. 
+ +## Core Documentation + +- [Architecture](architecture.md) +- [Deployment](deployment.md) +- [Runbook](runbook.md) +- [Security](security.md) +- [Access and Permissions](access.md) +- [Troubleshooting](troubleshooting.md) + +## Feature Documentation + +- [Feature Inventory](features/README.md) +- [Selective Collection Sync](features/selective-collection-sync.md) +- [Peer-to-Peer Gossip Sync](features/peer-to-peer-gossip-sync.md) +- [Secure Peer Transport](features/secure-peer-transport.md) +- [Peer-Confirmed Pruning](features/peer-confirmed-pruning.md) + +## Additional Technical References + +- [Getting Started](getting-started.md) +- [API Reference](api-reference.md) +- [Conflict Resolution](conflict-resolution.md) +- [Persistence Providers](persistence-providers.md) +- [Network Telemetry](network-telemetry.md) +- [Dynamic Reconfiguration](dynamic-reconfiguration.md) +- [Remote Peer Configuration](remote-peer-configuration.md) +- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md) +- [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md) +- [Database Sync Manager Design](database-sync-manager-design.md) + +## Deployment-Specific Guides + +- [Deployment Modes](deployment-modes.md) +- [Deployment (LAN)](deployment-lan.md) +- [Production Hardening](production-hardening.md) + +## Documentation Conventions + +- Use lowercase kebab-case for markdown files. +- Keep canonical operational docs at the top level under `docs/`. +- Store major feature documentation in `docs/features/` and keep `docs/features/README.md` updated. diff --git a/docs/access.md b/docs/access.md new file mode 100644 index 0000000..1a2d257 --- /dev/null +++ b/docs/access.md @@ -0,0 +1,44 @@ +# Access and Permissions + +This document defines the least-privilege access model for CBDDC environments. 
+ +## Roles + +| Role | Typical Permissions | Approval Required | +|------|---------------------|-------------------| +| Runtime Operator | Read health/logs, restart service, run incident checks | Team lead or on-call manager | +| Deployment Engineer | Deploy approved releases, update runtime configuration | Change approval for production | +| Security Administrator | Manage secrets, rotate tokens, review access | Security approval | +| Maintainer | Modify CBDDC source/docs, merge reviewed changes | Pull request review | + +## Least-Privilege Rules + +- Grant access by role, not by individual preference. +- Use environment-specific credentials and scoped service accounts. +- Do not share production credentials across environments. +- Remove elevated access promptly after incident or change window. + +## Approval Flow + +1. Request access with role, environment, and business reason. +2. Approver validates least-privilege scope. +3. Access is granted with expiration date when applicable. +4. Grant/revoke events are logged for auditability. + +## Periodic Access Review + +- Review active privileged access at least quarterly. +- Remove dormant or unowned accounts immediately. +- Validate that emergency access accounts are controlled and monitored. + +## Secret Handling + +- Store `AuthToken`, connection strings, and credentials in approved secret stores. +- Never commit secrets to source control. +- Rotate secrets after incidents and on scheduled cadence. + +## Related Documents + +- [Security](security.md) +- [Runbook](runbook.md) +- [Production Hardening](production-hardening.md) diff --git a/docs/adr/0001-peer-mesh-topology.md b/docs/adr/0001-peer-mesh-topology.md new file mode 100644 index 0000000..2388131 --- /dev/null +++ b/docs/adr/0001-peer-mesh-topology.md @@ -0,0 +1,19 @@ +# ADR 0001: Peer Mesh Topology + +## Status + +Accepted + +## Context + +CBDDC requires local-first synchronization without a central coordinator in LAN-focused deployments. 
+ +## Decision + +Adopt a peer mesh topology with gossip-based synchronization and per-node oplog tracking. + +## Consequences + +- Improves resilience when individual nodes disconnect. +- Requires strong operational tooling for lag, peer lifecycle, and pruning controls. +- Increases importance of clear runbook and troubleshooting documentation. diff --git a/docs/conflict-resolution.md b/docs/conflict-resolution.md index 485b785..df10ac8 100755 --- a/docs/conflict-resolution.md +++ b/docs/conflict-resolution.md @@ -246,6 +246,6 @@ A: No. Conflict resolution assumes both documents have compatible schemas. --- **See Also:** -- [Getting Started](getting-started.html) -- [Architecture](architecture.html) -- [Security](security.html) +- [Getting Started](getting-started.md) +- [Architecture](architecture.md) +- [Security](security.md) diff --git a/docs/deployment.md b/docs/deployment.md new file mode 100644 index 0000000..8d0946e --- /dev/null +++ b/docs/deployment.md @@ -0,0 +1,61 @@ +# Deployment + +This document defines the canonical CBDDC deployment workflow, validation gates, and rollback procedures. + +## Environments + +- Local development: single node for functional verification. +- Staging: multi-node mesh or hosted mode with production-like configuration. +- Production: controlled rollout with health checks, monitoring, and rollback readiness. + +## Promotion Workflow + +1. Build and test in CI (`dotnet build`, `dotnet test`). +2. Validate configuration and secrets for the target environment. +3. Deploy to staging. +4. Run smoke checks: + - Node starts and joins expected peers. + - `/health` reports healthy. + - Sync operations replicate across target collections. +5. Promote to production using approved change window. + +## Validation Gates + +- No failed test suites. +- No unresolved critical incidents. +- Health check status is healthy or approved degraded state with mitigation. +- Backup/restore path confirmed for the active persistence provider. 
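The health gate above ("healthy or approved degraded state with mitigation") could be encoded in a promotion script roughly as follows. This is a sketch under assumptions: it supposes the health endpoint returns a plain status string of `Healthy`, `Degraded`, or `Unhealthy` (the default text response of ASP.NET Core health checks). The `health_gate` helper and the URL in the usage comment are hypothetical.

```bash
# Hypothetical promotion gate (not a CBDDC API).
# Usage: health_gate STATUS [APPROVED_DEGRADED]
#   STATUS            status string reported by the health endpoint
#   APPROVED_DEGRADED "yes" if a degraded state has an approved mitigation
health_gate() {
  health=$1
  approved=${2:-no}
  case "$health" in
    Healthy)  return 0 ;;
    # Degraded passes only when explicitly approved.
    Degraded) [ "$approved" = "yes" ] ;;
    # Unhealthy or any unrecognized payload blocks promotion.
    *)        return 1 ;;
  esac
}

# Example call against an assumed staging URL:
#   health_gate "$(curl -fsS http://staging-node:5000/health)" || echo "gate failed"
```

Treating unknown payloads as failures keeps the gate conservative if the response format differs from what is assumed here.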
+ +## Rollback Triggers + +Rollback immediately when any of the following occurs: + +- Sync correctness regression (missed or duplicate replication events). +- Persistent unhealthy status after remediation attempt. +- Authentication or connectivity failure across required peers. +- Data integrity concern reported by validation checks. + +## Rollback Procedure + +1. Stop traffic or disable rollout to newly deployed nodes. +2. Revert to the last known-good build. +3. Restore previous configuration and secret set. +4. If needed, restore persistence snapshot/backup. +5. Re-run health and replication smoke checks. +6. Record incident details in the runbook timeline. + +## Emergency Changes + +Use emergency deployment only for incident containment: + +1. Capture incident reference and approver. +2. Apply minimal scoped change. +3. Run abbreviated smoke checks (startup, health, critical replication path). +4. Follow up with standard post-incident review. + +## Related Documents + +- [Deployment Modes](deployment-modes.md) +- [Deployment (LAN)](deployment-lan.md) +- [Production Hardening](production-hardening.md) +- [Runbook](runbook.md) diff --git a/docs/features/README.md b/docs/features/README.md new file mode 100644 index 0000000..3aa9a29 --- /dev/null +++ b/docs/features/README.md @@ -0,0 +1,16 @@ +# Major Feature Inventory + +This index tracks CBDDC major functionality. Each feature has one canonical document. + +## Features + +- [Selective Collection Sync](selective-collection-sync.md) +- [Peer-to-Peer Gossip Sync](peer-to-peer-gossip-sync.md) +- [Secure Peer Transport](secure-peer-transport.md) +- [Peer-Confirmed Pruning](peer-confirmed-pruning.md) + +## Maintenance Rules + +- Keep one file per major feature. +- Update feature docs when APIs, operations, or security controls change. +- Cross-link each feature to [Runbook](../runbook.md) and [Security](../security.md). 
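The cross-link maintenance rule above lends itself to an automated check. The sketch below assumes feature docs sit in one directory and link with the literal relative targets `../runbook.md` and `../security.md`. The `check_feature_links` helper is hypothetical and is not part of the repository.

```bash
# Hypothetical docs lint (not part of the repository).
# Reports feature docs in DIR that lack a link to the runbook or security doc.
check_feature_links() {
  dir=$1
  rc=0
  for file in "$dir"/*.md; do
    [ -e "$file" ] || continue          # directory has no .md files
    case "$file" in
      */README.md) continue ;;          # the index itself is exempt
    esac
    for target in ../runbook.md ../security.md; do
      # Fixed-string match on the Markdown link target, e.g. "(../runbook.md)".
      if ! grep -qF "($target)" "$file"; then
        echo "$file: missing link to $target"
        rc=1
      fi
    done
  done
  return "$rc"
}

# Example call from the repository root:
#   check_feature_links docs/features
```

Run as a CI step, a check like this could flag a new feature doc that has not yet been cross-linked.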
diff --git a/docs/features/peer-confirmed-pruning.md b/docs/features/peer-confirmed-pruning.md new file mode 100644 index 0000000..68b8202 --- /dev/null +++ b/docs/features/peer-confirmed-pruning.md @@ -0,0 +1,68 @@ +# Feature: Peer-Confirmed Pruning + +## Purpose and Business Outcome + +Prune oplog history safely while preventing data loss for active peers that have not confirmed required streams. + +## Scope and Non-Goals + +Scope: + +- Retention cutoff plus peer confirmation cutoff logic. +- Operational controls for peer tracking and de-tracking. + +Non-goals: + +- Automatic removal of retired peers without operator action. +- Replacement for backup/restore strategy. + +## User and System Workflows + +1. Maintenance job calculates retention and confirmation cutoffs. +2. System blocks pruning when required confirmations are missing. +3. Operator de-tracks retired peers when appropriate. +4. Pruning resumes once constraints are satisfied. + +## Interfaces, APIs, and Events Involved + +- Maintenance pruning scheduler +- Peer tracking operations (`RemovePeerTrackingAsync`, `RemoveRemotePeerAsync`) +- Health metrics for lag and missing confirmations + +## Permissions and Data Handling + +- Only approved operators should modify peer tracking state. +- Pruning decisions must be auditable through logs and incident notes. + +## Dependencies and Failure Modes + +Dependencies: + +- Accurate peer tracking metadata +- Timely confirmation updates per source stream + +Failure modes: + +- Pruning blocked indefinitely by stale peer tracking +- Unsafe pruning if controls are bypassed + +## Monitoring, Alerts, and Troubleshooting Pointers + +- Monitor peers with missing confirmations and sustained lag. +- Use [Peer Deprecation and Removal Runbook](../peer-deprecation-removal-runbook.md) for operational actions. + +## Rollout and Change Considerations + +- Introduce pruning policy changes with explicit maintenance windows. 
+- Validate expected cutoff behavior in staging before production rollout. + +## Validation and Testability Guidance + +- Add tests for blocked prune when confirmations are missing. +- Add tests for resumed prune after de-tracking retired peers. +- Smoke test health status transitions around peer lifecycle changes. + +## Related Security Controls + +- [Security](../security.md) +- [Runbook](../runbook.md) diff --git a/docs/features/peer-to-peer-gossip-sync.md b/docs/features/peer-to-peer-gossip-sync.md new file mode 100644 index 0000000..9536af3 --- /dev/null +++ b/docs/features/peer-to-peer-gossip-sync.md @@ -0,0 +1,70 @@ +# Feature: Peer-to-Peer Gossip Sync + +## Purpose and Business Outcome + +Propagate updates across mesh nodes without a central coordinator so local-first applications remain resilient to intermittent connectivity. + +## Scope and Non-Goals + +Scope: + +- Peer discovery and sync orchestration. +- Push/pull propagation of oplog changes. + +Non-goals: + +- Strong global consistency guarantees. +- Public internet exposure without additional network controls. + +## User and System Workflows + +1. Node starts and discovers peers. +2. Node exchanges sync metadata with connected peers. +3. Missing operations are requested and applied. +4. Mesh converges over repeated gossip rounds. + +## Interfaces, APIs, and Events Involved + +- Sync orchestrator scheduling +- TCP peer sync channels +- Vector clock exchange and reconciliation + +## Permissions and Data Handling + +- Peers with valid authentication token can exchange replicated collection data. +- Cluster membership should be restricted to trusted nodes. 
+ +## Dependencies and Failure Modes + +Dependencies: + +- Network reachability +- Shared authentication material +- Healthy persistence layer + +Failure modes: + +- Peer isolation due to network outage +- Token mismatch blocking synchronization +- Sustained lag under high write pressure + +## Monitoring, Alerts, and Troubleshooting Pointers + +- Monitor `laggingPeers`, `maxLagMs`, and active peer counts. +- Follow [Runbook](../runbook.md) playbooks for lagging or disconnected peers. + +## Rollout and Change Considerations + +- Roll out protocol-affecting changes in a controlled window. +- Confirm backward/forward compatibility in staging mesh before production rollout. + +## Validation and Testability Guidance + +- Run multi-node integration tests with controlled partitions. +- Validate eventual convergence after reconnect. +- Verify no data-loss under repeated reconnect scenarios. + +## Related Security Controls + +- [Security](../security.md) +- [Access and Permissions](../access.md) diff --git a/docs/features/secure-peer-transport.md b/docs/features/secure-peer-transport.md new file mode 100644 index 0000000..6cfc532 --- /dev/null +++ b/docs/features/secure-peer-transport.md @@ -0,0 +1,67 @@ +# Feature: Secure Peer Transport + +## Purpose and Business Outcome + +Protect replicated data in transit with authenticated and encrypted peer communication. + +## Scope and Non-Goals + +Scope: + +- Secure handshake and key establishment. +- Message confidentiality and integrity controls. + +Non-goals: + +- Data-at-rest encryption. +- Full identity and certificate lifecycle management. + +## User and System Workflows + +1. Operator enables secure transport components. +2. Peers perform handshake and establish session keys. +3. Replication traffic is encrypted/authenticated. +4. Health and logs expose secure mode status. 
+ +## Interfaces, APIs, and Events Involved + +- `IPeerHandshakeService` / secure handshake implementation +- Network pipeline message encryption and HMAC validation +- Startup configuration for secure mode + +## Permissions and Data Handling + +- Secret material (`AuthToken`, key inputs) must be restricted to authorized operators. +- Logs must avoid plaintext secret disclosure. + +## Dependencies and Failure Modes + +Dependencies: + +- Consistent security mode across peers +- Valid runtime cryptographic dependencies + +Failure modes: + +- Secure/plaintext mode mismatch +- Handshake failure due to key/token mismatch + +## Monitoring, Alerts, and Troubleshooting Pointers + +- Alert on repeated handshake failures. +- Use [Runbook](../runbook.md) for incident triage and [Troubleshooting](../troubleshooting.md) for remediation. + +## Rollout and Change Considerations + +- Enable secure mode in staging first. +- Roll production nodes in controlled order to avoid mixed-mode partitions. + +## Validation and Testability Guidance + +- Add tests for secure-to-secure success and mixed-mode rejection. +- Validate encrypted cluster startup and sync with production-like load. + +## Related Security Controls + +- [Security](../security.md) +- [Access and Permissions](../access.md) diff --git a/docs/features/selective-collection-sync.md b/docs/features/selective-collection-sync.md new file mode 100644 index 0000000..b3afffe --- /dev/null +++ b/docs/features/selective-collection-sync.md @@ -0,0 +1,69 @@ +# Feature: Selective Collection Sync + +## Purpose and Business Outcome + +Allow teams to replicate only selected collections so bandwidth and operational overhead stay aligned to business-critical data. + +## Scope and Non-Goals + +Scope: + +- Register collections for replication using `WatchCollection()`. +- Replicate changes for registered collections across peers. + +Non-goals: + +- Automatic replication of all database collections. +- Schema migration management. 
+ +## User and System Workflows + +1. Developer registers target collections in the document store. +2. Local writes trigger CDC events. +3. Oplog entries propagate through peer sync. +4. Remote peers apply updates for matching collections. + +## Interfaces, APIs, and Events Involved + +- `WatchCollection(collectionName, collection, keySelector)` +- CDC trigger pipeline +- Oplog append and apply operations + +## Permissions and Data Handling + +- Access to source collections is controlled by host application permissions. +- Only approved collections should be registered for sync in sensitive environments. + +## Dependencies and Failure Modes + +Dependencies: + +- Correct collection registration +- Stable peer connectivity +- Persistence availability + +Failure modes: + +- Missed replication due to unregistered collection +- Delayed propagation during network partition + +## Monitoring, Alerts, and Troubleshooting Pointers + +- Monitor replication lag and peer confirmation metrics. +- Use [Runbook](../runbook.md) and [Troubleshooting](../troubleshooting.md) for incident response. + +## Rollout and Change Considerations + +- Introduce new synced collections behind staged rollout. +- Validate downstream consumer compatibility before production enablement. + +## Validation and Testability Guidance + +- Add integration tests verifying only registered collections replicate. +- Smoke test by writing to registered and non-registered collections and confirming expected behavior. +- Validate no unexpected collection appears in remote peers after deployment. 
+ +## Related Security Controls + +- [Security](../security.md) +- [Access and Permissions](../access.md) diff --git a/docs/getting-started.md b/docs/getting-started.md index 832a1e3..da2a82d 100755 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -179,10 +179,10 @@ Choose your strategy: ## Next Steps -- [Architecture Overview](architecture.html) - Understand HLC, Gossip Protocol, and mesh networking -- [Persistence Providers](persistence-providers.html) - Choose the right database for your deployment -- [Deployment Modes](deployment-modes.html) - Single-cluster deployment strategy -- [Security Configuration](security.html) - Encryption and authentication -- [Conflict Resolution Strategies](conflict-resolution.html) - LWW vs Recursive Merge -- [Production Hardening](production-hardening.html) - Best practices and monitoring -- [API Reference](api-reference.html) - Complete API documentation +- [Architecture Overview](architecture.md) - Understand HLC, Gossip Protocol, and mesh networking +- [Persistence Providers](persistence-providers.md) - Choose the right database for your deployment +- [Deployment Modes](deployment-modes.md) - Single-cluster deployment strategy +- [Security Configuration](security.md) - Encryption and authentication +- [Conflict Resolution Strategies](conflict-resolution.md) - LWW vs Recursive Merge +- [Production Hardening](production-hardening.md) - Best practices and monitoring +- [API Reference](api-reference.md) - Complete API documentation diff --git a/docs/index.md b/docs/index.md index 89f0ee1..6501465 100755 --- a/docs/index.md +++ b/docs/index.md @@ -1,57 +1,44 @@ ---- -layout: default ---- - -# CBDDC Documentation - -Welcome to the CBDDC documentation for **version 0.9.0**. - -## What's New in v0.9.0 - -Version 0.9.0 brings enhanced ASP.NET Core support, improved EF Core runtime stability, and refined synchronization and persistence layers. 
- -- **ASP.NET Core Enhancements**: Improved sample application with better error handling -- **EF Core Stability**: Fixed runtime issues and improved persistence layer reliability -- **Sync & Persistence**: Enhanced stability across all persistence providers -- **Production Ready**: All features tested and stable for production use - -See the [CHANGELOG](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) for complete details. - -## Documentation - -### Getting Started -* [**Getting Started**](getting-started.html) - Installation, basic setup, and your first CBDDC application - -### Core Concepts -* [Architecture](architecture.html) - Understanding HLC, Gossip Protocol, and P2P mesh networking -* [API Reference](api-reference.html) - Complete API documentation and examples -* [Querying](querying.html) - Data querying patterns and LINQ support - -### Persistence & Storage -* [Persistence Providers](persistence-providers.html) - SQLite, EF Core, PostgreSQL comparison and configuration -* [Deployment Modes](deployment-modes.html) - Single-cluster deployment strategy - -### Networking & Security -* [Security](security.html) - Encryption, authentication, and secure networking -* [Conflict Resolution](conflict-resolution.html) - LWW and Recursive Merge strategies -* [Network Telemetry](network-telemetry.html) - Monitoring and diagnostics -* [Dynamic Reconfiguration](dynamic-reconfiguration.html) - Runtime configuration and leader election -* [Remote Peer Configuration](remote-peer-configuration.html) - Managing remote peers and tracking lifecycle -* [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.html) - Adoption and rollout notes +--- +layout: default +--- -### Deployment & Operations -* [Deployment (LAN)](deployment-lan.html) - Platform-specific deployment instructions -* [Production Hardening](production-hardening.html) - Configuration, monitoring, and best practices -* [Peer Deprecation & Removal 
Runbook](peer-deprecation-removal-runbook.html) - Operational deprecation/removal workflow - -## Previous Versions - -- [v0.8.x Documentation](v0.8/getting-started.html) -- [v0.7.x Documentation](v0.7/getting-started.html) -- [v0.6.x Documentation](v0.6/getting-started.html) - -## Links - -* [GitHub Repository](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net) -* [NuGet Packages](https://www.nuget.org/packages?q=CBDDC) -* [Changelog](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) +# CBDDC Documentation + +Welcome to the CBDDC documentation site. + +## Canonical Operational Docs + +- [Architecture](architecture.md) +- [Deployment](deployment.md) +- [Runbook](runbook.md) +- [Security](security.md) +- [Access and Permissions](access.md) +- [Troubleshooting](troubleshooting.md) + +## Feature Docs + +- [Feature Inventory](features/README.md) +- [Selective Collection Sync](features/selective-collection-sync.md) +- [Peer-to-Peer Gossip Sync](features/peer-to-peer-gossip-sync.md) +- [Secure Peer Transport](features/secure-peer-transport.md) +- [Peer-Confirmed Pruning](features/peer-confirmed-pruning.md) + +## Additional References + +- [Getting Started](getting-started.md) +- [API Reference](api-reference.md) +- [Persistence Providers](persistence-providers.md) +- [Deployment Modes](deployment-modes.md) +- [Deployment (LAN)](deployment-lan.md) +- [Production Hardening](production-hardening.md) +- [Conflict Resolution](conflict-resolution.md) +- [Network Telemetry](network-telemetry.md) +- [Dynamic Reconfiguration](dynamic-reconfiguration.md) +- [Remote Peer Configuration](remote-peer-configuration.md) +- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md) +- [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md) +- [Database Sync Manager Design](database-sync-manager-design.md) + +## Release History + +See the [CHANGELOG](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) for version-level changes. 
diff --git a/docs/remote-peer-configuration.md b/docs/remote-peer-configuration.md
index a8f0728..5621830 100755
--- a/docs/remote-peer-configuration.md
+++ b/docs/remote-peer-configuration.md
@@ -104,5 +104,5 @@ Fields stored per peer:
   which removes the peer from active prune-gating.
 - Re-observed peers can be re-registered and become active for tracking again.
 - For rollout steps and production operations, see:
-  - [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.html)
-  - [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html)
+  - [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
+  - [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md)
diff --git a/docs/runbook.md b/docs/runbook.md
new file mode 100644
index 0000000..d53e16f
--- /dev/null
+++ b/docs/runbook.md
@@ -0,0 +1,58 @@
+# Operations Runbook
+
+This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.
+
+## Ownership and Escalation
+
+- Service owner: CBDDC Core Maintainers.
+- First response: local platform/application on-call team for the affected deployment.
+- Product issue escalation: open an incident issue in the CBDDC repository with logs and health payload.
+
+## Alert Triage
+
+1. Identify severity based on impact:
+   - Sev 1: Data integrity risk, sustained outage, or broad replication failure.
+   - Sev 2: Partial sync degradation or prolonged peer lag.
+   - Sev 3: Isolated node issue with workaround.
+2. Confirm current `cbddc` health check status and payload.
+3. Identify affected peers, collections, and first observed time.
+4. Apply the relevant recovery play below.
+
+## Core Diagnostics
+
+Capture these artifacts before remediation:
+
+- Health response payload (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`).
+- Application logs for sync, persistence, and network components.
+- Current runtime configuration (excluding secrets).
+- Most recent deployment identifier and change window.
+
+## Recovery Plays
+
+### Peer unreachable or lagging
+
+1. Verify network path and auth token consistency.
+2. Validate peer is still expected in topology.
+3. If peer is retired, follow [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
+4. Recheck health status after remediation.
+
+### Persistence failure
+
+1. Verify storage path and permissions.
+2. Run integrity checks.
+3. Restore from latest valid backup if corruption is confirmed.
+4. Validate replication behavior after restore.
+
+### Configuration drift
+
+1. Compare deployed config to approved baseline.
+2. Reapply canonical settings.
+3. Restart affected service safely.
+4. Verify recovery with health and smoke checks.
+
+## Post-Incident Actions
+
+1. Record root cause and timeline.
+2. Add follow-up work items (tests, alerts, docs updates).
+3. Update affected feature docs and troubleshooting guidance.
+4. Confirm rollback and recovery instructions remain accurate.
diff --git a/docs/security.md b/docs/security.md
index 6028bcb..f44fbb1 100755
--- a/docs/security.md
+++ b/docs/security.md
@@ -140,6 +140,6 @@ A: Security works on all target frameworks (netstandard2.0, net6.0, net8.0) with
 ---
 **See Also:**
-- [Getting Started](getting-started.html)
-- [Architecture](architecture.html)
-- [Production Hardening](production-hardening.html)
+- [Getting Started](getting-started.md)
+- [Architecture](architecture.md)
+- [Production Hardening](production-hardening.md)
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
new file mode 100644
index 0000000..0877f47
--- /dev/null
+++ b/docs/troubleshooting.md
@@ -0,0 +1,86 @@
+# Troubleshooting
+
+This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.
+
+## Peer Cannot Connect
+
+Symptoms:
+
+- Node remains disconnected from expected peers.
+- Health check reports lagging or unconfirmed peers.
+
+Likely causes:
+
+- Network path blocked (port/firewall mismatch).
+- `AuthToken` mismatch.
+- Peer configuration drift.
+
+Resolution:
+
+1. Verify TCP/UDP port configuration on both peers.
+2. Confirm shared token and node identity settings.
+3. Restart peer service and monitor logs.
+4. Recheck `cbddc` health payload.
+
+## Replication Delay or Missing Updates
+
+Symptoms:
+
+- Writes are visible locally but not on remote peers.
+- `maxLagMs` grows continuously.
+
+Likely causes:
+
+- Retired peer still tracked and gating pruning.
+- High load or transient network instability.
+- Invalid collection watch configuration.
+
+Resolution:
+
+1. Confirm affected collections are registered with `WatchCollection()`.
+2. Inspect peer confirmation metrics.
+3. If needed, de-track retired peers using the runbook.
+4. Re-run smoke sync validation after changes.
+
+## Persistence Errors
+
+Symptoms:
+
+- Startup or write failures from persistence layer.
+- Unhealthy health check due to storage exceptions.
+
+Likely causes:
+
+- File/path permission errors.
+- Storage corruption.
+- Misconfigured provider settings.
+
+Resolution:
+
+1. Validate storage path and runtime permissions.
+2. Run integrity checks.
+3. Restore from latest good backup if corruption is detected.
+4. Validate read/write and replication after restore.
+
+## Configuration Regressions After Release
+
+Symptoms:
+
+- Behavior changed immediately after deployment.
+- Multiple nodes fail with same error pattern.
+
+Likely causes:
+
+- Incorrect environment variables or appsettings values.
+- Partial rollout with incompatible settings.
+
+Resolution:
+
+1. Compare deployed configuration to approved baseline.
+2. Roll back to last known-good release if production impact is high.
+3. Redeploy with corrected configuration.
+4. Document root cause and preventive controls.
+
+## Escalation
+
+If a Sev 1/Sev 2 condition cannot be resolved quickly, follow [Runbook](runbook.md) escalation and incident procedures.
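
The troubleshooting steps above repeatedly return to the `cbddc` health payload, and the runbook names its fields (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`). A small script could summarize that payload before escalation. The Python sketch below rests on assumptions the docs do not confirm: that the payload is JSON, that the two peer fields are arrays of peer identifiers, and that a 30-second lag threshold is meaningful for the deployment. `summarize_health` is a hypothetical helper, not a CBDDC API.

```python
import json


def summarize_health(payload_json: str, max_lag_threshold_ms: int = 30_000) -> dict:
    """Flag the conditions that the troubleshooting guide treats as symptoms.

    Assumption: `laggingPeers` and `peersWithNoConfirmation` are JSON arrays
    of peer identifiers, and `maxLagMs` is a number of milliseconds.
    """
    payload = json.loads(payload_json)
    lagging = [str(p) for p in (payload.get("laggingPeers") or [])]
    unconfirmed = [str(p) for p in (payload.get("peersWithNoConfirmation") or [])]
    max_lag_ms = int(payload.get("maxLagMs") or 0)

    findings = []
    if lagging:
        findings.append("lagging peers: " + ", ".join(lagging))
    if unconfirmed:
        findings.append("peers with no confirmation: " + ", ".join(unconfirmed))
    if max_lag_ms > max_lag_threshold_ms:
        findings.append(f"maxLagMs {max_lag_ms} exceeds {max_lag_threshold_ms}")

    return {
        "trackedPeerCount": payload.get("trackedPeerCount"),
        "needsAttention": bool(findings),
        "findings": findings,
    }
```

The summary deliberately does not assign a severity level: the runbook defines Sev 1 to Sev 3 by impact, which a payload alone cannot establish.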
diff --git a/docs/upgrade-peer-confirmed-pruning.md b/docs/upgrade-peer-confirmed-pruning.md
index af60963..bb4e0ee 100644
--- a/docs/upgrade-peer-confirmed-pruning.md
+++ b/docs/upgrade-peer-confirmed-pruning.md
@@ -66,4 +66,4 @@ If rollout exposes unexpected persistent degradation:
 3. Keep full peer removal for nodes that are confirmed decommissioned.
 For a detailed operator procedure, see
-[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html).
+[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md).