docs: align internal docs to enterprise standards

Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
2026-02-20 13:23:55 -05:00
parent e6d81f6350
commit ce727eb30d
18 changed files with 783 additions and 186 deletions
--- a/README.md
+++ b/README.md
@@ -5,7 +5,6 @@
 **Peer-to-Peer Data Synchronization Middleware for .NET**

 [![.NET Version](https://img.shields.io/badge/.NET-8.0%20%7C%2010.0-purple)](https://dotnet.microsoft.com/)
-[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

 ## Status
 ![Version](https://img.shields.io/badge/version-1.0.0-blue.svg)
@@ -22,9 +21,18 @@ CBDDC is not a database - it's a **sync layer** that plugs into your existing da
 ## Table of Contents

 - [Overview](#overview)
+- [Ownership and Support](#ownership-and-support)
 - [Architecture](#architecture)
 - [Key Features](#key-features)
 - [Installation](#installation)
+- [Prerequisites](#prerequisites)
+- [Configuration and Secrets](#configuration-and-secrets)
+- [Build, Test, and Quality Gates](#build-test-and-quality-gates)
+- [Deployment and Rollback](#deployment-and-rollback)
+- [Operations and Incident Response](#operations-and-incident-response)
+- [Security and Compliance Posture](#security-and-compliance-posture)
+- [Troubleshooting](#troubleshooting)
+- [Change Governance](#change-governance)
 - [Quick Start](#quick-start)
 - [Integrating with Your Database](#integrating-with-your-database)
 - [Cloud Deployment](#cloud-deployment)
@@ -32,7 +40,6 @@ CBDDC is not a database - it's a **sync layer** that plugs into your existing da
 - [Use Cases](#use-cases)
 - [Documentation](#documentation)
 - [Contributing](#contributing)
-- [License](#license)

 ---

@@ -50,6 +57,15 @@ Your application continues to read and write to its database as usual. CBDDC wor

 ---

+## Ownership and Support
+
+- **Owning team**: CBDDC Core Maintainers
+- **Issue reporting**: [GitHub Issues](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues)
+- **Usage/support questions**: [GitHub Discussions](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/discussions)
+- **Incident escalation for active deployments**: Follow your local on-call process first, then attach CBDDC diagnostics from `docs/runbook.md`
+
+---
+
 ## Architecture

 ```
@@ -163,6 +179,107 @@ dotnet add package CBDDC.Network

 ---

+## Prerequisites
+
+Before onboarding a new node, verify:
+
+- .NET SDK/runtime required by the selected packages (recommended: .NET 8+)
+- Network access for configured TCP/UDP sync ports
+- Persistent storage path with write permissions for database and backups
+- Access to deployment secrets (cluster `AuthToken`, connection strings, environment-specific config)
+- Ability to run health checks (`/health` in hosted mode)
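The storage item in the checklist above can be verified before a node starts. The following is a minimal sketch only: the `check_storage_path` helper and the `CBDDC_DATA_DIR` variable are illustrative assumptions, not part of CBDDC.

```bash
# Hypothetical preflight helper; names are illustrative, not part of CBDDC.

# Succeeds only if the path is an existing directory the current user can write to.
check_storage_path() {
  [ -d "$1" ] && [ -w "$1" ]
}

# Example: verify the storage path before starting a node.
# CBDDC_DATA_DIR is an assumed variable name; ./data is an arbitrary fallback.
data_dir="${CBDDC_DATA_DIR:-./data}"
if check_storage_path "$data_dir"; then
  echo "storage path OK: $data_dir"
else
  echo "storage path missing or not writable: $data_dir" >&2
fi
```

A check like this is cheap to run from a deployment script, so a misconfigured path surfaces before the node attempts its first write.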
+
+---
+
+## Configuration and Secrets
+
+CBDDC runtime behavior should be configured from environment-specific settings, not hardcoded values.
+
+- Required runtime settings:
+  - `NodeId`
+  - `TcpPort` (and `UdpPort` when discovery is enabled)
+  - `AuthToken` (secret; do not commit)
+  - Persistence connection/database path
+- Secret handling guidance:
+  - Keep secrets in a secret manager or protected environment variables
+  - Rotate tokens during maintenance windows and validate mesh convergence after rotation
+  - Avoid exposing tokens in logs, screenshots, or troubleshooting output
+
+See `docs/security.md` and `docs/production-hardening.md` for detailed control guidance.
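As a rough illustration of the guidance on keeping tokens out of logs and troubleshooting output, the sketch below masks a secret before printing it and fails fast when the token is absent. The `CBDDC_AUTH_TOKEN` variable name and both helper functions are assumptions made for this example; CBDDC does not define them.

```bash
# Hypothetical helpers; not part of CBDDC.

# Prints a masked form of a secret that is safe to include in logs.
# Only the first 4 characters are shown; values of 4 characters or fewer are fully masked.
redact_secret() {
  if [ "${#1}" -le 4 ]; then
    printf '****\n'
  else
    printf '%.4s****\n' "$1"
  fi
}

# Succeeds only if the token is present in the environment.
# CBDDC_AUTH_TOKEN is an assumed variable name.
require_auth_token() {
  [ -n "${CBDDC_AUTH_TOKEN:-}" ]
}
```

For example, `redact_secret "$CBDDC_AUTH_TOKEN"` can be used in a diagnostics script so that operators can compare token prefixes across nodes without exposing the full value.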
+
+---
+
+## Build, Test, and Quality Gates
+
+Run these checks before merge or release:
+
+```bash
+dotnet restore
+dotnet build
+dotnet test
+```
+
+Recommended release gates:
+
+- Unit/integration tests pass
+- Health endpoint reports healthy in staging after deployment
+- No unresolved high-severity issues in deployment or runbook checklists
+
+---
+
+## Deployment and Rollback
+
+Canonical deployment procedures are documented in:
+
+- `docs/deployment.md` (promotion flow, validation gates, rollback path)
+- `docs/deployment-lan.md` (LAN deployment details)
+- `docs/deployment-modes.md` (hosted mode behavior)
+- `docs/production-hardening.md` (operational hardening controls)
+
+---
+
+## Operations and Incident Response
+
+Use `docs/runbook.md` as the primary operational runbook. It includes:
+
+- Alert triage sequence
+- Escalation path and severity mapping
+- Recovery and verification steps
+- Post-incident follow-up expectations
+
+For peer lifecycle incidents, use `docs/peer-deprecation-removal-runbook.md`.
+
+---
+
+## Security and Compliance Posture
+
+- Primary protections: authenticated peer communication and encrypted transport options in CBDDC networking components
+- Sensitive data handling: classify data by environment and apply least privilege on runtime access
+- Operational controls: hardening, key rotation, and monitoring requirements are documented in `docs/security.md` and `docs/access.md`
+
+---
+
+## Troubleshooting
+
+Common issue patterns and step-by-step remediation are documented in `docs/troubleshooting.md`.
+
+High-priority troubleshooting topics:
+
+- Peers that are unreachable or lagging in confirmation state
+- Persistence faults and backup restore flow
+- Authentication mismatch or configuration drift
+
+---
+
+## Change Governance
+
+- Use short-lived branches and pull requests for every change
+- Require review before merge to protected branches
+- Run build/test/health validation before release
+- Record release-impacting changes in `CHANGELOG.md` and link related deployment notes
+
+---
+
 ## Quick Start

 ### 1. Define Your Database Context
@@ -608,13 +725,14 @@ Console.WriteLine($"Peers: {status.ConnectedPeers}");

 - **[Architecture & Concepts](docs/architecture.md)** - HLC, Gossip, Vector Clocks, Hash Chains
 - **[Conflict Resolution](docs/conflict-resolution.md)** - LWW vs Recursive Merge
-- **[Oplog & CDC](docs/oplog-cdc.md)** - How change tracking works
+- **[Oplog & CDC Design](docs/database-sync-manager-design.md)** - How change tracking works

 ### Deployment

 - **[Production Guide](docs/production-hardening.md)** - Configuration, monitoring, best practices
-- **[Cloud Deployment](docs/cloud-deployment.md)** - ASP.NET Core hosting
+- **[Deployment Runbook](docs/deployment.md)** - Promotion flow, validation, rollback
 - **[Deployment Modes](docs/deployment-modes.md)** - Single-cluster deployment strategy
+- **[Operations Runbook](docs/runbook.md)** - Incident triage and recovery

 ### API

@@ -711,41 +829,3 @@ dotnet run
 ### Code of Conduct

 Be respectful, inclusive, and constructive. We're all here to learn and build great software together.
-
----
-
-## License
-
-CBDDC is licensed under the **MIT License**.
-
-```
-MIT License
-
-Copyright (c) 2026 MrDevRobot
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software...
-```
-
-See [LICENSE](LICENSE) file for full details.
-
----
-
-## Give it a Star!
-
-If you find CBDDC useful, please **give it a star** on GitHub! It helps others discover the project and motivates us to keep improving it.
-
-<div align="center">
-
-### [Star on GitHub](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
-
-**Thank you for your support!**
-
-**Built with care for the .NET community**
-
-[Report Bug](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) | [Request Feature](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/issues) | [Discussions](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/discussions)
-
-</div>
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,73 +1,45 @@
-# CBDDC Documentation
-
-This folder contains the official documentation for CBDDC, published as GitHub Pages.
-
-## Documentation Structure (v0.9.0)
-
-### Getting Started
-- **[Getting Started](getting-started.md)** - Installation, setup, and first steps with CBDDC
-
-### Core Documentation
-- **[Architecture](architecture.md)** - Hybrid Logical Clocks, Gossip Protocol, mesh networking
-- **[API Reference](api-reference.md)** - Complete API documentation with examples
-- **[Querying](querying.md)** - Data querying patterns and LINQ support
-
-### Persistence & Storage
-- **[Persistence Providers](persistence-providers.md)** - SQLite, EF Core, PostgreSQL comparison
-- **[Deployment Modes](deployment-modes.md)** - Single-cluster deployment strategy
-
-### Networking & Security
-- **[Security](security.md)** - Encryption, authentication, secure networking
-- **[Conflict Resolution](conflict-resolution.md)** - LWW and Recursive Merge strategies
-- **[Network Telemetry](network-telemetry.md)** - Monitoring and diagnostics
-- **[Dynamic Reconfiguration](dynamic-reconfiguration.md)** - Runtime configuration changes
-- **[Remote Peer Configuration](remote-peer-configuration.md)** - Managing remote peers and tracking lifecycle
-- **[Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)** - Rollout notes and adoption checklist
+# CBDDC Documentation Index

-### Deployment & Operations
-- **[Deployment (LAN)](deployment-lan.md)** - Platform-specific deployment guide
-- **[Production Hardening](production-hardening.md)** - Configuration, monitoring, best practices
-- **[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md)** - Operational workflow for de-tracking and removal
-
-## Building the Documentation
-
-This documentation uses Jekyll with the Cayman theme and is automatically published via GitHub Pages.
-
-### Local Development
-
-```bash
-# Install Jekyll (requires Ruby)
-gem install bundler jekyll
-
-# Serve documentation locally
-cd docs
-jekyll serve
-
-# Open http://localhost:4000
-```
-
-### Site Configuration
-
-- **_config.yml** - Jekyll configuration and site metadata
-- **_layouts/default.html** - Main page layout with navigation and styling
-- **_includes/nav.html** - Top navigation bar
-- **_data/navigation.yml** - Sidebar navigation structure per version
-
-## Version History
-
-The documentation supports multiple versions:
-- **v0.9** (current) - Latest stable release
-- **v0.8** - ASP.NET Core hosting, EF Core, PostgreSQL support
-- **v0.7** - Brotli compression, Protocol v4
-- **v0.6** - Secure networking, conflict resolution strategies
-
-## Contributing to Documentation
-
-1. **Fix typos or improve clarity** - Submit PRs directly
-2. **Add new guides** - Create a new .md file and update navigation.yml
-3. **Test locally** - Run Jekyll locally to preview changes before submitting
-
-## Links
-
-- [Main Repository](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
-- [NuGet Packages](https://www.nuget.org/packages?q=CBDDC)
+This folder contains operational and engineering documentation for CBDDC.
+
+## Core Documentation
+
+- [Architecture](architecture.md)
+- [Deployment](deployment.md)
+- [Runbook](runbook.md)
+- [Security](security.md)
+- [Access and Permissions](access.md)
+- [Troubleshooting](troubleshooting.md)
+
+## Feature Documentation
+
+- [Feature Inventory](features/README.md)
+- [Selective Collection Sync](features/selective-collection-sync.md)
+- [Peer-to-Peer Gossip Sync](features/peer-to-peer-gossip-sync.md)
+- [Secure Peer Transport](features/secure-peer-transport.md)
+- [Peer-Confirmed Pruning](features/peer-confirmed-pruning.md)
+
+## Additional Technical References
+
+- [Getting Started](getting-started.md)
+- [API Reference](api-reference.md)
+- [Conflict Resolution](conflict-resolution.md)
+- [Persistence Providers](persistence-providers.md)
+- [Network Telemetry](network-telemetry.md)
+- [Dynamic Reconfiguration](dynamic-reconfiguration.md)
+- [Remote Peer Configuration](remote-peer-configuration.md)
+- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
+- [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md)
+- [Database Sync Manager Design](database-sync-manager-design.md)
+
+## Deployment-Specific Guides
+
+- [Deployment Modes](deployment-modes.md)
+- [Deployment (LAN)](deployment-lan.md)
+- [Production Hardening](production-hardening.md)
+
+## Documentation Conventions
+
+- Use lowercase kebab-case for markdown files.
+- Keep canonical operational docs at the top level under `docs/`.
+- Store major feature documentation in `docs/features/` and keep `docs/features/README.md` updated.
--- a/docs/access.md
+++ b/docs/access.md
@@ -0,0 +1,44 @@
+# Access and Permissions
+
+This document defines the least-privilege access model for CBDDC environments.
+
+## Roles
+
+| Role | Typical Permissions | Approval Required |
+|------|---------------------|-------------------|
+| Runtime Operator | Read health/logs, restart service, run incident checks | Team lead or on-call manager |
+| Deployment Engineer | Deploy approved releases, update runtime configuration | Change approval for production |
+| Security Administrator | Manage secrets, rotate tokens, review access | Security approval |
+| Maintainer | Modify CBDDC source/docs, merge reviewed changes | Pull request review |
+
+## Least-Privilege Rules
+
+- Grant access by role, not by individual preference.
+- Use environment-specific credentials and scoped service accounts.
+- Do not share production credentials across environments.
+- Remove elevated access promptly after the incident or change window closes.
+
+## Approval Flow
+
+1. Request access with role, environment, and business reason.
+2. Approver validates least-privilege scope.
+3. Access is granted with an expiration date when applicable.
+4. Grant/revoke events are logged for auditability.
+
+## Periodic Access Review
+
+- Review active privileged access at least quarterly.
+- Remove dormant or unowned accounts immediately.
+- Validate that emergency access accounts are controlled and monitored.
+
+## Secret Handling
+
+- Store `AuthToken`, connection strings, and credentials in approved secret stores.
+- Never commit secrets to source control.
+- Rotate secrets after incidents and on a scheduled cadence.
+
+## Related Documents
+
+- [Security](security.md)
+- [Runbook](runbook.md)
+- [Production Hardening](production-hardening.md)
--- a/docs/adr/0001-peer-mesh-topology.md
+++ b/docs/adr/0001-peer-mesh-topology.md
@@ -0,0 +1,19 @@
+# ADR 0001: Peer Mesh Topology
+
+## Status
+
+Accepted
+
+## Context
+
+CBDDC requires local-first synchronization without a central coordinator in LAN-focused deployments.
+
+## Decision
+
+Adopt a peer mesh topology with gossip-based synchronization and per-node oplog tracking.
+
+## Consequences
+
+- Improves resilience when individual nodes disconnect.
+- Requires strong operational tooling for lag, peer lifecycle, and pruning controls.
+- Increases importance of clear runbook and troubleshooting documentation.
--- a/docs/conflict-resolution.md
+++ b/docs/conflict-resolution.md
@@ -246,6 +246,6 @@ A: No. Conflict resolution assumes both documents have compatible schemas.
 ---

 **See Also:**
-- [Getting Started](getting-started.html)
-- [Architecture](architecture.html)  
-- [Security](security.html)
+- [Getting Started](getting-started.md)
+- [Architecture](architecture.md)  
+- [Security](security.md)
--- a/docs/deployment.md
+++ b/docs/deployment.md
@@ -0,0 +1,61 @@
+# Deployment
+
+This document defines the canonical CBDDC deployment workflow, validation gates, and rollback procedures.
+
+## Environments
+
+- Local development: single node for functional verification.
+- Staging: multi-node mesh or hosted mode with production-like configuration.
+- Production: controlled rollout with health checks, monitoring, and rollback readiness.
+
+## Promotion Workflow
+
+1. Build and test in CI (`dotnet build`, `dotnet test`).
+2. Validate configuration and secrets for the target environment.
+3. Deploy to staging.
+4. Run smoke checks:
+   - Node starts and joins expected peers.
+   - `/health` reports healthy.
+   - Sync operations replicate across target collections.
+5. Promote to production using an approved change window.
+
+## Validation Gates
+
+- No failed test suites.
+- No unresolved critical incidents.
+- Health check status is healthy, or is an approved degraded state with a documented mitigation.
+- Backup/restore path confirmed for the active persistence provider.
+
+## Rollback Triggers
+
+Rollback immediately when any of the following occurs:
+
+- Sync correctness regression (missed or duplicate replication events).
+- Persistent unhealthy status after remediation attempt.
+- Authentication or connectivity failure across required peers.
+- Data integrity concern reported by validation checks.
+
+## Rollback Procedure
+
+1. Stop traffic or disable rollout to newly deployed nodes.
+2. Revert to the last known-good build.
+3. Restore previous configuration and secret set.
+4. If needed, restore persistence snapshot/backup.
+5. Re-run health and replication smoke checks.
+6. Record incident details in the runbook timeline.
+
+## Emergency Changes
+
+Use emergency deployment only for incident containment:
+
+1. Capture incident reference and approver.
+2. Apply a minimally scoped change.
+3. Run abbreviated smoke checks (startup, health, critical replication path).
+4. Follow up with standard post-incident review.
+
+## Related Documents
+
+- [Deployment Modes](deployment-modes.md)
+- [Deployment (LAN)](deployment-lan.md)
+- [Production Hardening](production-hardening.md)
+- [Runbook](runbook.md)
--- a/docs/features/README.md
+++ b/docs/features/README.md
@@ -0,0 +1,16 @@
+# Major Feature Inventory
+
+This index tracks CBDDC major functionality. Each feature has one canonical document.
+
+## Features
+
+- [Selective Collection Sync](selective-collection-sync.md)
+- [Peer-to-Peer Gossip Sync](peer-to-peer-gossip-sync.md)
+- [Secure Peer Transport](secure-peer-transport.md)
+- [Peer-Confirmed Pruning](peer-confirmed-pruning.md)
+
+## Maintenance Rules
+
+- Keep one file per major feature.
+- Update feature docs when APIs, operations, or security controls change.
+- Cross-link each feature to [Runbook](../runbook.md) and [Security](../security.md).
--- a/docs/features/peer-confirmed-pruning.md
+++ b/docs/features/peer-confirmed-pruning.md
@@ -0,0 +1,68 @@
+# Feature: Peer-Confirmed Pruning
+
+## Purpose and Business Outcome
+
+Prune oplog history safely while preventing data loss for active peers that have not confirmed required streams.
+
+## Scope and Non-Goals
+
+Scope:
+
+- Retention cutoff plus peer confirmation cutoff logic.
+- Operational controls for peer tracking and de-tracking.
+
+Non-goals:
+
+- Automatic removal of retired peers without operator action.
+- Replacement for backup/restore strategy.
+
+## User and System Workflows
+
+1. Maintenance job calculates retention and confirmation cutoffs.
+2. System blocks pruning when required confirmations are missing.
+3. Operator de-tracks retired peers when appropriate.
+4. Pruning resumes once constraints are satisfied.
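One plausible reading of the workflow above is that the effective prune cutoff is the older of the retention cutoff and the oldest confirmation among tracked peers, and that pruning is blocked while any tracked peer has no confirmation. The sketch below illustrates that reading only; it is not the CBDDC implementation. The function name, the integer timestamp format, and the `-` marker for a missing confirmation are assumptions, and it simplifies per-stream confirmations to a single value per peer.

```bash
# Hypothetical sketch; not the CBDDC implementation.
# Usage: prune_cutoff RETENTION_CUTOFF [PEER_CONFIRMATION ...]
# Timestamps are integers (for example epoch milliseconds).
# "-" marks a tracked peer that has not confirmed yet.
prune_cutoff() {
  cutoff="$1"
  shift
  for confirmed in "$@"; do
    if [ "$confirmed" = "-" ]; then
      echo "blocked: tracked peer without confirmation" >&2
      return 1
    fi
    # Never prune past what the slowest tracked peer has confirmed.
    if [ "$confirmed" -lt "$cutoff" ]; then
      cutoff="$confirmed"
    fi
  done
  echo "$cutoff"
}
```

Under these assumptions, de-tracking a retired peer corresponds to dropping its argument from the call, which is why pruning can resume once the operator removes stale tracking entries.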
+
+## Interfaces, APIs, and Events Involved
+
+- Maintenance pruning scheduler
+- Peer tracking operations (`RemovePeerTrackingAsync`, `RemoveRemotePeerAsync`)
+- Health metrics for lag and missing confirmations
+
+## Permissions and Data Handling
+
+- Only approved operators should modify peer tracking state.
+- Pruning decisions must be auditable through logs and incident notes.
+
+## Dependencies and Failure Modes
+
+Dependencies:
+
+- Accurate peer tracking metadata
+- Timely confirmation updates per source stream
+
+Failure modes:
+
+- Pruning blocked indefinitely by stale peer tracking
+- Unsafe pruning if controls are bypassed
+
+## Monitoring, Alerts, and Troubleshooting Pointers
+
+- Monitor peers with missing confirmations and sustained lag.
+- Use [Peer Deprecation and Removal Runbook](../peer-deprecation-removal-runbook.md) for operational actions.
+
+## Rollout and Change Considerations
+
+- Introduce pruning policy changes with explicit maintenance windows.
+- Validate expected cutoff behavior in staging before production rollout.
+
+## Validation and Testability Guidance
+
+- Add tests for blocked prune when confirmations are missing.
+- Add tests for resumed prune after de-tracking retired peers.
+- Smoke test health status transitions around peer lifecycle changes.
+
+## Related Security Controls
+
+- [Security](../security.md)
+- [Runbook](../runbook.md)
--- a/docs/features/peer-to-peer-gossip-sync.md
+++ b/docs/features/peer-to-peer-gossip-sync.md
@@ -0,0 +1,70 @@
+# Feature: Peer-to-Peer Gossip Sync
+
+## Purpose and Business Outcome
+
+Propagate updates across mesh nodes without a central coordinator so local-first applications remain resilient to intermittent connectivity.
+
+## Scope and Non-Goals
+
+Scope:
+
+- Peer discovery and sync orchestration.
+- Push/pull propagation of oplog changes.
+
+Non-goals:
+
+- Strong global consistency guarantees.
+- Public internet exposure without additional network controls.
+
+## User and System Workflows
+
+1. Node starts and discovers peers.
+2. Node exchanges sync metadata with connected peers.
+3. Missing operations are requested and applied.
+4. Mesh converges over repeated gossip rounds.
+
+## Interfaces, APIs, and Events Involved
+
+- Sync orchestrator scheduling
+- TCP peer sync channels
+- Vector clock exchange and reconciliation
+
+## Permissions and Data Handling
+
+- Peers with a valid authentication token can exchange replicated collection data.
+- Cluster membership should be restricted to trusted nodes.
+
+## Dependencies and Failure Modes
+
+Dependencies:
+
+- Network reachability
+- Shared authentication material
+- Healthy persistence layer
+
+Failure modes:
+
+- Peer isolation due to network outage
+- Token mismatch blocking synchronization
+- Sustained lag under high write pressure
+
+## Monitoring, Alerts, and Troubleshooting Pointers
+
+- Monitor `laggingPeers`, `maxLagMs`, and active peer counts.
+- Follow [Runbook](../runbook.md) playbooks for lagging or disconnected peers.
+
+## Rollout and Change Considerations
+
+- Roll out protocol-affecting changes in a controlled window.
+- Confirm backward/forward compatibility in staging mesh before production rollout.
+
+## Validation and Testability Guidance
+
+- Run multi-node integration tests with controlled partitions.
+- Validate eventual convergence after reconnect.
+- Verify that no data loss occurs under repeated reconnect scenarios.
+
+## Related Security Controls
+
+- [Security](../security.md)
+- [Access and Permissions](../access.md)
--- a/docs/features/secure-peer-transport.md
+++ b/docs/features/secure-peer-transport.md
@@ -0,0 +1,67 @@
+# Feature: Secure Peer Transport
+
+## Purpose and Business Outcome
+
+Protect replicated data in transit with authenticated and encrypted peer communication.
+
+## Scope and Non-Goals
+
+Scope:
+
+- Secure handshake and key establishment.
+- Message confidentiality and integrity controls.
+
+Non-goals:
+
+- Data-at-rest encryption.
+- Full identity and certificate lifecycle management.
+
+## User and System Workflows
+
+1. Operator enables secure transport components.
+2. Peers perform handshake and establish session keys.
+3. Replication traffic is encrypted/authenticated.
+4. Health and logs expose secure mode status.
+
+## Interfaces, APIs, and Events Involved
+
+- `IPeerHandshakeService` / secure handshake implementation
+- Network pipeline message encryption and HMAC validation
+- Startup configuration for secure mode
+
+## Permissions and Data Handling
+
+- Secret material (`AuthToken`, key inputs) must be restricted to authorized operators.
+- Logs must avoid plaintext secret disclosure.
+
+## Dependencies and Failure Modes
+
+Dependencies:
+
+- Consistent security mode across peers
+- Valid runtime cryptographic dependencies
+
+Failure modes:
+
+- Secure/plaintext mode mismatch
+- Handshake failure due to key/token mismatch
+
+## Monitoring, Alerts, and Troubleshooting Pointers
+
+- Alert on repeated handshake failures.
+- Use [Runbook](../runbook.md) for incident triage and [Troubleshooting](../troubleshooting.md) for remediation.
+
+## Rollout and Change Considerations
+
+- Enable secure mode in staging first.
+- Roll production nodes in controlled order to avoid mixed-mode partitions.
+
+## Validation and Testability Guidance
+
+- Add tests for secure-to-secure success and mixed-mode rejection.
+- Validate encrypted cluster startup and sync with production-like load.
+
+## Related Security Controls
+
+- [Security](../security.md)
+- [Access and Permissions](../access.md)
--- a/docs/features/selective-collection-sync.md
+++ b/docs/features/selective-collection-sync.md
@@ -0,0 +1,69 @@
+# Feature: Selective Collection Sync
+
+## Purpose and Business Outcome
+
+Allow teams to replicate only selected collections so bandwidth and operational overhead stay aligned to business-critical data.
+
+## Scope and Non-Goals
+
+Scope:
+
+- Register collections for replication using `WatchCollection()`.
+- Replicate changes for registered collections across peers.
+
+Non-goals:
+
+- Automatic replication of all database collections.
+- Schema migration management.
+
+## User and System Workflows
+
+1. Developer registers target collections in the document store.
+2. Local writes trigger CDC events.
+3. Oplog entries propagate through peer sync.
+4. Remote peers apply updates for matching collections.
+
+## Interfaces, APIs, and Events Involved
+
+- `WatchCollection(collectionName, collection, keySelector)`
+- CDC trigger pipeline
+- Oplog append and apply operations
+
+## Permissions and Data Handling
+
+- Access to source collections is controlled by host application permissions.
+- Only approved collections should be registered for sync in sensitive environments.
+
+## Dependencies and Failure Modes
+
+Dependencies:
+
+- Correct collection registration
+- Stable peer connectivity
+- Persistence availability
+
+Failure modes:
+
+- Missed replication due to unregistered collection
+- Delayed propagation during network partition
+
+## Monitoring, Alerts, and Troubleshooting Pointers
+
+- Monitor replication lag and peer confirmation metrics.
+- Use [Runbook](../runbook.md) and [Troubleshooting](../troubleshooting.md) for incident response.
+
+## Rollout and Change Considerations
+
+- Introduce new synced collections behind a staged rollout.
+- Validate downstream consumer compatibility before production enablement.
+
+## Validation and Testability Guidance
+
+- Add integration tests verifying only registered collections replicate.
+- Smoke test by writing to registered and non-registered collections and confirming expected behavior.
+- Validate that no unexpected collection appears on remote peers after deployment.
+
+## Related Security Controls
+
+- [Security](../security.md)
+- [Access and Permissions](../access.md)
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -179,10 +179,10 @@ Choose your strategy:

 ## Next Steps

-- [Architecture Overview](architecture.html) - Understand HLC, Gossip Protocol, and mesh networking
-- [Persistence Providers](persistence-providers.html) - Choose the right database for your deployment
-- [Deployment Modes](deployment-modes.html) - Single-cluster deployment strategy
-- [Security Configuration](security.html) - Encryption and authentication
-- [Conflict Resolution Strategies](conflict-resolution.html) - LWW vs Recursive Merge
-- [Production Hardening](production-hardening.html) - Best practices and monitoring
-- [API Reference](api-reference.html) - Complete API documentation
+- [Architecture Overview](architecture.md) - Understand HLC, Gossip Protocol, and mesh networking
+- [Persistence Providers](persistence-providers.md) - Choose the right database for your deployment
+- [Deployment Modes](deployment-modes.md) - Single-cluster deployment strategy
+- [Security Configuration](security.md) - Encryption and authentication
+- [Conflict Resolution Strategies](conflict-resolution.md) - LWW vs Recursive Merge
+- [Production Hardening](production-hardening.md) - Best practices and monitoring
+- [API Reference](api-reference.md) - Complete API documentation
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,57 +1,44 @@
----
-layout: default
----
-
-# CBDDC Documentation
-
-Welcome to the CBDDC documentation for **version 0.9.0**.
-
-## What's New in v0.9.0
-
-Version 0.9.0 brings enhanced ASP.NET Core support, improved EF Core runtime stability, and refined synchronization and persistence layers.
-
-- **ASP.NET Core Enhancements**: Improved sample application with better error handling
-- **EF Core Stability**: Fixed runtime issues and improved persistence layer reliability
-- **Sync & Persistence**: Enhanced stability across all persistence providers
-- **Production Ready**: All features tested and stable for production use
-
-See the [CHANGELOG](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) for complete details.
-
-## Documentation
-
-### Getting Started
-*   [**Getting Started**](getting-started.html) - Installation, basic setup, and your first CBDDC application
-
-### Core Concepts
-*   [Architecture](architecture.html) - Understanding HLC, Gossip Protocol, and P2P mesh networking
-*   [API Reference](api-reference.html) - Complete API documentation and examples
-*   [Querying](querying.html) - Data querying patterns and LINQ support
-
-### Persistence & Storage
-*   [Persistence Providers](persistence-providers.html) - SQLite, EF Core, PostgreSQL comparison and configuration
-*   [Deployment Modes](deployment-modes.html) - Single-cluster deployment strategy
-
-### Networking & Security
-*   [Security](security.html) - Encryption, authentication, and secure networking
-*   [Conflict Resolution](conflict-resolution.html) - LWW and Recursive Merge strategies
-*   [Network Telemetry](network-telemetry.html) - Monitoring and diagnostics
-*   [Dynamic Reconfiguration](dynamic-reconfiguration.html) - Runtime configuration and leader election
-*   [Remote Peer Configuration](remote-peer-configuration.html) - Managing remote peers and tracking lifecycle
-*   [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.html) - Adoption and rollout notes
+---
+layout: default
+---

-### Deployment & Operations
-*   [Deployment (LAN)](deployment-lan.html) - Platform-specific deployment instructions
-*   [Production Hardening](production-hardening.html) - Configuration, monitoring, and best practices
-*   [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html) - Operational deprecation/removal workflow
-
-## Previous Versions
-
-- [v0.8.x Documentation](v0.8/getting-started.html)
-- [v0.7.x Documentation](v0.7/getting-started.html)
-- [v0.6.x Documentation](v0.6/getting-started.html)
-
-## Links
-
-*   [GitHub Repository](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net)
-*   [NuGet Packages](https://www.nuget.org/packages?q=CBDDC)
-*   [Changelog](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md)
+# CBDDC Documentation
+
+Welcome to the CBDDC documentation site.
+
+## Canonical Operational Docs
+
+- [Architecture](architecture.md)
+- [Deployment](deployment.md)
+- [Runbook](runbook.md)
+- [Security](security.md)
+- [Access and Permissions](access.md)
+- [Troubleshooting](troubleshooting.md)
+
+## Feature Docs
+
+- [Feature Inventory](features/README.md)
+- [Selective Collection Sync](features/selective-collection-sync.md)
+- [Peer-to-Peer Gossip Sync](features/peer-to-peer-gossip-sync.md)
+- [Secure Peer Transport](features/secure-peer-transport.md)
+- [Peer-Confirmed Pruning](features/peer-confirmed-pruning.md)
+
+## Additional References
+
+- [Getting Started](getting-started.md)
+- [API Reference](api-reference.md)
+- [Persistence Providers](persistence-providers.md)
+- [Deployment Modes](deployment-modes.md)
+- [Deployment (LAN)](deployment-lan.md)
+- [Production Hardening](production-hardening.md)
+- [Conflict Resolution](conflict-resolution.md)
+- [Network Telemetry](network-telemetry.md)
+- [Dynamic Reconfiguration](dynamic-reconfiguration.md)
+- [Remote Peer Configuration](remote-peer-configuration.md)
+- [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
+- [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md)
+- [Database Sync Manager Design](database-sync-manager-design.md)
+
+## Release History
+
+See the [CHANGELOG](https://github.com/CBDDC/ZB.MOM.WW.CBDDC.Net/blob/main/CHANGELOG.md) for version-level changes.
--- a/docs/remote-peer-configuration.md
+++ b/docs/remote-peer-configuration.md
@@ -104,5 +104,5 @@ Fields stored per peer:
  which removes the peer from active prune-gating.
 - Re-observed peers can be re-registered and become active for tracking again.
 - For rollout steps and production operations, see:
-  - [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.html)
-  - [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html)
+  - [Upgrade: Peer-Confirmed Pruning](upgrade-peer-confirmed-pruning.md)
+  - [Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md)
--- a/docs/runbook.md
+++ b/docs/runbook.md
@@ -0,0 +1,58 @@
+# Operations Runbook
+
+This runbook is the primary operational reference for CBDDC monitoring, incident triage, escalation, and recovery.
+
+## Ownership and Escalation
+
+- Service owner: CBDDC Core Maintainers.
+- First response: local platform/application on-call team for the affected deployment.
+- Product issue escalation: open an incident issue in the CBDDC repository and attach logs and the health payload.
+
+## Alert Triage
+
+1. Identify severity based on impact:
+   - Sev 1: Data integrity risk, sustained outage, or broad replication failure.
+   - Sev 2: Partial sync degradation or prolonged peer lag.
+   - Sev 3: Isolated node issue with workaround.
+2. Confirm current `cbddc` health check status and payload.
+3. Identify affected peers, collections, and first observed time.
+4. Apply the relevant recovery play below.
+
+## Core Diagnostics
+
+Capture these artifacts before remediation:
+
+- Health response payload (`trackedPeerCount`, `laggingPeers`, `peersWithNoConfirmation`, `maxLagMs`).
+- Application logs for sync, persistence, and network components.
+- Current runtime configuration (excluding secrets).
+- Most recent deployment identifier and change window.
+
+## Recovery Plays
+
+### Peer unreachable or lagging
+
+1. Verify network path and auth token consistency.
+2. Validate peer is still expected in topology.
+3. If peer is retired, follow [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
+4. Recheck health status after remediation.
+
+### Persistence failure
+
+1. Verify storage path and permissions.
+2. Run integrity checks.
+3. Restore from latest valid backup if corruption is confirmed.
+4. Validate replication behavior after restore.
+
+### Configuration drift
+
+1. Compare deployed config to approved baseline.
+2. Reapply canonical settings.
+3. Restart affected service safely.
+4. Verify recovery with health and smoke checks.
+
+## Post-Incident Actions
+
+1. Record root cause and timeline.
+2. Add follow-up work items (tests, alerts, docs updates).
+3. Update affected feature docs and troubleshooting guidance.
+4. Confirm rollback and recovery instructions remain accurate.
--- a/docs/security.md
+++ b/docs/security.md
@@ -140,6 +140,6 @@ A: Security works on all target frameworks (netstandard2.0, net6.0, net8.0) with
 ---

 **See Also:**
-- [Getting Started](getting-started.html)
-- [Architecture](architecture.html)
-- [Production Hardening](production-hardening.html)
+- [Getting Started](getting-started.md)
+- [Architecture](architecture.md)
+- [Production Hardening](production-hardening.md)
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -0,0 +1,86 @@
+# Troubleshooting
+
+This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.
+
+## Peer Cannot Connect
+
+Symptoms:
+
+- Node remains disconnected from expected peers.
+- Health check reports lagging or unconfirmed peers.
+
+Likely causes:
+
+- Network path blocked (port/firewall mismatch).
+- `AuthToken` mismatch.
+- Peer configuration drift.
+
+Resolution:
+
+1. Verify TCP/UDP port configuration on both peers.
+2. Confirm shared token and node identity settings.
+3. Restart peer service and monitor logs.
+4. Recheck `cbddc` health payload.
+
+## Replication Delay or Missing Updates
+
+Symptoms:
+
+- Writes are visible locally but not on remote peers.
+- `maxLagMs` grows continuously.
+
+Likely causes:
+
+- Retired peer still tracked and gating pruning.
+- High load or transient network instability.
+- Invalid collection watch configuration.
+
+Resolution:
+
+1. Confirm affected collections are registered with `WatchCollection()`.
+2. Inspect peer confirmation metrics.
+3. If retired peers are still tracked, de-track them using the [Peer Deprecation and Removal Runbook](peer-deprecation-removal-runbook.md).
+4. Re-run smoke sync validation after changes.
+
+## Persistence Errors
+
+Symptoms:
+
+- Startup or write failures from persistence layer.
+- Unhealthy health check due to storage exceptions.
+
+Likely causes:
+
+- File/path permission errors.
+- Storage corruption.
+- Misconfigured provider settings.
+
+Resolution:
+
+1. Validate storage path and runtime permissions.
+2. Run integrity checks.
+3. Restore from latest good backup if corruption is detected.
+4. Validate read/write and replication after restore.
+
+## Configuration Regressions After Release
+
+Symptoms:
+
+- Behavior changed immediately after deployment.
+- Multiple nodes fail with same error pattern.
+
+Likely causes:
+
+- Incorrect environment variables or appsettings values.
+- Partial rollout with incompatible settings.
+
+Resolution:
+
+1. Compare deployed configuration to approved baseline.
+2. Roll back to last known-good release if production impact is high.
+3. Redeploy with corrected configuration.
+4. Document root cause and preventive controls.
+
+## Escalation
+
+If a Sev 1 or Sev 2 condition cannot be resolved quickly, follow the escalation and incident procedures in the [Runbook](runbook.md).
--- a/docs/upgrade-peer-confirmed-pruning.md
+++ b/docs/upgrade-peer-confirmed-pruning.md
@@ -66,4 +66,4 @@ If rollout exposes unexpected persistent degradation:
 3. Keep full peer removal for nodes that are confirmed decommissioned.

 For a detailed operator procedure, see
-[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.html).
+[Peer Deprecation & Removal Runbook](peer-deprecation-removal-runbook.md).