Troubleshooting
This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.
Peer Cannot Connect
Symptoms:
- Node remains disconnected from expected peers.
- Health check reports lagging or unconfirmed peers.
Likely causes:
- Network path blocked (port/firewall mismatch).
- AuthToken mismatch.
- Peer configuration drift.
Resolution:
- Verify TCP/UDP port configuration on both peers.
- Confirm shared token and node identity settings.
- Restart peer service and monitor logs.
- Recheck the cbddc health payload.
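The port check in the first resolution step can be scripted from either peer. The following is a minimal sketch, not part of CBDDC: the helper name tcp_port_open is hypothetical, and it covers only TCP reachability (a UDP port cannot be verified this way).

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, unreachable hosts, and DNS failures.
        return False
```

Run it from each node against the other peer's address and configured port, for example tcp_port_open("peer-b.example.internal", 9000), where both the host name and the port are placeholders. A False result in only one direction usually points to a firewall rule rather than a token or identity problem.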
Replication Delay or Missing Updates
Symptoms:
- Writes are visible locally but not on remote peers.
- maxLagMs grows continuously.
Likely causes:
- Retired peer still tracked and gating pruning.
- High load or transient network instability.
- Invalid collection watch configuration.
Resolution:
- Confirm affected collections are registered with WatchCollection().
- Inspect peer confirmation metrics.
- If needed, de-track retired peers using the runbook.
- Re-run smoke sync validation after changes.
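To tell a steadily growing maxLagMs from a transient spike, it can help to compare several consecutive readings. The sketch below assumes you have already collected maxLagMs samples from the health payload at regular intervals; the function name and the three-sample minimum are illustrative choices, not CBDDC behavior.

```python
from typing import Sequence


def lag_is_growing(samples_ms: Sequence[float], min_samples: int = 3) -> bool:
    """Return True if each sample is strictly greater than the one before it."""
    if len(samples_ms) < min_samples:
        # Too few readings to distinguish a trend from noise.
        return False
    return all(later > earlier
               for earlier, later in zip(samples_ms, samples_ms[1:]))
```

For example, lag_is_growing([100, 250, 600, 1400]) is True, while lag_is_growing([100, 250, 200]) is False because the lag recovered between readings. Sustained growth is the pattern that most often matches a retired peer that is still tracked.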
Persistence Errors
Symptoms:
- Startup or write failures from persistence layer.
- Unhealthy health check due to storage exceptions.
Likely causes:
- File/path permission errors.
- Storage corruption.
- Misconfigured provider settings.
Resolution:
- Validate storage path and runtime permissions.
- Run integrity checks.
- Restore from latest good backup if corruption is detected.
- Validate read/write and replication after restore.
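The path and permission check in the first step can be approximated with a short probe run under the same account as the service. This is a minimal sketch under that assumption; check_storage_path is a hypothetical helper, and it does not replace the provider's own integrity checks.

```python
import os
import tempfile
from typing import List


def check_storage_path(path: str) -> List[str]:
    """Return a list of problems found with the storage directory (empty if none)."""
    problems: List[str] = []
    if not os.path.isdir(path):
        problems.append(f"not a directory: {path}")
        return problems
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"cannot read or traverse: {path}")
    try:
        # Writing a throwaway file exercises the real write permission.
        # The file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"probe")
            probe.flush()
    except OSError as exc:
        problems.append(f"cannot write to {path}: {exc}")
    return problems
```

An empty list means the directory exists and the current account can read and write it. Any other result should be fixed before investigating corruption.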
Configuration Regressions After Release
Symptoms:
- Behavior changed immediately after deployment.
- Multiple nodes fail with same error pattern.
Likely causes:
- Incorrect environment variables or appsettings values.
- Partial rollout with incompatible settings.
Resolution:
- Compare deployed configuration to approved baseline.
- Roll back to last known-good release if production impact is high.
- Redeploy with corrected configuration.
- Document root cause and preventive controls.
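Comparing the deployed configuration to the baseline is easier to repeat when it is scripted. The sketch below assumes both configurations have already been loaded into flat dictionaries (for example from JSON files). The function name and the key names in the example are hypothetical, and nested sections would need to be flattened first.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # Marker for a key present on only one side.


def config_drift(baseline: Dict[str, Any],
                 deployed: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key to a (baseline_value, deployed_value) pair."""
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(baseline) | set(deployed)):
        expected = baseline.get(key, MISSING)
        actual = deployed.get(key, MISSING)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift
```

For example, config_drift({"TcpPort": 9000, "AuthToken": "a"}, {"TcpPort": 9001, "AuthToken": "a"}) reports only TcpPort, with the baseline value first. Running the same comparison on every node also helps reveal a partial rollout, because affected nodes report different drift.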
Escalation
If a Sev 1/Sev 2 condition cannot be resolved quickly, follow Runbook escalation and incident procedures.