# Troubleshooting

This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.

## Peer Cannot Connect

Symptoms:

- Node remains disconnected from expected peers.
- Health check reports lagging or unconfirmed peers.

Likely causes:

- Network path blocked (port/firewall mismatch).
- `AuthToken` mismatch.
- Peer configuration drift.

Resolution:

1. Verify TCP/UDP port configuration on both peers.
2. Confirm shared token and node identity settings.
3. Restart peer service and monitor logs.
4. Recheck `cbddc` health payload.

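To rule out a blocked network path before looking at tokens or identity settings, a quick reachability probe can help. The following is a minimal sketch, not part of CBDDC: the function name `tcp_port_open` is hypothetical, and it only tests TCP (a UDP port cannot be probed reliably this way).

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run it from each peer against the other peer's configured address and port. A `False` result in one direction only usually points at a firewall rule rather than at the service itself.
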
## Replication Delay or Missing Updates

Symptoms:

- Writes are visible locally but not on remote peers.
- `maxLagMs` grows continuously.

Likely causes:

- Retired peer still tracked and gating pruning.
- High load or transient network instability.
- Invalid collection watch configuration.

Resolution:

1. Confirm affected collections are registered with `WatchCollection()`.
2. Inspect peer confirmation metrics.
3. If needed, de-track retired peers using the runbook.
4. Re-run smoke sync validation after changes.

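To tell a transient spike from lag that "grows continuously", it can help to compare several consecutive `maxLagMs` readings. The sketch below assumes you have already collected the readings yourself, by whatever means your monitoring provides; the function name `lag_is_growing` and the three-sample minimum are illustrative choices, not CBDDC settings.

```python
from typing import Sequence


def lag_is_growing(samples_ms: Sequence[float], min_samples: int = 3) -> bool:
    """Return True if every reading is strictly larger than the one before it."""
    if len(samples_ms) < min_samples:
        return False  # too few readings to call it a trend
    # Compare each reading with its successor.
    return all(later > earlier for earlier, later in zip(samples_ms, samples_ms[1:]))
```

A series such as `[100, 250, 400, 900]` counts as growing, while `[100, 250, 200, 900]` does not, because the lag recovered once in between. The latter pattern fits load or network instability better than a retired peer gating pruning.
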
## Persistence Errors

Symptoms:

- Startup or write failures from persistence layer.
- Unhealthy health check due to storage exceptions.

Likely causes:

- File/path permission errors.
- Storage corruption.
- Misconfigured provider settings.

Resolution:

1. Validate storage path and runtime permissions.
2. Run integrity checks.
3. Restore from latest good backup if corruption is detected.
4. Validate read/write and replication after restore.

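The first resolution step, validating the storage path and permissions, can be approximated with a small probe run as the same user account as the service. This is a sketch under that assumption; `check_storage_path` is a hypothetical helper, and it checks only that the directory exists and is writable, not the integrity of the data.

```python
import os
import tempfile
from typing import List


def check_storage_path(path: str) -> List[str]:
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems: List[str] = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
        return problems
    try:
        # Create and remove a scratch file to prove the current user can write here.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"probe")
            probe.flush()
    except OSError as exc:
        problems.append(f"cannot write to {path}: {exc}")
    return problems
```

An empty result rules out the permission cause and shifts attention to corruption or provider settings.
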
## Configuration Regressions After Release

Symptoms:

- Behavior changed immediately after deployment.
- Multiple nodes fail with same error pattern.

Likely causes:

- Incorrect environment variables or appsettings values.
- Partial rollout with incompatible settings.

Resolution:

1. Compare deployed configuration to approved baseline.
2. Roll back to last known-good release if production impact is high.
3. Redeploy with corrected configuration.
4. Document root cause and preventive controls.

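Comparing the deployed configuration to the approved baseline is easier to do consistently with a small diff helper. The sketch below assumes both configurations have already been loaded into flat dictionaries (for example from environment variables or a parsed settings file); `diff_config` and the `"<missing>"` marker are illustrative, not part of CBDDC.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"  # marker for a key that is absent on one side


def diff_config(
    baseline: Dict[str, Any], deployed: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key to a (baseline value, deployed value) pair."""
    keys = baseline.keys() | deployed.keys()
    return {
        key: (baseline.get(key, _MISSING), deployed.get(key, _MISSING))
        for key in sorted(keys)
        if baseline.get(key, _MISSING) != deployed.get(key, _MISSING)
    }
```

Running the same comparison on every node also exposes a partial rollout: nodes that report different diffs against the same baseline are running incompatible settings.
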
## Escalation

If a Sev 1/Sev 2 condition cannot be resolved quickly, follow the escalation and incident procedures in the [Runbook](runbook.md).