CBDDC/docs/troubleshooting.md
Joseph Doherty ce727eb30d
docs: align internal docs to enterprise standards
Add canonical operations/security/access/feature docs and fix path integrity to improve onboarding and incident readiness.
2026-02-20 13:23:55 -05:00

Troubleshooting

This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.

Peer Cannot Connect

Symptoms:

  • Node remains disconnected from expected peers.
  • Health check reports lagging or unconfirmed peers.

Likely causes:

  • Network path blocked (port/firewall mismatch).
  • AuthToken mismatch.
  • Peer configuration drift.

Resolution:

  1. Verify TCP/UDP port configuration on both peers.
  2. Confirm shared token and node identity settings.
  3. Restart peer service and monitor logs.
  4. Recheck the cbddc health payload and confirm the peer is no longer reported as lagging or unconfirmed.
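
The port check in step 1 can be partly automated. The following is a minimal sketch in Python, not part of CBDDC itself: the function name `tcp_port_reachable` is hypothetical, and the host and port you pass in must come from your own peer configuration. It only tests TCP reachability; a UDP path cannot be verified this way because UDP has no connection handshake.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it does not speak the CBDDC
    protocol, it only confirms that the network path and listener exist.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run it from each peer against the other peer's address. A `False` result in one direction only usually points to a firewall rule rather than to token or identity settings.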

Replication Delay or Missing Updates

Symptoms:

  • Writes are visible locally but not on remote peers.
  • maxLagMs grows continuously.

Likely causes:

  • Retired peer still tracked and gating pruning.
  • High load or transient network instability.
  • Invalid collection watch configuration.

Resolution:

  1. Confirm affected collections are registered with WatchCollection().
  2. Inspect peer confirmation metrics.
  3. If needed, de-track retired peers using the runbook.
  4. Re-run smoke sync validation after changes.
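
To tell a continuously growing maxLagMs apart from a transient spike, it can help to compare several consecutive readings. This is a minimal sketch that assumes you have already collected the maxLagMs values yourself, for example by polling the health payload at a fixed interval. The function name `lag_is_growing` and the three-sample minimum are illustrative choices, not CBDDC defaults.

```python
from typing import Sequence


def lag_is_growing(samples_ms: Sequence[float], min_samples: int = 3) -> bool:
    """Return True if each reading is strictly greater than the one before.

    samples_ms holds maxLagMs readings in the order they were collected.
    With fewer than min_samples readings, no trend is reported.
    """
    if len(samples_ms) < min_samples:
        return False
    # Compare each reading with its successor.
    return all(later > earlier for earlier, later in zip(samples_ms, samples_ms[1:]))
```

For example, `lag_is_growing([100, 250, 900])` reports a trend, while `lag_is_growing([100, 250, 200])` does not, which is more consistent with high load or transient network instability.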

Persistence Errors

Symptoms:

  • Startup or write failures from persistence layer.
  • Unhealthy health check due to storage exceptions.

Likely causes:

  • File/path permission errors.
  • Storage corruption.
  • Misconfigured provider settings.

Resolution:

  1. Validate storage path and runtime permissions.
  2. Run integrity checks.
  3. Restore from latest good backup if corruption is detected.
  4. Validate read/write and replication after restore.
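
Step 1 can be sketched as a small probe run under the same account as the CBDDC service. This is an illustrative Python helper, not a CBDDC tool: `check_storage_path` is a hypothetical name, and the path you pass must be the storage path from your own provider settings. It checks that the directory exists and that a file can be written and read back; it does not detect storage corruption.

```python
import os
import tempfile
from typing import List


def check_storage_path(path: str) -> List[str]:
    """Return a list of problems found with the storage directory.

    An empty list means the directory exists and a probe file could be
    written and read back by the current user.
    """
    problems: List[str] = []
    if not os.path.isdir(path):
        problems.append(f"not a directory: {path}")
        return problems
    try:
        # The probe file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"probe")
            probe.flush()
            probe.seek(0)
            if probe.read() != b"probe":
                problems.append(f"read-back mismatch in {path}")
    except OSError as exc:
        # Typically a permission error or a read-only file system.
        problems.append(f"cannot write in {path}: {exc}")
    return problems
```

Running the probe as a different user than the service can hide the very permission error you are looking for, so match the runtime identity.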

Configuration Regressions After Release

Symptoms:

  • Behavior changed immediately after deployment.
  • Multiple nodes fail with same error pattern.

Likely causes:

  • Incorrect environment variables or appsettings values.
  • Partial rollout with incompatible settings.

Resolution:

  1. Compare deployed configuration to approved baseline.
  2. Roll back to last known-good release if production impact is high.
  3. Redeploy with corrected configuration.
  4. Document root cause and preventive controls.
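
The baseline comparison in step 1 can be sketched as a key-by-key diff of two configuration documents that have already been loaded into dictionaries, for example from appsettings JSON files. The names `flatten` and `config_drift`, the `<missing>` marker, and the sample keys in the usage note are all hypothetical. Nested keys are joined with `:` here, following the .NET configuration key convention.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # Marker for a key present on only one side.


def flatten(config: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dictionaries into 'Section:Key' paths."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}:{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_drift(
    baseline: Dict[str, Any], deployed: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {path: (baseline_value, deployed_value)} for every difference."""
    base, live = flatten(baseline), flatten(deployed)
    drift: Dict[str, Tuple[Any, Any]] = {}
    for path in sorted(base.keys() | live.keys()):
        before, after = base.get(path, MISSING), live.get(path, MISSING)
        if before != after:
            drift[path] = (before, after)
    return drift
```

With a baseline of `{"Node": {"TcpPort": 5000}}` and a deployed value of `{"Node": {"TcpPort": 5001}}`, the result contains the single entry `"Node:TcpPort": (5000, 5001)`. Avoid printing the diff to shared logs if it may include secrets such as the AuthToken.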

Escalation

If a Sev 1 or Sev 2 condition cannot be resolved quickly, follow the escalation and incident procedures in the Runbook.