# Troubleshooting

This guide lists recurring CBDDC failure modes, likely causes, and remediation steps.

## Peer Cannot Connect

Symptoms:

- Node remains disconnected from expected peers.
- Health check reports lagging or unconfirmed peers.

Likely causes:

- Network path blocked (port/firewall mismatch).
- `AuthToken` mismatch.
- Peer configuration drift.

Resolution:

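1. Verify that the TCP/UDP ports configured on both peers match and are not blocked by a firewall.
2. Confirm that both peers use the same `AuthToken` and that node identity settings are correct.
3. Restart the peer service and monitor its logs.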
4. Recheck `cbddc` health payload.
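
The TCP half of step 1 can be checked from the affected node with a short probe. The following is a minimal sketch, not part of CBDDC: the function name `tcp_port_reachable` is hypothetical, and the host and port you pass in must come from your own peer configuration. It only tests whether a TCP connection can be opened. UDP reachability cannot be confirmed this way.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; it does not speak the CBDDC protocol,
    it only confirms that the network path and listener exist.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A `False` result from the node that cannot connect, combined with `True` from another host on the same network, points toward a firewall or routing problem rather than an `AuthToken` mismatch.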

## Replication Delay or Missing Updates

Symptoms:

- Writes are visible locally but not on remote peers.
- `maxLagMs` grows continuously.

Likely causes:

- A retired peer is still tracked and is gating pruning.
- High load or transient network instability.
- Invalid collection watch configuration.

Resolution:

1. Confirm affected collections are registered with `WatchCollection()`.
2. Inspect peer confirmation metrics.
3. If needed, de-track retired peers using the runbook.
4. Re-run smoke sync validation after changes.
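
To decide whether `maxLagMs` "grows continuously" rather than spiking once, it can help to compare several consecutive readings. The sketch below assumes you have already collected `maxLagMs` values from successive health checks into a list. How the values are collected is not shown, and the function name `lag_is_growing` is hypothetical.

```python
from typing import Sequence


def lag_is_growing(samples_ms: Sequence[float], min_samples: int = 3) -> bool:
    """Return True if each reading is strictly larger than the one before it.

    `samples_ms` is assumed to hold maxLagMs readings in chronological order.
    Fewer than `min_samples` readings are treated as inconclusive (False).
    """
    if len(samples_ms) < min_samples:
        return False
    # Pair each reading with its successor and require a strict increase.
    return all(later > earlier for earlier, later in zip(samples_ms, samples_ms[1:]))
```

A strictly increasing series suggests a stuck peer or configuration problem. A series that rises and then falls is more consistent with high load or transient network instability.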

## Persistence Errors

Symptoms:

- Startup or write failures from persistence layer.
- Health check reports unhealthy because of storage exceptions.

Likely causes:

- File/path permission errors.
- Storage corruption.
- Misconfigured provider settings.

Resolution:

1. Validate storage path and runtime permissions.
2. Run integrity checks on the storage files.
3. Restore from latest good backup if corruption is detected.
4. Validate read/write and replication after restore.
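
Step 1 can be approximated with a small probe run under the same account as the CBDDC process. This is a sketch under assumptions: `check_storage_path` is a hypothetical helper, and it only verifies that the directory exists and that a temporary file can be written and read back. It does not detect corruption in existing data.

```python
import os
import tempfile
from typing import List


def check_storage_path(path: str) -> List[str]:
    """Return a list of problems found with the storage directory (empty if none)."""
    problems: List[str] = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
        return problems
    try:
        # Write and read back a probe file; it is deleted when the block exits.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"probe")
            probe.flush()
            probe.seek(0)
            if probe.read() != b"probe":
                problems.append(f"read-back mismatch in {path}")
    except OSError as exc:
        # Typically a permission error or a read-only or full filesystem.
        problems.append(f"cannot write to {path}: {exc}")
    return problems
```

An empty result rules out the most common permission errors, so the next step is the integrity check.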

## Configuration Regressions After Release

Symptoms:

- Behavior changed immediately after deployment.
- Multiple nodes fail with same error pattern.

Likely causes:

- Incorrect environment variables or appsettings values.
- Partial rollout with incompatible settings.

Resolution:

1. Compare deployed configuration to approved baseline.
2. Roll back to last known-good release if production impact is high.
3. Redeploy with corrected configuration.
4. Document root cause and preventive controls.
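
Step 1 is easier when the comparison is mechanical. The sketch below assumes both the approved baseline and the deployed configuration have been loaded into nested dictionaries, for example with `json.load` on two appsettings files. The helper names `flatten` and `config_drift` are hypothetical, and keys are joined with `:` in the style of .NET configuration key paths.

```python
from typing import Any, Dict, Tuple

# Placeholder reported when a key exists on only one side.
MISSING = "<missing>"


def flatten(config: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dictionaries into colon-separated key paths."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}:{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_drift(baseline: Dict[str, Any], deployed: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key_path: (baseline_value, deployed_value)} for every difference."""
    base_flat, dep_flat = flatten(baseline), flatten(deployed)
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(base_flat) | set(dep_flat)):
        expected = base_flat.get(key, MISSING)
        actual = dep_flat.get(key, MISSING)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift
```

Running this against each node in a partial rollout shows which nodes carry settings that differ from the baseline. Avoid printing the result verbatim if it may contain secrets such as `AuthToken`.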

## Escalation

If a Sev 1/Sev 2 condition cannot be resolved quickly, follow the escalation and incident procedures in the [Runbook](runbook.md).