wwtools/mbproxy/docs/Operations/Troubleshooting.md

# Troubleshooting

Operator diagnosis playbook for mbproxy. Each entry maps an observable symptom to the log event name and status-page counter that confirms it, then lists likely causes and remediation steps.

The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log` on Windows, or `/var/log/mbproxy/mbproxy-<date>.log` on Linux. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the **Windows Application Event Log** (Windows Service) or the **local syslog / journal** (systemd) under source `mbproxy` — view the latter with `journalctl -t mbproxy` or `journalctl -u mbproxy`.

Paths and service commands below are written for Windows (`%ProgramData%`, `sc.exe`); the systemd equivalents are `/etc/mbproxy` + `/var/log/mbproxy` and `systemctl start|stop|status mbproxy`.

## Service Startup Failures

### Listener port already in use

**Symptom.** Service starts but one or more PLCs show `listener.state = "recovering"` on the status page. `plcs[].listener.lastBindError` contains the OS error text (typically `Only one usage of each socket address ... is normally permitted`).

**Where to look.**

- Log event: `mbproxy.startup.bind.failed` (Error) with `Plc`, `Port`, `Reason` properties.
- Followed periodically by retries; success eventually logs `mbproxy.listener.recovered`.
- Status fields: `plcs[].listener.state`, `plcs[].listener.lastBindError`, `plcs[].listener.recoveryAttempts`.

**Root causes.**

- Another mbproxy instance is already running against the same `appsettings.json`.
- A stale `mbproxy` process is holding the port after a non-graceful stop.
- A different service (a previous Modbus gateway, a leftover test harness) is bound to the configured `ListenPort`.

**Remediation.**

1. Identify the process holding the port:

   ```powershell
   netstat -ano | findstr :<port>
   Get-Process -Id <pid>
   ```

2. Stop the conflicting process, or change `Plcs[].ListenPort` in `appsettings.json` and save. The supervisor retries on a Polly schedule (1s / 2s / 5s / 15s / 30s, then 30s indefinitely) — watch for `mbproxy.listener.recovered` to confirm.
3. If the listener never recovers, check the Event Log for the underlying reason text and verify the configured IP address is bound on the host (the proxy binds to the host's IPs, not the PLC's).

### Admin endpoint port collision

**Symptom.** Status page is unreachable; Modbus traffic continues to flow normally.

**Where to look.**

- Log event: `mbproxy.admin.bind.failed` (Error) with `Port`, `Reason` properties.
- The matching success event `mbproxy.admin.started` is absent from the same boot.

**Root causes.**

- Another HTTP service (IIS, a sidecar dashboard, a previous mbproxy instance) is bound to `AdminPort`.
- A firewall rule is rejecting the bind on the configured port.

**Remediation.**

1. The Modbus proxy continues to forward traffic — only telemetry is affected. There is no urgency from a traffic-flow perspective.
2. Identify the process holding the admin port with the same `netstat -ano | findstr :<port>` pattern.
3. Change `Mbproxy.AdminPort` in `appsettings.json` and save. Admin rebinding is hot-reloadable; no service restart is required.

### Malformed appsettings.json at startup

**Symptom.** Service fails to enter the `RUNNING` state. `sc.exe start mbproxy` reports a startup failure.

**Where to look.**

- Rolling log at `C:\ProgramData\mbproxy\logs\` for the most recent date — startup errors include the JSON parse exception with line/column.
- Windows Event Log under source `mbproxy` for the Error-level entry mirrored from the rolling log.

**Root causes.**

- Trailing comma, unbalanced braces, or stray comment in `appsettings.json`.
- A required section (`Plcs`, `BcdTags`) is missing or has the wrong shape.
- A field that must be an integer is quoted as a string.

**Remediation.**

1. Open `C:\ProgramData\mbproxy\appsettings.json` and validate it as JSON (use any editor with JSON linting).
2. Fix the structural error reported in the log and save.
3. Start the service with `sc.exe start mbproxy`.

## Connectivity Failures Between Proxy and PLC

### Backend connect refused

**Symptom.** Upstream clients can connect to the proxy but their reads/writes either return Modbus exception 0x0B or the proxy closes the client socket. `plcs[].backend.connectsFailed` on the status page rises while `connectsSuccess` stays flat.

**Where to look.**

- Log event: `mbproxy.backend.failed` (Warning) with `Plc`, `Reason`.
- Status fields: `plcs[].backend.connectsFailed`, `plcs[].backend.connectsSuccess`.

**Root causes.**

- PLC powered off, rebooting, or its ECOM/EBC coprocessor is faulted.
- Wrong `Host` or `Port` configured for the PLC in `appsettings.json`.
- A network ACL or firewall change is blocking the proxy host from reaching the PLC on TCP 502.
- The H2-ECOM100 already has its cap of 4 simultaneous TCP clients in use — the 5th connection is refused at the device.

**Remediation.**

1. Confirm the PLC is reachable from the proxy host:

   ```powershell
   Test-NetConnection -ComputerName <plc-ip> -Port 502
   ```

2. Verify the host/port in `appsettings.json` matches the PLC's actual settings (see `docs/Reference/mbtcp_settings.JPG` for the as-deployed values).
3. If `Test-NetConnection` succeeds but the proxy still fails, inspect the upstream client count for that PLC on the status page — if it is at 4 and a new connect attempt fires, the ECOM cap is the cause.
4. If the PLC has rebooted, the supervisor retries automatically on the Polly backend-connect pipeline (3 attempts at 100ms / 500ms / 2000ms per upstream request).

### Backend disconnect cascade

**Symptom.** All upstream clients for a single PLC disconnect at the same instant. `plcs[].backend.disconnectCascades` on the status page increments by the number of upstream clients that were attached at the time. Upstream clients reconnect on their next request.

**Where to look.**

- Log event: `mbproxy.multiplex.backend.disconnected` (Warning) with `Plc`, `UpstreamCount`, `InFlightCount`, `Reason`.
- Status field: `plcs[].backend.disconnectCascades`.

**Root causes.**

- PLC rebooted, reset its ECOM, or dropped the TCP session.
- A middlebox (switch, firewall) timed out the idle connection. The DL205/DL260 family does not emit TCP keepalive, so idle paths die silently after typically 2–5 minutes.
- A network event (link flap, switch reset) closed the path.

**Remediation.**

1. Verify the upstream count on the status page returns to normal as clients reconnect — `plcs[].clients.connected` should climb again within seconds.
2. If cascades fire repeatedly against the same PLC, investigate the PLC and intermediate network for stability. The proxy itself has no state to repair.
3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause. Keepalive is enabled by default and should already be preventing this — confirm `Connection.Keepalive.Enabled` is `true` and that `BackendHeartbeatIdleMs` is comfortably below the middlebox idle timeout. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md).

### Backend keepalive heartbeat failing

**Symptom.** A PLC's backend connection is torn down while idle — no client was actively talking to it. `plcs[].backend.backendIdleDisconnects` increments and the upstream clients (if any were attached) are cascaded.

**Where to look.**

- Log events: `mbproxy.keepalive.heartbeat.timeout` (Warning) followed by `mbproxy.keepalive.backend.idle_disconnect` (Information).
- Status fields: `plcs[].backend.backendHeartbeatsSent`, `backendHeartbeatsFailed`, `backendIdleDisconnects`.

**Root causes.**

- The ECOM is reachable at the IP layer but no longer answering Modbus (firmware hang, ECOM reset mid-session).
- The path died between heartbeats and the heartbeat was the first request to discover it — this is the feature working as intended (the failure is found during idle, not on a client request).
- `BackendHeartbeatProbeAddress` points at an address the PLC rejects. The default (0 = `V0`) is safe on DL205/DL260; only an operator override could break this.

**Remediation.**

1. A single idle-disconnect that recovers on the next client request needs no action — the proxy reconnected the path proactively.
2. Repeated idle-disconnects on one PLC mean it keeps going dark while idle. Investigate the device and the network path; the proxy has no state to repair.
3. If `backendHeartbeatsFailed` climbs but the PLC answers real client requests fine, check that `BackendHeartbeatProbeAddress` is a register the device actually serves.

### Request timeout watchdog firing

**Symptom.** Upstream clients receive Modbus exception `0x0B` (Gateway Target Device Failed To Respond) with the original transaction ID preserved. The backend socket stays up — only individual requests time out.

**Where to look.**

- Log event: `mbproxy.multiplex.request.timeout` (Warning) with `Plc`, `ProxyTxId`, `OriginalTxId`, `Fc`, `ElapsedMs`.
- Status field: `plcs[].backend.lastRoundTripMs` (EWMA over recent successful round-trips — climbs as the PLC slows down).

**Root causes.**

- PLC scan time has grown beyond `Connection.BackendRequestTimeoutMs` (default 3000) under load.
- A PLC firmware quirk is dropping responses or echoing the wrong MBAP transaction ID.
- In test environments only, pymodbus 3.13.0's concurrent-multiplexed-request bug delivers the response under a different `OriginalTxId` than was sent — see [`../Testing/Simulator.md`](../Testing/Simulator.md).

**Remediation.**

1. Confirm the PLC is healthy — the EWMA in `plcs[].backend.lastRoundTripMs` should sit well below the configured timeout. If it is creeping up, the PLC itself is overloaded.
2. If the PLC's scan time legitimately exceeds the default, raise `Connection.BackendRequestTimeoutMs`. The change is hot-reloadable; the next request uses the new value.
3. The proxy does not retry timed-out FC06 / FC16 — they are non-idempotent on BCD tags and a partial-applied multi-register write could leave a 32-bit pair mid-transition. Upstream clients are responsible for their own retry policy.

## Configuration Hot-Reload Failures

### Reload rejected

**Symptom.** A save to `%ProgramData%\mbproxy\appsettings.json` is not picked up. The running config stays in effect. `service.configReloadRejectedCount` on the status page increments by one; `service.configLastReloadUtc` does not advance.

**Where to look.**

- Log event: `mbproxy.config.reload.rejected` (Error) with the joined `Errors` string.
- Status fields: `service.configReloadCount`, `service.configReloadRejectedCount`, `service.configLastReloadUtc`.

**Root causes.**

- Malformed JSON (parse error).
- Duplicate `Plcs[].ListenPort` across two PLCs.
- Duplicate BCD `Address` within one tag list.
- A 32-bit BCD pair whose high register overlaps with a separate 16-bit entry at the next address.
- A `CacheTtlMs` (or per-PLC `DefaultCacheTtlMs`) exceeding 60 000 ms without `Cache.AllowLongTtl = true` to opt in.
- Schema error (a required field is missing or has the wrong type).

**Remediation.**

1. Read the `Errors` property on the rejection log event — it lists every validation failure for the rejected file as a single semicolon-separated string.
2. Fix the file and save again. The next valid write is accepted and `mbproxy.config.reload.applied` is logged.
3. Reload is all-or-nothing — there is no partial application. The previous valid config keeps running until the next valid write.
4. For the full validation rule set, see [`../Features/HotReload.md`](../Features/HotReload.md).

## BCD Rewriter Anomalies

### Partial-overlap warnings

**Symptom.** The status page's `plcs[].pdus.partialBcdWarnings` counter rises. Upstream clients see raw nibble values instead of decoded integers for a 32-bit BCD pair.

**Where to look.**

- Log event: `mbproxy.rewrite.partial_bcd` (Warning) with `Plc`, `Address`, `ClientStart`, `ClientQty`.
- Status field: `plcs[].pdus.partialBcdWarnings`.

**Root causes.**

- An upstream client requested quantity = 1 at the low address of a configured 32-bit BCD pair (the proxy will not rewrite half a pair).
- An upstream client read or wrote the high address of a 32-bit pair on its own.
- A client tag definition specifies the wrong word width for the configured BCD address.

**Remediation.**

1. Fix the client's tag definition: 32-bit BCD addresses must be accessed as quantity = 2 starting at the low address.
2. If the client genuinely intends to read half the pair (rare; not a normal workflow), remove the 32-bit entry from the BCD tag list and replace it with a single 16-bit entry at the address the client uses.
3. Background reading: [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) covers the rewriter pipeline and the partial-overlap policy.

### Invalid BCD values

**Symptom.** The status page's `plcs[].pdus` block is normal but `mbproxy.rewrite.invalid_bcd` warnings appear in the log. The affected register passes through raw.

**Where to look.**

- Log event: `mbproxy.rewrite.invalid_bcd` (Warning) with `Plc`, `Address`, `RawValue` (hex), `Direction` (`Read` or `Write`).
- Status field: invalid-BCD warnings are folded into `plcs[].pdus.partialBcdWarnings` only when the warning is partial; pure invalid-BCD events do not have a dedicated counter — log search is the primary diagnostic.

**Root causes.**

- The tag is misconfigured as BCD when the PLC is actually storing binary at that address. `0x1A2B` is not valid BCD because the nibble `A` is outside 0–9.
- The PLC ladder program wrote a non-BCD value to a register configured as a BCD tag (operator bug on the PLC side).
- An upstream client is writing a value the proxy cannot encode into BCD (out-of-range decimal — for example, decimal 9 999 999 into a 16-bit BCD slot whose maximum is 9999).

**Remediation.**

1. Look at the `RawValue` in the log event. If it consistently contains nibbles `A`–`F`, the tag is almost certainly not BCD at the PLC — remove it from the BCD tag list.
2. If the value is occasional and only appears under certain operator actions, inspect the PLC ladder program for code paths that write to that address.
3. If the warning fires on writes (`Direction=Write`), the upstream client is sending a binary integer the proxy cannot represent in BCD. Validate the client-side value range.

## Performance Anomalies

### Backend EWMA latency spiking

**Symptom.** `plcs[].backend.lastRoundTripMs` on the status page climbs from its normal range (typically a few milliseconds on a healthy ECOM) toward `Connection.BackendRequestTimeoutMs`. Eventually some requests start timing out and the request-timeout symptom kicks in.

**Where to look.**

- Status field: `plcs[].backend.lastRoundTripMs` (EWMA with α = 0.2 over recent successful round-trips).
- If timeouts begin, `mbproxy.multiplex.request.timeout` events appear.

**Root causes.**

- PLC is under unusually heavy ladder load and its Modbus scan slot is starving.
- Network congestion between the proxy host and the PLC.
- The PLC is sharing its ECOM module with other Modbus clients (the proxy is not the only consumer).

**Remediation.**

1. Check the PLC's own diagnostics for scan-time growth or watchdog warnings.
2. Verify the proxy is not the only consumer — if the ECOM is also serving another upstream tool, the two are contending for the same serialised processing slot.
3. If the EWMA stabilises at a higher-but-still-safe value, consider raising `Connection.BackendRequestTimeoutMs` so individual slow responses do not start timing out.

### Multiplexer queue depth growing

**Symptom.** `plcs[].backend.queueDepth` on the status page is non-zero and trending up rather than oscillating near zero. The backend is being asked for more frames per unit time than it can serialise.

**Where to look.**

- Status field: `plcs[].backend.queueDepth` (current depth of the outbound channel feeding the backend writer task).
- Cross-reference: `plcs[].clients.connected` (upstream demand) and `plcs[].backend.lastRoundTripMs` (backend service rate).

**Root causes.**

- More concurrent upstream clients are issuing reads than the ECOM's serialised loop can satisfy. The DL205/DL260 family processes Modbus requests one at a time.
- A burst of large FC03/FC04 quantities is consuming wire time per request.

**Remediation.**

1. Enable the Phase-10 read coalescer if it is off: set `Resilience.ReadCoalescing.Enabled = true` in `appsettings.json`. Overlapping FC03/FC04 reads on the same address range fan out from a single backend round-trip — see [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md).
2. Opt heavy-read tags into the response cache by setting `CacheTtlMs > 0` per tag (or `DefaultCacheTtlMs` per PLC). Cache hits never touch the backend — see [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md).
3. Reduce upstream client poll rates against the affected PLC if neither feature is appropriate.
4. Background reading on the per-PLC connection model and the queue's role: [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md).

### Coalescing dead-upstream count rising

**Symptom.** `plcs[].backend.coalescedResponseToDeadUpstream` rises steadily. Coalesced reads complete on the backend, but the upstream client that asked for the read has already disconnected before the response is fanned out.

**Where to look.**

- Log event: `mbproxy.coalesce.dead_upstream` (Debug) with `Plc`, `UnitId`, `Fc`, `Start`, `Qty`.
- Status field: `plcs[].backend.coalescedResponseToDeadUpstream`.

**Root causes.**

- Upstream Modbus clients are configured with a read timeout shorter than the backend's actual response time. They disconnect and reconnect before the response arrives.
- Upstream clients are deliberately short-lived (single-shot pollers that connect, send one request, close).
- Network instability is killing upstream sockets mid-request.

**Remediation.**

1. This metric is informational and often benign. A small steady rate against short-lived pollers is expected.
2. If the rate is high enough to matter, verify upstream client (NModbus, generic Modbus clients) read timeouts exceed `Connection.BackendRequestTimeoutMs` plus a margin for jitter.
3. Cross-check `plcs[].backend.lastRoundTripMs` — if the backend is genuinely slow, the dead-upstream metric is a follow-on symptom of the latency-spike entry above; address that first.

## Response Cache Anomalies

### Cache hit ratio low when expected high

**Symptom.** `plcs[].backend.cacheHitCount` is not rising even though the tag was opted into the cache. Reads are still hitting the backend.

**Where to look.**

- Status fields: `plcs[].backend.cacheHitCount`, `plcs[].backend.cacheMissCount`, `plcs[].backend.cacheEntryCount`.
- Log event: `mbproxy.cache.miss` (Debug) — turn the log level up to confirm misses are firing for the expected addresses.

**Root causes.**

- The tag's `CacheTtlMs` is unset (null) and the per-PLC `DefaultCacheTtlMs` is `0` (the default). Cache is opt-in; absent a positive TTL, every read misses.
- The last config reload was rejected, so the cache TTL change never took effect. Check `service.configReloadRejectedCount`.
- Writes to the same address range are arriving fast enough to invalidate every cached entry before it is reused. Inspect `cacheInvalidations` next.

**Remediation.**

1. Confirm the configured cache fields appear in `/status.json` for the PLC. If `cacheEntryCount` is `0` after a sustained read load, the cache is not wired for that tag.
2. Verify the most recent `service.configLastReloadUtc` matches the time you saved the file. If not, the reload was rejected — see the "Reload rejected" entry above.
3. For the cache wiring rules and per-tag-versus-per-PLC precedence, see [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md).

### Cache invalidations storming

**Symptom.** `plcs[].backend.cacheInvalidations` rises at a rate close to the read rate. Cache hits are happening but each one is followed quickly by an invalidation.

**Where to look.**

- Status fields: `plcs[].backend.cacheInvalidations`, `plcs[].backend.cacheHitCount`, `plcs[].pdus.byFc.fc06`, `plcs[].pdus.byFc.fc16`.
- Log event: `mbproxy.cache.invalidated` (Debug) with `Plc`, `UnitId`, `WriteStart`, `WriteQty`, `Count`.

**Root causes.**

- Frequent FC06 / FC16 writes target the same address range as the cached reads. The cache invalidates correctly on every overlapping write — but if writes outpace reads, the cache provides little net benefit.
- The cached read range is larger than necessary and overlaps with unrelated write traffic.
- The TTL is high relative to the natural data churn rate at the PLC.

**Remediation.**

1. Compare `cacheInvalidations` to `cacheHitCount`. When invalidation rate approaches read rate, the cache is doing its job but is not buying anything.
2. Lower the TTL on the affected tag (or remove it from the cache entirely).
3. Verify the cached read range matches only the addresses the upstream actually needs — narrower ranges reduce overlap with write traffic.

## Service Stop / Restart

### Service will not stop cleanly within the graceful drain

**Symptom.** `sc.exe stop mbproxy` takes the full `Connection.GracefulShutdownTimeoutMs` (default 10 000) or longer to return. The shutdown log line indicates non-zero in-flight requests at cancel time.

**Where to look.**

- Log event: `mbproxy.shutdown.complete` (Information) with `InFlightAtCancel`, `ElapsedMs`.
- Windows Event Log for any Error-level events emitted during the shutdown window.

**Root causes.**

- `Connection.GracefulShutdownTimeoutMs` is shorter than the slowest in-flight request can complete in.
- An in-flight request is stuck because the backend is unresponsive — the request will never return; only the deadline ends it.
- The fleet is in a sustained busy state at the moment of stop, with many in-flight requests, and they cannot all complete within the configured budget.

**Remediation.**

1. Inspect `InFlightAtCancel` on the shutdown log line. Zero means the drain succeeded; non-zero means that many requests were cancelled by the deadline.
2. Raise `Connection.GracefulShutdownTimeoutMs` if a slow-but-healthy backend is the cause. The change applies on the next `ApplicationStopping` event — restart the service to pick it up.
3. If non-zero `InFlightAtCancel` correlates with `mbproxy.multiplex.request.timeout` events in the same window, the backend was unresponsive at stop time and no timeout extension would have helped — the proxy correctly proceeded with shutdown.
4. The drain phase cancels remaining work cleanly; the service always reaches `STOPPED`. Persistent failure to reach `STOPPED` indicates a Windows service-control issue, not an mbproxy issue.

## Related Documentation

- [Status page](./StatusPage.md)
- [Configuration](./Configuration.md)
- [Log event catalogue](../Reference/LogEvents.md)
- [Connection model](../Architecture/ConnectionModel.md)
- [Response cache](../Architecture/ResponseCache.md)
- [Hot reload validation rules](../Features/HotReload.md)
- [BCD rewriting](../Features/BcdRewriting.md)
- [Read coalescing](../Architecture/ReadCoalescing.md)
- [pymodbus simulator quirks](../Testing/Simulator.md)