feat: add resilient reconnect and catch-up replay
225
docs/plans/2026-03-17-runtime-reconnect-design.md
Normal file
@@ -0,0 +1,225 @@
# SuiteLink Runtime Reconnect Design

## Goal

Add a background receive loop and automatic reconnect/recovery to the existing SuiteLink client so subscriptions are restored automatically and update callbacks resume without caller intervention.

## Scope

This design adds:

- a background receive loop owned by `SuiteLinkClient`
- automatic reconnect with bounded retry backoff
- automatic subscription replay after reconnect
- resumed update dispatch after replay

This design does not add:

- write queuing during reconnect
- catch-up replay of missed values
- secure SuiteLink V3/TLS support
- AlarmMgr support

## Runtime Model

The current client uses explicit on-demand inbound processing. The new model shifts normal operation to a managed runtime loop.

There are two categories of state:

- durable desired state
  - configured connection options
  - caller subscription intent
  - callbacks associated with subscribed items
- ephemeral connection state
  - current transport connection
  - current session state
  - current `itemName <-> tagId` mappings

Durable state survives reconnects. Ephemeral state is rebuilt on reconnect.
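
The split between the two categories can be illustrated with a small, language-neutral sketch. It is written in Python purely for brevity; `DurableState`, `EphemeralState`, their fields, and the host/port values are hypothetical and are not taken from the client's code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class DurableState:
    """Survives reconnects: what the caller asked for."""
    host: str                      # configured connection options (placeholder values below)
    port: int
    # itemName -> callback: caller subscription intent plus its callback
    subscriptions: Dict[str, Callable[[object], None]] = field(default_factory=dict)


@dataclass
class EphemeralState:
    """Rebuilt on every (re)connect: what the current wire session knows."""
    transport: Optional[object] = None
    session_ready: bool = False
    tag_by_item: Dict[str, int] = field(default_factory=dict)
    item_by_tag: Dict[int, str] = field(default_factory=dict)


durable = DurableState(host="example-host", port=5413)
durable.subscriptions["Tank1.Level"] = lambda value: None

ephemeral = EphemeralState(tag_by_item={"Tank1.Level": 7}, item_by_tag={7: "Tank1.Level"})

# A connection failure discards the ephemeral state wholesale...
ephemeral = EphemeralState()
# ...while the durable state is untouched and can drive subscription replay.
```

Keeping the two in separate objects means recovery never has to decide field by field what to reset: the ephemeral object is simply replaced.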

## Recommended Approach

Implement a supervised background receive loop inside `SuiteLinkClient`.

Behavior:

1. `ConnectAsync` establishes the initial transport/session and starts the receive loop.
2. The receive loop reads frames continuously.
3. Update frames are decoded and dispatched to user callbacks.
4. EOF, transport exceptions, malformed frames, or replay failures trigger recovery.
5. Recovery reconnects with bounded retry delays.
6. After reconnect succeeds, the client replays all current subscriptions and resumes dispatching.

This keeps the public API simple and avoids forcing callers to manually poll `ProcessIncomingAsync`.
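
The numbered behavior above can be sketched as a single supervised loop. This is a minimal Python illustration, not the client's implementation: `FakeTransport`, `run_supervised`, and the `max_sessions` bound are hypothetical and exist only so the example terminates. Retry delays and subscription replay are omitted here.

```python
import asyncio


class FakeTransport:
    """Stand-in for a socket: yields queued frames, then b"" to signal EOF."""

    def __init__(self, frames):
        self._frames = list(frames)

    async def read_frame(self) -> bytes:
        await asyncio.sleep(0)
        return self._frames.pop(0) if self._frames else b""


async def run_supervised(connect, on_update, max_sessions):
    """Receive loop with recovery; max_sessions only bounds the demo."""
    sessions = 0
    while sessions < max_sessions:
        transport = await connect()                # steps 1 and 5: (re)connect
        sessions += 1
        # Step 6 would replay subscriptions here before reading resumes.
        while True:
            frame = await transport.read_frame()   # step 2: read continuously
            if not frame:                          # step 4: EOF triggers recovery
                break
            on_update(frame)                       # step 3: dispatch the update
    return sessions


def demo():
    scripts = [[b"a", b"b"], [b"c"]]               # frames for two sessions
    received = []

    async def connect():
        return FakeTransport(scripts.pop(0))

    sessions = asyncio.run(run_supervised(connect, received.append, max_sessions=2))
    return sessions, received
```

In this sketch `demo()` connects twice and delivers all three frames in order, which is the caller-visible effect the design is after: the callback keeps receiving updates across an EOF without the caller doing anything.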

## State Model

Expand the session/client lifecycle to distinguish pending, ready, and reconnecting states:

- `Disconnected`
- `Connecting`
- `ConnectSent`
- `Ready`
- `Reconnecting`
- `Faulted`
- `Disposed`

Definitions:

- `Connecting`: transport connect + handshake in progress
- `ConnectSent`: startup connect has been sent but the runtime is not yet considered ready
- `Ready`: background receive loop active and subscriptions can be served normally
- `Reconnecting`: recovery loop active after a connection failure

`IsConnected` should reflect `Ready` only.
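
As a rough illustration of the `IsConnected` rule, the lifecycle can be modeled as an enumeration with a single predicate. The Python names below (`ClientState`, `is_connected`) are hypothetical stand-ins for whatever the client actually uses.

```python
from enum import Enum, auto


class ClientState(Enum):
    DISCONNECTED = auto()
    CONNECTING = auto()
    CONNECT_SENT = auto()
    READY = auto()
    RECONNECTING = auto()
    FAULTED = auto()
    DISPOSED = auto()


def is_connected(state: ClientState) -> bool:
    """Only READY counts as connected; CONNECT_SENT and RECONNECTING do not."""
    return state is ClientState.READY
```

Deriving the flag from the state, rather than storing it separately, avoids the two drifting apart during recovery.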

## Recovery Policy

Failure triggers:

- transport read returns `0`
- transport exception while sending or receiving
- malformed or unexpected frame during active runtime
- reconnect replay failure

Recovery behavior:

- stop the current receive loop
- mark the ephemeral session as disconnected/faulted
- start reconnect attempts until success or explicit shutdown

Retry schedule:

- first retry immediately
- then bounded retry delays such as:
  - 1 second
  - 2 seconds
  - 5 seconds
  - 10 seconds
- cap the delay instead of growing without bound

Writes during `Reconnecting` are rejected with a clear exception.
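
One way to express the retry schedule is a generator that walks the example delays once and then repeats the cap. This is a sketch under the assumption that the last listed delay (10 seconds) is the cap; `RETRY_DELAYS_SECONDS`, `retry_delays`, and `first_delays` are hypothetical names.

```python
from itertools import chain, islice, repeat
from typing import Iterator, List

# Example values from the schedule above; 0.0 is the immediate first retry
# and the final entry is treated as the cap (an assumption of this sketch).
RETRY_DELAYS_SECONDS = (0.0, 1.0, 2.0, 5.0, 10.0)


def retry_delays() -> Iterator[float]:
    """Yield each configured delay once, then repeat the cap forever."""
    return chain(RETRY_DELAYS_SECONDS, repeat(RETRY_DELAYS_SECONDS[-1]))


def first_delays(count: int) -> List[float]:
    """Convenience helper for inspecting the start of the schedule."""
    return list(islice(retry_delays(), count))
```

For example, `first_delays(7)` gives `[0.0, 1.0, 2.0, 5.0, 10.0, 10.0, 10.0]`. A reconnect loop would create a fresh iterator for each recovery so that a later failure starts again from the immediate retry.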

## Subscription Replay

The client should maintain a durable subscription registry keyed by `itemName`.

Each entry stores:

- `itemName`
- callback
- requested tag id

During reconnect:

1. reconnect transport
2. send handshake
3. send connect
4. replay every subscribed item via `ADVISE`
5. rebuild live session mappings from fresh ACKs
6. transition to `Ready`

Subscription replay is serialized and must not run concurrently with normal writes or new replay attempts.
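
Steps 4 and 5, together with the serialization rule, might look roughly like the following Python sketch. Everything here is hypothetical: `SubscriptionEntry`, `Replayer`, and the `advise` coroutine stand in for the real registry and wire helpers, and the fake `ADVISE` simply echoes the requested tag id as if it came back in an ACK.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict


@dataclass
class SubscriptionEntry:
    item_name: str
    callback: Callable[[object], None]
    requested_tag_id: int


class Replayer:
    def __init__(self, advise: Callable[[str, int], Awaitable[int]]) -> None:
        self._advise = advise            # hypothetical wire-level ADVISE helper
        self._gate = asyncio.Lock()      # writes would take the same gate
        self.registry: Dict[str, SubscriptionEntry] = {}   # durable
        self.tag_by_item: Dict[str, int] = {}              # ephemeral

    async def replay(self) -> None:
        async with self._gate:           # one replay at a time, no interleaved writes
            fresh: Dict[str, int] = {}
            for entry in list(self.registry.values()):
                acked = await self._advise(entry.item_name, entry.requested_tag_id)
                fresh[entry.item_name] = acked
            self.tag_by_item = fresh     # swap in only after every ACK arrived


async def _demo():
    sent = []

    async def fake_advise(item_name: str, tag_id: int) -> int:
        sent.append(item_name)
        return tag_id                    # pretend the ACK carries the same id

    replayer = Replayer(fake_advise)
    replayer.registry["A"] = SubscriptionEntry("A", print, 1)
    replayer.registry["B"] = SubscriptionEntry("B", print, 2)
    await replayer.replay()
    return sent, replayer.tag_by_item


def demo():
    return asyncio.run(_demo())
```

Building the new mapping in a local dictionary and assigning it at the end means a replay that fails halfway leaves no partially rebuilt mapping behind, which fits the rule that a replay failure triggers another recovery pass.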

## Callback Rules

Callbacks must never run under client locks or gates.

Rules:

- decode frames under internal synchronization
- dispatch callbacks only after releasing gates
- callback exceptions remain contained and do not crash the receive loop
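
The three rules can be sketched as follows. This is an illustrative Python example with hypothetical names (`Dispatcher`, `on_update`); frame decoding is reduced to a callback lookup, which is the part that needs the lock.

```python
import threading
from typing import Callable, Dict, List


class Dispatcher:
    def __init__(self) -> None:
        self._lock = threading.Lock()    # non-reentrant on purpose
        self._callbacks: Dict[int, Callable[[object], None]] = {}
        self.errors: List[Exception] = []

    def register(self, tag_id: int, callback: Callable[[object], None]) -> None:
        with self._lock:
            self._callbacks[tag_id] = callback

    def on_update(self, tag_id: int, value: object) -> None:
        with self._lock:                 # decode / look up under the lock
            callback = self._callbacks.get(tag_id)
        if callback is None:
            return
        try:
            callback(value)              # the lock is already released here
        except Exception as exc:         # contain callback failures
            self.errors.append(exc)


def demo():
    dispatcher = Dispatcher()
    seen = []

    def reentrant(value):
        seen.append(value)
        dispatcher.register(2, seen.append)   # would deadlock if the lock were held

    def failing(value):
        raise ValueError("callback bug")

    dispatcher.register(1, reentrant)
    dispatcher.register(3, failing)
    dispatcher.on_update(1, "x")
    dispatcher.on_update(3, "ignored")
    dispatcher.on_update(2, "y")
    return seen, len(dispatcher.errors)
```

The re-entrant callback is the case the first rule protects against: a user callback that subscribes or unsubscribes from inside its own invocation must not block on a gate the dispatcher still holds.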

## Public API Effects

Expected public behavior:

- `ConnectAsync`
  - establishes initial runtime and starts background receive
- `SubscribeAsync`
  - records durable intent
  - advises immediately when ready
  - keeps durable subscription for replay after reconnect
- `ReadAsync`
  - can remain implemented as a temporary subscription
  - should still use the background runtime instead of manual caller polling
- `WriteAsync`
  - allowed only in `Ready`
  - fails during `Reconnecting`
- `DisconnectAsync`
  - stops receive and reconnect tasks
  - tears down transport

`ProcessIncomingAsync` should stop being the primary runtime API. It can be retained only as an internal/test helper if still useful.
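
The `WriteAsync` rule listed above amounts to a state check before anything touches the wire. A minimal Python sketch, assuming a hypothetical exception type `ClientNotReadyError` (the design only asks for a clear exception, not a specific one):

```python
from enum import Enum, auto


class ClientState(Enum):
    READY = auto()
    RECONNECTING = auto()


class ClientNotReadyError(RuntimeError):
    """Hypothetical exception type used only in this sketch."""


def ensure_writable(state: ClientState) -> None:
    """Raise unless the client is READY; writes are never queued."""
    if state is not ClientState.READY:
        raise ClientNotReadyError(
            f"write rejected: client is {state.name}, expected READY"
        )


def try_write(state: ClientState) -> str:
    try:
        ensure_writable(state)
    except ClientNotReadyError as exc:
        return str(exc)
    return "sent"
```

Failing fast keeps the behavior consistent with the scope statement that write queuing during reconnect is not part of this design.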

## Internal Changes

### `SuiteLinkClient`

Add:

- receive loop task
- reconnect supervisor task or integrated recovery loop
- cancellation tokens for runtime shutdown
- durable subscription registry
- reconnect backoff helper

Responsibilities:

- own runtime lifecycle
- coordinate reconnect attempts
- replay subscriptions safely
- ensure only one receive loop and one reconnect flow are active
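
The last responsibility, a single active reconnect flow, can be sketched as a guard that starts a recovery task only when none is running. The Python names below (`ReconnectSupervisor`, `notify_failure`) are hypothetical and the reconnect body is a stub.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class ReconnectSupervisor:
    def __init__(self, reconnect: Callable[[], Awaitable[None]]) -> None:
        self._reconnect = reconnect
        self._task: Optional[asyncio.Task] = None

    def notify_failure(self) -> asyncio.Task:
        """Start recovery unless one is already in flight.

        Must be called from the event loop thread.
        """
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._reconnect())
        return self._task


async def _demo():
    starts = 0

    async def reconnect() -> None:
        nonlocal starts
        starts += 1
        await asyncio.sleep(0)           # pretend to do network work

    supervisor = ReconnectSupervisor(reconnect)
    first = supervisor.notify_failure()  # e.g. reported by the receive loop
    second = supervisor.notify_failure() # e.g. reported by a failed send
    await first
    return starts, first is second


def demo():
    return asyncio.run(_demo())
```

Both failure reports resolve to the same task, so a send failure and a receive failure arriving together still produce one recovery pass.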

### `SuiteLinkSession`

Continue to manage:

- live connection/session state
- current `itemName <-> tagId` mappings
- live dispatch helpers

Do not make it responsible for durable reconnect intent.

### `SubscriptionHandle`

Should continue to remove durable subscription intent and trigger `UNADVISE` when possible.

If called during reconnect/disconnect, removal of durable intent still succeeds even if wire unadvise cannot be sent.
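
A small Python sketch of that ordering: durable intent is removed first and unconditionally, and the wire `UNADVISE` is best-effort. The constructor parameters and the use of `ConnectionError` to model an unavailable transport are assumptions made for the example.

```python
from typing import Callable, Dict


class SubscriptionHandle:
    def __init__(
        self,
        item_name: str,
        registry: Dict[str, object],
        send_unadvise: Callable[[str], None],
    ) -> None:
        self._item_name = item_name
        self._registry = registry            # durable subscription registry
        self._send_unadvise = send_unadvise  # hypothetical wire helper
        self._disposed = False

    def dispose(self) -> None:
        if self._disposed:                   # repeated disposal is a no-op
            return
        self._disposed = True
        self._registry.pop(self._item_name, None)   # always succeeds
        try:
            self._send_unadvise(self._item_name)    # only when the wire is usable
        except ConnectionError:
            pass   # nothing to undo: the item will simply not be replayed


def demo() -> bool:
    registry = {"Tank1.Level": object()}

    def offline_unadvise(item_name: str) -> None:
        raise ConnectionError("transport is down")

    handle = SubscriptionHandle("Tank1.Level", registry, offline_unadvise)
    handle.dispose()
    handle.dispose()
    return "Tank1.Level" in registry
```

Because replay is driven by the durable registry, removing the entry is enough to guarantee the item is not re-advised after the next reconnect, even if the `UNADVISE` never reached the server.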

## Testing Strategy

### Runtime Loop Tests

Add tests proving:

- updates received by the background loop reach callbacks
- no manual `ProcessIncomingAsync` call is needed in normal operation

### Recovery Tests

Add tests proving:

- EOF triggers reconnect
- reconnect replays handshake/connect/subscriptions
- callback dispatch resumes after reconnect
- writes during reconnect fail predictably

### Lifecycle Tests

Add tests proving:

- `DisconnectAsync` stops background tasks
- `DisposeAsync` stops reconnect attempts
- repeated failures do not start multiple reconnect loops

## Recommended Next Step

Create an implementation plan that breaks this into small tasks:

- durable subscription registry
- background receive loop
- reconnect loop and backoff
- replay logic
- runtime tests