suitelinkclient/docs/plans/2026-03-17-catchup-retry-implementation-plan.md
2026-03-17 11:04:19 -04:00

Catch-Up Replay And Advanced Retry Implementation Plan

For Codex: REQUIRED SUB-SKILL: Use executeplan to implement this plan task-by-task.

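Goal: Add best-effort latest-value catch-up after reconnect, and replace the fixed reconnect delay schedule with a configurable exponential-backoff retry policy (capped, optionally bounded, jittered by default). Also fix two current reconnect defects: writes that wait on the operation gate before failing, and transport resets that dispose caller-owned streams.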

Architecture: Extend the existing reconnect runtime with a small runtime-options layer, a retry-policy calculator, and a post-reconnect catch-up refresh phase. Keep reconnect success defined as restored live subscriptions, and treat catch-up as a best-effort follow-on phase that emits synthetic updates marked separately from live traffic.

Tech Stack: .NET 10, C#, xUnit, existing SuiteLink protocol/client/runtime/transport layers


Task 1: Add Runtime Option Types

Files:

  • Create: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkRuntimeOptions.cs
  • Create: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkRetryPolicy.cs
  • Create: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkCatchUpPolicy.cs
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkConnectionOptions.cs
  • Test: /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkConnectionOptionsTests.cs

Step 1: Write the failing test

[Fact]
public void ConnectionOptions_DefaultsRuntimeOptions()
{
    var options = new SuiteLinkConnectionOptions(
        host: "127.0.0.1",
        application: "App",
        topic: "Topic",
        clientName: "Client",
        clientNode: "Node",
        userName: "User",
        serverNode: "Server");

    Assert.NotNull(options.Runtime);
    Assert.Equal(SuiteLinkCatchUpPolicy.None, options.Runtime.CatchUpPolicy);
    Assert.NotNull(options.Runtime.RetryPolicy);
}

Step 2: Run test to verify it fails

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter ConnectionOptions_DefaultsRuntimeOptions -v minimal
Expected: FAIL because runtime options do not exist yet

Step 3: Write minimal implementation

Create:

public enum SuiteLinkCatchUpPolicy
{
    None = 0,
    RefreshLatestValue = 1
}
public sealed record class SuiteLinkRetryPolicy(
    TimeSpan InitialDelay,
    double Multiplier,
    TimeSpan MaxDelay,
    int? MaxAttempts = null,
    bool UseJitter = true)
{
    public static SuiteLinkRetryPolicy Default { get; } =
        new(TimeSpan.FromSeconds(1), 2.0, TimeSpan.FromSeconds(30));
}
public sealed record class SuiteLinkRuntimeOptions(
    SuiteLinkRetryPolicy RetryPolicy,
    SuiteLinkCatchUpPolicy CatchUpPolicy,
    TimeSpan CatchUpTimeout)
{
    public static SuiteLinkRuntimeOptions Default { get; } =
        new(SuiteLinkRetryPolicy.Default, SuiteLinkCatchUpPolicy.None, TimeSpan.FromSeconds(2));
}

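Update SuiteLinkConnectionOptions to expose:

public SuiteLinkRuntimeOptions Runtime { get; init; } = SuiteLinkRuntimeOptions.Default;

The init accessor is required because the Task 3 test sets Runtime through a with expression. That expression also requires SuiteLinkConnectionOptions to be a record type.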
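The sketch below shows how a caller might opt into catch-up and bound retries once these types exist. The three types are repeated from the step above so the snippet compiles on its own. RuntimeOptionsExample and WithCatchUpAndBoundedRetries are hypothetical names used only for illustration. Positional record properties are init-only, so a with expression produces a modified copy and leaves the shared Default instances untouched.

```csharp
using System;

// Repeated from Task 1 so this sketch is self-contained.
public enum SuiteLinkCatchUpPolicy
{
    None = 0,
    RefreshLatestValue = 1
}

public sealed record class SuiteLinkRetryPolicy(
    TimeSpan InitialDelay,
    double Multiplier,
    TimeSpan MaxDelay,
    int? MaxAttempts = null,
    bool UseJitter = true)
{
    public static SuiteLinkRetryPolicy Default { get; } =
        new(TimeSpan.FromSeconds(1), 2.0, TimeSpan.FromSeconds(30));
}

public sealed record class SuiteLinkRuntimeOptions(
    SuiteLinkRetryPolicy RetryPolicy,
    SuiteLinkCatchUpPolicy CatchUpPolicy,
    TimeSpan CatchUpTimeout)
{
    public static SuiteLinkRuntimeOptions Default { get; } =
        new(SuiteLinkRetryPolicy.Default, SuiteLinkCatchUpPolicy.None, TimeSpan.FromSeconds(2));
}

// Hypothetical caller-side helper: enables catch-up and caps retry attempts.
// `with` copies the records, so the shared Default instances are never mutated.
public static class RuntimeOptionsExample
{
    public static SuiteLinkRuntimeOptions WithCatchUpAndBoundedRetries(int maxAttempts) =>
        SuiteLinkRuntimeOptions.Default with
        {
            CatchUpPolicy = SuiteLinkCatchUpPolicy.RefreshLatestValue,
            RetryPolicy = SuiteLinkRetryPolicy.Default with { MaxAttempts = maxAttempts }
        };
}
```

The CatchUpTimeout value is inherited unchanged from Default. Only the overridden members differ.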

Step 4: Run test to verify it passes

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkConnectionOptionsTests -v minimal
Expected: PASS

Step 5: Commit

git add /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkRuntimeOptions.cs /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkRetryPolicy.cs /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkCatchUpPolicy.cs /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkConnectionOptions.cs /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkConnectionOptionsTests.cs
git commit -m "feat: add runtime reconnect option types"

Task 2: Add Retry Policy Delay Calculator

Files:

  • Create: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/Internal/SuiteLinkRetryDelayCalculator.cs
  • Test: /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/Internal/SuiteLinkRetryDelayCalculatorTests.cs

Step 1: Write the failing test

[Fact]
public void GetDelay_UsesImmediateThenExponentialCap()
{
    var policy = new SuiteLinkRetryPolicy(
        InitialDelay: TimeSpan.FromSeconds(1),
        Multiplier: 2.0,
        MaxDelay: TimeSpan.FromSeconds(30),
        UseJitter: false);

    Assert.Equal(TimeSpan.Zero, SuiteLinkRetryDelayCalculator.GetDelay(policy, 0));
    Assert.Equal(TimeSpan.FromSeconds(1), SuiteLinkRetryDelayCalculator.GetDelay(policy, 1));
    Assert.Equal(TimeSpan.FromSeconds(2), SuiteLinkRetryDelayCalculator.GetDelay(policy, 2));
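    Assert.Equal(TimeSpan.FromSeconds(4), SuiteLinkRetryDelayCalculator.GetDelay(policy, 3));
    Assert.Equal(TimeSpan.FromSeconds(30), SuiteLinkRetryDelayCalculator.GetDelay(policy, 10)); // 1s * 2^9 = 512s, capped at 30s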
}

Step 2: Run test to verify it fails

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkRetryDelayCalculatorTests -v minimal
Expected: FAIL because the calculator does not exist yet

Step 3: Write minimal implementation

Create:

internal static class SuiteLinkRetryDelayCalculator
{
    public static TimeSpan GetDelay(SuiteLinkRetryPolicy policy, int attempt)
    {
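        // Attempt 0 is the immediate first retry; negative values are treated the same way.
        if (attempt <= 0)
        {
            return TimeSpan.Zero;
        }

        // InitialDelay * Multiplier^(attempt - 1), clamped to MaxDelay.
        var rawSeconds = policy.InitialDelay.TotalSeconds * Math.Pow(policy.Multiplier, attempt - 1);
        return TimeSpan.FromSeconds(Math.Min(rawSeconds, policy.MaxDelay.TotalSeconds));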
    }
}

Ignore UseJitter for now. Jitter is added in Task 9 with an injectable random source.
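Here is a worked example of the schedule this produces with jitter disabled. For the Default policy (1s initial delay, multiplier 2.0, 30s cap), attempts 0 through 7 wait 0, 1, 2, 4, 8, 16, 30 and 30 seconds. Attempt 6 would be 1s * 2^5 = 32s, which is the first value above the cap. The sketch reuses the calculator from the step above, with comments added. Treating negative attempts like attempt 0 is an assumption. RetryScheduleExample is a hypothetical helper that lists the delays.

```csharp
using System;
using System.Collections.Generic;

public sealed record class SuiteLinkRetryPolicy(
    TimeSpan InitialDelay,
    double Multiplier,
    TimeSpan MaxDelay,
    int? MaxAttempts = null,
    bool UseJitter = true);

internal static class SuiteLinkRetryDelayCalculator
{
    public static TimeSpan GetDelay(SuiteLinkRetryPolicy policy, int attempt)
    {
        // Attempt 0 is the immediate first retry (assumption: negatives behave the same).
        if (attempt <= 0)
        {
            return TimeSpan.Zero;
        }

        // InitialDelay * Multiplier^(attempt - 1). For a positive InitialDelay,
        // Math.Pow may overflow to +Infinity on huge attempts; Math.Min still clamps it.
        var rawSeconds = policy.InitialDelay.TotalSeconds * Math.Pow(policy.Multiplier, attempt - 1);
        return TimeSpan.FromSeconds(Math.Min(rawSeconds, policy.MaxDelay.TotalSeconds));
    }
}

// Hypothetical helper: lists the delays for attempts 0..count-1.
public static class RetryScheduleExample
{
    public static List<TimeSpan> Schedule(SuiteLinkRetryPolicy policy, int count)
    {
        var delays = new List<TimeSpan>(count);
        for (var attempt = 0; attempt < count; attempt++)
        {
            delays.Add(SuiteLinkRetryDelayCalculator.GetDelay(policy, attempt));
        }

        return delays;
    }
}
```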

Step 4: Run test to verify it passes

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkRetryDelayCalculatorTests -v minimal
Expected: PASS

Step 5: Commit

git add /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/Internal/SuiteLinkRetryDelayCalculator.cs /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/Internal/SuiteLinkRetryDelayCalculatorTests.cs
git commit -m "feat: add reconnect retry delay calculator"

Task 3: Wire Retry Policy Into Reconnect Runtime

Files:

  • Modify: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkClient.cs
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkClientReconnectTests.cs

Step 1: Write the failing test

[Fact]
public async Task Reconnect_UsesConfiguredRetryPolicy()
{
    var observed = new List<TimeSpan>();
    var options = CreateOptions() with
    {
        Runtime = new SuiteLinkRuntimeOptions(
            new SuiteLinkRetryPolicy(TimeSpan.FromSeconds(3), 3.0, TimeSpan.FromSeconds(20), UseJitter: false),
            SuiteLinkCatchUpPolicy.None,
            TimeSpan.FromSeconds(2))
    };

    var client = CreateReconnectClient(delayAsync: (delay, _) =>
    {
        observed.Add(delay);
        return Task.CompletedTask;
    });

    await client.ConnectAsync(options);
    await EventuallyReconnectAsync(client);

    Assert.Contains(TimeSpan.FromSeconds(3), observed);
}

Step 2: Run test to verify it fails

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter Reconnect_UsesConfiguredRetryPolicy -v minimal
Expected: FAIL because reconnect still uses a fixed schedule

Step 3: Write minimal implementation

In SuiteLinkClient:

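  • remove direct use of ReconnectDelaySchedule
  • read the retry policy from _connectionOptions!.Runtime.RetryPolicy
  • compute each wait with SuiteLinkRetryDelayCalculator.GetDelay(policy, attempt)
  • stop retrying once MaxAttempts attempts have been made, when MaxAttempts is set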

Keep the current injected _delayAsync test seam.
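The sketch below shows one possible shape for the reconnect loop after this change. Attempt numbers start at 0, and each attempt first waits GetDelay(policy, attempt) through the injected delay delegate. The loop stops after MaxAttempts attempts when that value is set. ReconnectLoopSketch and tryReconnectAsync are hypothetical, as is counting the immediate first attempt toward MaxAttempts; this is not the client's actual code. The sketch also shows why the Task 3 test can observe a 3s delay only if the fake reconnect fails at least once: a first-attempt success records only TimeSpan.Zero.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed record class SuiteLinkRetryPolicy(
    TimeSpan InitialDelay, double Multiplier, TimeSpan MaxDelay,
    int? MaxAttempts = null, bool UseJitter = true);

public static class ReconnectLoopSketch
{
    // Same formula as the Task 2 calculator, without jitter.
    private static TimeSpan GetDelay(SuiteLinkRetryPolicy policy, int attempt) =>
        attempt <= 0
            ? TimeSpan.Zero
            : TimeSpan.FromSeconds(Math.Min(
                policy.InitialDelay.TotalSeconds * Math.Pow(policy.Multiplier, attempt - 1),
                policy.MaxDelay.TotalSeconds));

    // tryReconnectAsync stands in for the real reconnect step (hypothetical);
    // delayAsync mirrors the client's injected _delayAsync test seam.
    public static async Task<bool> RunAsync(
        SuiteLinkRetryPolicy policy,
        Func<CancellationToken, Task<bool>> tryReconnectAsync,
        Func<TimeSpan, CancellationToken, Task> delayAsync,
        CancellationToken cancellationToken)
    {
        for (var attempt = 0; policy.MaxAttempts is null || attempt < policy.MaxAttempts.Value; attempt++)
        {
            await delayAsync(GetDelay(policy, attempt), cancellationToken).ConfigureAwait(false);
            if (await tryReconnectAsync(cancellationToken).ConfigureAwait(false))
            {
                return true;
            }
        }

        // MaxAttempts exhausted without a successful reconnect.
        return false;
    }
}
```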

Step 4: Run test to verify it passes

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkClientReconnectTests -v minimal
Expected: PASS

Step 5: Commit

git add /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkClient.cs /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkClientReconnectTests.cs
git commit -m "feat: apply retry policy to reconnect runtime"

Task 4: Fix Fast-Fail Writes During Reconnect

Files:

  • Modify: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkClient.cs
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkClientWriteTests.cs

Step 1: Write the failing test

[Fact]
public async Task WriteAsync_DuringReconnect_ThrowsBeforeWaitingOnOperationGate()
{
    var client = CreateClientWithBlockedOperationGateAndReconnectState();

    var ex = await Assert.ThrowsAsync<InvalidOperationException>(
        () => client.WriteAsync("Pump001.Run", SuiteLinkValue.FromBoolean(true)));

    Assert.Contains("reconnecting", ex.Message, StringComparison.OrdinalIgnoreCase);
}

Step 2: Run test to verify it fails

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter WriteAsync_DuringReconnect_ThrowsBeforeWaitingOnOperationGate -v minimal
Expected: FAIL because WriteAsync currently waits on _operationGate first

Step 3: Write minimal implementation

Move the reconnect state check ahead of:

await _operationGate.WaitAsync(...)

while keeping disposed-state checks intact.
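Here is a minimal sketch of the intended ordering in WriteAsync, assuming the client tracks reconnect state in a simple flag. FastFailWriteSketch, _reconnecting and BeginReconnect are hypothetical; only _operationGate comes from the plan. The second check under the gate is an extra guard for a reconnect that starts while the caller is waiting. The plan does not require it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the client's write path.
public sealed class FastFailWriteSketch : IDisposable
{
    private readonly SemaphoreSlim _operationGate = new(1, 1);
    private volatile bool _disposed;
    private volatile bool _reconnecting;

    public SemaphoreSlim OperationGate => _operationGate; // exposed only so the example can block it
    public void BeginReconnect() => _reconnecting = true;

    public async Task WriteAsync(string itemName, object value, CancellationToken cancellationToken = default)
    {
        // 1. The disposed-state check stays first, unchanged.
        ObjectDisposedException.ThrowIf(_disposed, this);

        // 2. The reconnect check now runs before any gate wait, so callers fail fast
        //    instead of queuing behind the reconnect loop that holds the gate.
        if (_reconnecting)
        {
            throw new InvalidOperationException("Cannot write while the client is reconnecting.");
        }

        await _operationGate.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            // 3. Optional re-check: a reconnect may have started while this call waited.
            if (_reconnecting)
            {
                throw new InvalidOperationException("Cannot write while the client is reconnecting.");
            }

            // ... encode and send the write for itemName/value here ...
        }
        finally
        {
            _operationGate.Release();
        }
    }

    public void Dispose()
    {
        _disposed = true;
        _operationGate.Dispose();
    }
}
```

Because the check runs before the first await, the returned task is already faulted when WriteAsync returns, even while another operation holds the gate.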

Step 4: Run test to verify it passes

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkClientWriteTests -v minimal
Expected: PASS

Step 5: Commit

git add /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkClient.cs /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkClientWriteTests.cs
git commit -m "fix: fail writes before reconnect gate contention"

Task 5: Fix Transport Reset Ownership Semantics

Files:

  • Modify: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/Transport/SuiteLinkTcpTransport.cs
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/Transport/ISuiteLinkReconnectableTransport.cs
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/Transport/SuiteLinkTcpTransportTests.cs

Step 1: Write the failing test

[Fact]
public async Task ResetConnectionAsync_LeaveOpenTrue_DoesNotDisposeInjectedStream()
{
    var stream = new TrackingStream();
    await using var transport = new SuiteLinkTcpTransport(stream, leaveOpen: true);

    await transport.ResetConnectionAsync();

    Assert.False(stream.WasDisposed);
}

Step 2: Run test to verify it fails

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter ResetConnectionAsync_LeaveOpenTrue_DoesNotDisposeInjectedStream -v minimal
Expected: FAIL because reset currently disposes caller-owned resources

Step 3: Write minimal implementation

Update ResetConnectionAsync to respect the same ownership rule as DisposeAsync:

  • if leaveOpen is true, detach without disposing injected resources
  • if leaveOpen is false, dispose detached resources

Do not broaden interface scope unnecessarily.
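The sketch below applies the ownership rule through a single detach path shared by reset and dispose, so the two cannot drift apart. OwnershipAwareTransportSketch is a hypothetical, trimmed-down stand-in for SuiteLinkTcpTransport. It assumes the transport holds a single Stream and a leaveOpen flag set at construction.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical, trimmed-down transport showing only the ownership rule.
public sealed class OwnershipAwareTransportSketch : IAsyncDisposable
{
    private readonly bool _leaveOpen;
    private Stream? _stream;

    public OwnershipAwareTransportSketch(Stream stream, bool leaveOpen)
    {
        _stream = stream ?? throw new ArgumentNullException(nameof(stream));
        _leaveOpen = leaveOpen;
    }

    public bool IsAttached => _stream is not null;

    public ValueTask ResetConnectionAsync() => DetachAsync();

    public ValueTask DisposeAsync() => DetachAsync();

    // Reset and dispose share this path, so both honor leaveOpen the same way.
    private async ValueTask DetachAsync()
    {
        var stream = _stream;
        _stream = null; // always detach so later operations cannot reuse the old stream

        if (stream is not null && !_leaveOpen)
        {
            await stream.DisposeAsync().ConfigureAwait(false);
        }
    }
}
```

In both modes the stream is detached. Only disposal depends on leaveOpen.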

Step 4: Run test to verify it passes

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkTcpTransportTests -v minimal
Expected: PASS

Step 5: Commit

git add /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/Transport/SuiteLinkTcpTransport.cs /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/Transport/ISuiteLinkReconnectableTransport.cs /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/Transport/SuiteLinkTcpTransportTests.cs
git commit -m "fix: preserve transport ownership during reconnect reset"

Task 6: Add Update Source Metadata

Files:

  • Create: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkUpdateSource.cs
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkTagUpdate.cs
  • Test: /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkValueTests.cs

Step 1: Write the failing test

[Fact]
public void TagUpdate_DefaultSource_IsLive()
{
    var update = new SuiteLinkTagUpdate(
        "Pump001.Run",
        1,
        SuiteLinkValue.FromBoolean(true),
        0x00C0,
        1,
        DateTimeOffset.UtcNow);

    Assert.Equal(SuiteLinkUpdateSource.Live, update.Source);
}

Step 2: Run test to verify it fails

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter TagUpdate_DefaultSource_IsLive -v minimal
Expected: FAIL because source metadata does not exist

Step 3: Write minimal implementation

Create:

public enum SuiteLinkUpdateSource
{
    Live = 0,
    CatchUpReplay = 1
}

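Add Source to SuiteLinkTagUpdate with default:

SuiteLinkUpdateSource.Live

Expose it as an optional trailing constructor parameter or an init-only property. Existing six-argument constructions, including the test above, then keep compiling.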
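The sketch below shows the backward-compatible default, assuming SuiteLinkTagUpdate is a positional record. TagUpdateSketch is a hypothetical, reduced stand-in with made-up field names. It demonstrates two points. First, the trailing optional Source parameter lets existing call sites omit the argument. Second, a with expression lets the catch-up path derive a CatchUpReplay copy. If SuiteLinkTagUpdate is a plain class, an optional constructor parameter gives the same call-site behavior.

```csharp
using System;

public enum SuiteLinkUpdateSource
{
    Live = 0,
    CatchUpReplay = 1
}

// Hypothetical, reduced update record; the real SuiteLinkTagUpdate has more fields.
// Source is a trailing optional parameter, so call sites that omit it get Live.
public sealed record class TagUpdateSketch(
    string ItemName,
    object Value,
    DateTimeOffset ReceivedAt,
    SuiteLinkUpdateSource Source = SuiteLinkUpdateSource.Live)
{
    // Derives a synthetic catch-up copy; all other fields are preserved.
    public TagUpdateSketch AsCatchUpReplay() => this with { Source = SuiteLinkUpdateSource.CatchUpReplay };
}
```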

Step 4: Run test to verify it passes

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkValueTests -v minimal
Expected: PASS

Step 5: Commit

git add /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkUpdateSource.cs /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkTagUpdate.cs /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkValueTests.cs
git commit -m "feat: add update source metadata"

Task 7: Add Best-Effort Catch-Up Refresh Execution

Files:

  • Modify: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkClient.cs
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/Internal/SubscriptionRegistrationEntry.cs
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkClientReconnectTests.cs

Step 1: Write the failing test

[Fact]
public async Task Reconnect_WithRefreshLatestValue_CanDispatchCatchUpReplay()
{
    SuiteLinkTagUpdate? catchUp = null;
    var client = CreateReconnectReplayClient(
        catchUpPolicy: SuiteLinkCatchUpPolicy.RefreshLatestValue,
        onUpdate: update =>
        {
            if (update.Source == SuiteLinkUpdateSource.CatchUpReplay)
            {
                catchUp = update;
            }
        });

    await client.ConnectAsync(CreateOptionsWithCatchUp());

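    // Catch-up runs after the reconnect completes, so wait for it instead of asserting immediately.
    await Eventually.AssertAsync(() => Assert.NotNull(catchUp));
    Assert.Equal(SuiteLinkUpdateSource.CatchUpReplay, catchUp!.Source);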
}

Step 2: Run test to verify it fails

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter Reconnect_WithRefreshLatestValue_CanDispatchCatchUpReplay -v minimal
Expected: FAIL because reconnect only resumes live dispatch today

Step 3: Write minimal implementation

After a successful reconnect and durable subscription replay, if Runtime.CatchUpPolicy == SuiteLinkCatchUpPolicy.RefreshLatestValue:

  • run a sequential refresh pass over the durable subscriptions
  • obtain one fresh value per item using the existing temporary-read machinery or a dedicated internal refresh path
  • dispatch each value as a synthetic update with Source = SuiteLinkUpdateSource.CatchUpReplay

Do not fail the reconnect if an individual item refresh fails or times out.
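The sketch below shows the catch-up pass under these assumptions:

  • it runs only after live subscriptions are restored
  • it does nothing unless the policy is RefreshLatestValue
  • it refreshes items one at a time
  • it skips any item whose refresh throws

CatchUpPassSketch, CatchUpUpdate and the refreshAsync and dispatch delegates are hypothetical stand-ins for the client's subscription entries, SuiteLinkTagUpdate and its refresh path.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public enum SuiteLinkCatchUpPolicy { None = 0, RefreshLatestValue = 1 }
public enum SuiteLinkUpdateSource { Live = 0, CatchUpReplay = 1 }

// Hypothetical, reduced update shape used only by this sketch.
public sealed record class CatchUpUpdate(string ItemName, object Value, SuiteLinkUpdateSource Source);

public static class CatchUpPassSketch
{
    public static async Task RunAsync(
        SuiteLinkCatchUpPolicy policy,
        IReadOnlyList<string> durableItems,
        Func<string, CancellationToken, Task<object>> refreshAsync,
        Action<CatchUpUpdate> dispatch,
        CancellationToken cancellationToken)
    {
        if (policy != SuiteLinkCatchUpPolicy.RefreshLatestValue)
        {
            return;
        }

        // Sequential on purpose: one outstanding refresh at a time.
        foreach (var item in durableItems)
        {
            object value;
            try
            {
                value = await refreshAsync(item, cancellationToken).ConfigureAwait(false);
            }
            catch (Exception) when (!cancellationToken.IsCancellationRequested)
            {
                // Skip this item; the reconnect has already succeeded and stays that way.
                continue;
            }

            // Subscriber callbacks run outside the try, so their exceptions are not swallowed here.
            dispatch(new CatchUpUpdate(item, value, SuiteLinkUpdateSource.CatchUpReplay));
        }
    }
}
```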

Step 4: Run test to verify it passes

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkClientReconnectTests -v minimal
Expected: PASS

Step 5: Commit

git add /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkClient.cs /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/Internal/SubscriptionRegistrationEntry.cs /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkClientReconnectTests.cs
git commit -m "feat: add reconnect catch-up refresh replay"

Task 8: Make Catch-Up Partial Failure Non-Fatal

Files:

  • Modify: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkClient.cs
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkClientReconnectTests.cs

Step 1: Write the failing test

[Fact]
public async Task Reconnect_CatchUpTimeout_DoesNotFailRecoveredSubscriptions()
{
    var client = CreateReconnectReplayClientWithTimedOutRefresh();

    await client.ConnectAsync(CreateOptionsWithCatchUp());

    await Eventually.AssertAsync(() => Assert.True(client.IsConnected));
}

Step 2: Run test to verify it fails

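Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter Reconnect_CatchUpTimeout_DoesNotFailRecoveredSubscriptions -v minimal
Expected: FAIL if a catch-up failure still tears down the reconnect. If the Task 7 implementation already tolerates it, the test passes immediately; keep it as a regression guard.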

Step 3: Write minimal implementation

Wrap each refresh item independently:

  • bound each item with a timeout of Runtime.CatchUpTimeout
  • swallow the per-item failure, optionally after recording an internal debug signal
  • continue to remaining items

Do not change the recovered Ready/Subscribed state.
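The sketch below shows the per-item wrapper, assuming the refresh path accepts a CancellationToken and honors it. Each item gets its own linked CancellationTokenSource with CancelAfter(CatchUpTimeout). A timeout or any other failure yields Refreshed = false, so the loop can continue. Cancellation of the outer shutdown token (for example, on client disposal) still propagates. CatchUpItemTimeoutSketch and TryRefreshAsync are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CatchUpItemTimeoutSketch
{
    // Refreshes one item under its own timeout. A timeout or failure returns
    // Refreshed = false so the caller can move on; only cancellation of the outer
    // shutdownToken is allowed to propagate.
    public static async Task<(bool Refreshed, object? Value)> TryRefreshAsync(
        Func<CancellationToken, Task<object>> refreshAsync,
        TimeSpan catchUpTimeout,
        CancellationToken shutdownToken)
    {
        using var itemCts = CancellationTokenSource.CreateLinkedTokenSource(shutdownToken);
        itemCts.CancelAfter(catchUpTimeout);

        try
        {
            return (true, await refreshAsync(itemCts.Token).ConfigureAwait(false));
        }
        catch (OperationCanceledException) when (!shutdownToken.IsCancellationRequested)
        {
            // Per-item timeout: optionally record a debug signal here, then skip the item.
            return (false, null);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Any other per-item failure is non-fatal as well.
            return (false, null);
        }
    }
}
```

If the refresh path ignores its token, CancelAfter cannot stop it. In that case, Task.WaitAsync(TimeSpan, CancellationToken) can bound the wait instead, at the cost of leaving the abandoned operation running.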

Step 4: Run test to verify it passes

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkClientReconnectTests -v minimal
Expected: PASS

Step 5: Commit

git add /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/SuiteLinkClient.cs /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/SuiteLinkClientReconnectTests.cs
git commit -m "feat: tolerate partial catch-up refresh failures"

Task 9: Add Jitter Coverage Without Flaky Tests

Files:

  • Modify: /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/Internal/SuiteLinkRetryDelayCalculator.cs
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/Internal/SuiteLinkRetryDelayCalculatorTests.cs

Step 1: Write the failing test

[Fact]
public void GetDelay_WithJitterEnabled_StaysWithinCap()
{
    var policy = new SuiteLinkRetryPolicy(
        InitialDelay: TimeSpan.FromSeconds(2),
        Multiplier: 2.0,
        MaxDelay: TimeSpan.FromSeconds(10),
        UseJitter: true);

    // Attempt 5: base delay is 2s * 2^4 = 32s, so the 10s cap applies before jitter.
    var delay = SuiteLinkRetryDelayCalculator.GetDelay(policy, 5, () => 0.99);

    Assert.InRange(delay, TimeSpan.Zero, TimeSpan.FromSeconds(10));
}

Step 2: Run test to verify it fails

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkRetryDelayCalculatorTests -v minimal
Expected: FAIL because jitter injection does not exist yet

Step 3: Write minimal implementation

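Extend GetDelay with an optional injected random source. This replaces the two-parameter signature, and existing two-argument calls still compile:

public static TimeSpan GetDelay(SuiteLinkRetryPolicy policy, int attempt, Func<double>? nextDouble = null)

When jitter is enabled:

  • compute the bounded base delay as in Task 2
  • apply jitter using the value returned by nextDouble, falling back to a shared random source when it is null, so tests can inject a fixed value
  • keep the final value within [0, MaxDelay]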
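The sketch below adds jitter to the calculator using the "full jitter" strategy: the capped base delay is scaled by a sample in [0, 1]. The plan does not name a jitter formula, so this choice is an assumption. The fallback to Random.Shared.NextDouble (thread-safe, .NET 6+) when no nextDouble is injected is also an assumption. With an injected () => 0.5, attempt 3 of the Task 9 policy (base 2s * 2^2 = 8s) yields 4s.

```csharp
using System;

public sealed record class SuiteLinkRetryPolicy(
    TimeSpan InitialDelay, double Multiplier, TimeSpan MaxDelay,
    int? MaxAttempts = null, bool UseJitter = true);

internal static class SuiteLinkRetryDelayCalculator
{
    public static TimeSpan GetDelay(SuiteLinkRetryPolicy policy, int attempt, Func<double>? nextDouble = null)
    {
        if (attempt <= 0)
        {
            return TimeSpan.Zero; // the immediate first retry is never jittered
        }

        var capSeconds = policy.MaxDelay.TotalSeconds;
        var baseSeconds = Math.Min(
            policy.InitialDelay.TotalSeconds * Math.Pow(policy.Multiplier, attempt - 1),
            capSeconds);

        if (!policy.UseJitter)
        {
            return TimeSpan.FromSeconds(baseSeconds);
        }

        // Full jitter (assumed strategy): scale the capped base by a sample in [0, 1].
        // Tests inject nextDouble; otherwise fall back to the thread-safe Random.Shared.
        var sample = nextDouble is not null ? nextDouble() : Random.Shared.NextDouble();
        var jitteredSeconds = baseSeconds * Math.Clamp(sample, 0.0, 1.0);
        return TimeSpan.FromSeconds(Math.Clamp(jitteredSeconds, 0.0, capSeconds));
    }
}
```

Full jitter spreads simultaneous reconnects widely but can produce near-zero delays. An "equal jitter" variant (half the base plus a random share of the other half) keeps a minimum backoff if that matters more.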

Step 4: Run test to verify it passes

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx --filter SuiteLinkRetryDelayCalculatorTests -v minimal
Expected: PASS

Step 5: Commit

git add /Users/dohertj2/Desktop/suitelinkclient/src/SuiteLink.Client/Internal/SuiteLinkRetryDelayCalculator.cs /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.Tests/Internal/SuiteLinkRetryDelayCalculatorTests.cs
git commit -m "feat: add deterministic jitter coverage for retry policy"

Task 10: Update Documentation And Final Verification

Files:

  • Modify: /Users/dohertj2/Desktop/suitelinkclient/README.md
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.IntegrationTests/README.md
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/docs/plans/2026-03-17-catchup-retry-design.md
  • Modify: /Users/dohertj2/Desktop/suitelinkclient/docs/plans/2026-03-17-catchup-retry-implementation-plan.md

Step 1: Write the documentation diff

Document:

  • catch-up mode is latest-value refresh only
  • retry policy is configurable and jittered by default
  • reconnect success is separate from best-effort catch-up completion
  • writes still fail during reconnect

Step 2: Run targeted verification

Run: rg -n -i "catch-up|retry|reconnect|jitter|refresh latest" /Users/dohertj2/Desktop/suitelinkclient/README.md /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.IntegrationTests/README.md
Expected: matches in both files that reflect the updated wording

Step 3: Run full verification

Run: dotnet test /Users/dohertj2/Desktop/suitelinkclient/SuiteLink.Client.slnx -v minimal
Expected: PASS

Step 4: Commit

git add /Users/dohertj2/Desktop/suitelinkclient/README.md /Users/dohertj2/Desktop/suitelinkclient/tests/SuiteLink.Client.IntegrationTests/README.md /Users/dohertj2/Desktop/suitelinkclient/docs/plans/2026-03-17-catchup-retry-design.md /Users/dohertj2/Desktop/suitelinkclient/docs/plans/2026-03-17-catchup-retry-implementation-plan.md
git commit -m "docs: describe catch-up replay and retry policy"