fix(db): classify non-SqlException DB outages as transient; propagate cancellation (#7)

ExecuteWriteAsync only caught SqlException, so a live outage surfacing as
InvalidOperationException/SocketException/IOException/TimeoutException escaped
unclassified and crashed the script actor instead of buffering. Mirror the HTTP
path: propagate OperationCanceledException on cancellation, classify transport
exceptions as transient (buffer+retry), let unexpected exceptions propagate.
Joseph Doherty
2026-06-15 14:03:25 -04:00
parent d05270640d
commit de375ff7ea
4 changed files with 335 additions and 21 deletions
@@ -1,3 +1,6 @@
using System.Data.Common;
using System.IO;
using System.Net.Sockets;
using Microsoft.Data.SqlClient;
namespace ZB.MOM.WW.ScadaBridge.ExternalSystemGateway;
@@ -84,6 +87,60 @@ public static class SqlErrorClassifier
        return IsTransient(exception.Number);
    }
    /// <summary>
    /// Determines whether an arbitrary <see cref="Exception"/> represents a
    /// transient database failure — the SQL-path parallel of
    /// <see cref="ErrorClassifier.IsTransient(System.Exception)"/> on the HTTP path.
    /// </summary>
    /// <remarks>
    /// <para>
    /// A live DB outage does not always surface as a <see cref="SqlException"/>:
    /// once the underlying connection / socket is torn down, the driver raises
    /// transport-level exceptions instead. These are <b>retryable</b> — a retry
    /// can succeed once the server is reachable again — so they are classified
    /// transient (buffered to store-and-forward) rather than escaping unclassified
    /// to crash the calling Script Execution Actor. The transient set:
    /// </para>
    /// <list type="bullet">
    /// <item><see cref="InvalidOperationException"/> — connection-state error (e.g. "the connection is not open" / pooled connection broken).</item>
    /// <item><see cref="IOException"/> — transport read/write failure mid-session.</item>
    /// <item><see cref="SocketException"/> — TCP-level failure (connection refused/reset/timed out).</item>
    /// <item><see cref="TimeoutException"/> — command / connection timeout surfaced as a CLR <see cref="TimeoutException"/>.</item>
    /// <item><see cref="TaskCanceledException"/> — driver-level cancellation/timeout NOT tied to a caller token (the caller-token case is handled before classification — see the gateway's ordered catches).</item>
    /// <item>Any <see cref="DbException"/> that is NOT a <see cref="SqlException"/> — a provider/driver transport error (a real <see cref="SqlException"/> is classified by error number via the overloads above, never here).</item>
    /// </list>
    /// <para>
    /// <b>Everything else is NOT transient</b> and must propagate, exactly as the
    /// HTTP path lets genuinely-unexpected exceptions escape past its
    /// <c>catch (Exception ex) when (ErrorClassifier.IsTransient(ex))</c> filter.
    /// Authoring bugs (<see cref="ArgumentException"/>, <see cref="NullReferenceException"/>,
    /// etc.) are loud, fixable failures — silently buffering and retrying them
    /// forever would hide the bug.
    /// </para>
    /// </remarks>
    /// <param name="exception">The exception to classify.</param>
    /// <returns><see langword="true"/> for a transport/connection/timeout/driver exception; otherwise <see langword="false"/>.</returns>
    public static bool IsTransient(Exception exception)
    {
        ArgumentNullException.ThrowIfNull(exception);
        // A real SqlException is classified by its error number (the overloads
        // above), never by type — fall back to the number-based policy so an
        // unknown SqlException stays permanent (fail-fast) rather than being
        // swept up as transient by the DbException catch-all below.
        if (exception is SqlException sql)
        {
            return IsTransient(sql);
        }
        return exception is InvalidOperationException
            or IOException
            or SocketException
            or TimeoutException
            or TaskCanceledException
            or DbException; // any non-SqlException DbException (SqlException handled above)
    }
    /// <summary>
    /// Classifies a <see cref="SqlException"/> and rethrows it as the matching
    /// strongly-typed failure: <see cref="TransientDatabaseException"/> for a