Files
scadalink-design/deprecated/lmxproxy/docs/plans/lmxproxy-stale-session-fix.md
Joseph Doherty 9dccf8e72f deprecate(lmxproxy): move all LmxProxy code, tests, and docs to deprecated/
LmxProxy is no longer needed. Moved the entire lmxproxy/ workspace, DCL
adapter files, and related docs to deprecated/. Removed LmxProxy registration
from DataConnectionFactory, project reference from DCL, protocol option from
UI, and cleaned up all requirement docs.
2026-04-08 15:56:23 -04:00

186 lines
7.6 KiB
Markdown

# LmxProxy Stale Session Subscription Leak Fix
## Problem
When a gRPC client disconnects abruptly, Grpc.Core (the C-core library used by the .NET Framework 4.8 server) does not reliably fire the `ServerCallContext.CancellationToken`. This means:
1. The `Subscribe` RPC in `ScadaGrpcService` blocks forever on `reader.WaitToReadAsync(context.CancellationToken)` (line 368)
2. The `finally` block with `_subscriptionManager.UnsubscribeClient(request.SessionId)` never runs
3. The `ct.Register(() => UnsubscribeClient(clientId))` in `SubscriptionManager.SubscribeAsync` also never fires (same token)
4. The old session's subscriptions leak in `SubscriptionManager._clientSubscriptions` and `_tagSubscriptions`
When the client reconnects with a new session ID, it creates duplicate subscriptions. Tags aren't cleaned up because they still have a ref-count from the leaked old session. Over time, client count grows and tag subscriptions accumulate.
The `SessionManager` does scavenge inactive sessions after 5 minutes, but it only removes the session from its own dictionary — it doesn't notify `SubscriptionManager` to clean up subscriptions.
## Fix
Bridge `SessionManager` scavenging to `SubscriptionManager` cleanup. When a session is scavenged due to inactivity, also call `SubscriptionManager.UnsubscribeClient()`.
### Step 1: Add cleanup callback to SessionManager
File: `src/ZB.MOM.WW.LmxProxy.Host/Sessions/SessionManager.cs`
Add a callback field and expose it:
```csharp
// Add after the _inactivityTimeout field (line 22)
private Action<string>? _onSessionScavenged;
/// <summary>
/// Register a callback invoked when a session is scavenged due to inactivity.
/// The callback receives the session ID.
/// </summary>
public void OnSessionScavenged(Action<string> callback)
{
_onSessionScavenged = callback;
}
```
Then in `ScavengeInactiveSessions`, invoke the callback for each scavenged session:
```csharp
// In ScavengeInactiveSessions (line 103-118), change the foreach to:
foreach (var kvp in expired)
{
if (_sessions.TryRemove(kvp.Key, out _))
{
Log.Information("Session {SessionId} scavenged (inactive since {LastActivity})",
kvp.Key, kvp.Value.LastActivity);
// Notify subscriber cleanup
try
{
_onSessionScavenged?.Invoke(kvp.Key);
}
catch (Exception ex)
{
Log.Warning(ex, "Error in session scavenge callback for {SessionId}", kvp.Key);
}
}
}
```
### Step 2: Wire up the callback in LmxProxyService
File: `src/ZB.MOM.WW.LmxProxy.Host/LmxProxyService.cs`
After both `SessionManager` and `SubscriptionManager` are created, register the callback:
```csharp
// Add after SubscriptionManager creation:
_sessionManager.OnSessionScavenged(sessionId =>
{
Log.Information("Cleaning up subscriptions for scavenged session {SessionId}", sessionId);
_subscriptionManager.UnsubscribeClient(sessionId);
});
```
Find where `_sessionManager` and `_subscriptionManager` are both initialized and add this line right after.
### Step 3: Also clean up on explicit Disconnect
This is already handled — `ScadaGrpcService.Disconnect()` (line 86) calls `_subscriptionManager.UnsubscribeClient(request.SessionId)` before terminating the session. No change needed.
### Step 4: Add proactive stream timeout (belt-and-suspenders)
The scavenger runs every 60 seconds with a 5-minute timeout. This means a leaked session could take up to 6 minutes to clean up. For faster detection, add a secondary timeout in the Subscribe RPC itself.
File: `src/ZB.MOM.WW.LmxProxy.Host/Grpc/Services/ScadaGrpcService.cs`
In the `Subscribe` method, replace the simple `context.CancellationToken` with a combined token that also expires if the session becomes invalid:
```csharp
// Replace the Subscribe method (lines 353-390) with:
public override async Task Subscribe(
Scada.SubscribeRequest request,
IServerStreamWriter<Scada.VtqMessage> responseStream,
ServerCallContext context)
{
if (!_sessionManager.ValidateSession(request.SessionId))
{
throw new RpcException(new GrpcStatus(StatusCode.Unauthenticated, "Invalid session"));
}
var reader = await _subscriptionManager.SubscribeAsync(
request.SessionId, request.Tags, context.CancellationToken);
try
{
// Use a combined approach: check both the gRPC cancellation token AND
// periodic session validity. This works around Grpc.Core not reliably
// firing CancellationToken on client disconnect.
while (true)
{
// Wait for data with a timeout so we can periodically check session validity
using var timeoutCts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(
context.CancellationToken, timeoutCts.Token);
bool hasData;
try
{
hasData = await reader.WaitToReadAsync(linkedCts.Token);
}
catch (OperationCanceledException) when (timeoutCts.IsCancellationRequested
&& !context.CancellationToken.IsCancellationRequested)
{
// Timeout expired, not a client disconnect — check if session is still valid
if (!_sessionManager.ValidateSession(request.SessionId))
{
Log.Information("Subscribe stream ending — session {SessionId} no longer valid",
request.SessionId);
break;
}
continue; // Session still valid, keep waiting
}
if (!hasData) break; // Channel completed
while (reader.TryRead(out var item))
{
var protoVtq = ConvertToProtoVtq(item.address, item.vtq);
await responseStream.WriteAsync(protoVtq);
}
}
}
catch (OperationCanceledException)
{
// Client disconnected -- normal
}
catch (Exception ex)
{
Log.Error(ex, "Subscribe stream error for session {SessionId}", request.SessionId);
throw new RpcException(new GrpcStatus(StatusCode.Internal, ex.Message));
}
finally
{
_subscriptionManager.UnsubscribeClient(request.SessionId);
}
}
```
This adds a 30-second periodic check: if no data arrives for 30 seconds, it checks whether the session is still valid. If the session was scavenged (client disconnected, 5-min timeout), the stream exits cleanly and runs the `finally` cleanup.
## Summary of Changes
| File | Change |
|------|--------|
| `Sessions/SessionManager.cs` | Add `_onSessionScavenged` callback, invoke during `ScavengeInactiveSessions` |
| `LmxProxyService.cs` | Wire `_sessionManager.OnSessionScavenged` to `_subscriptionManager.UnsubscribeClient` |
| `Grpc/Services/ScadaGrpcService.cs` | Add 30-second periodic session validity check in `Subscribe` loop |
## Testing
1. Start LmxProxy server
2. Connect a client and subscribe to tags
3. Kill the client process abruptly (not a clean disconnect)
4. Check status page — client count should still show the old session
5. Wait up to 5 minutes — session should be scavenged, subscription count should drop
6. Reconnect client — should get a clean new session, no duplicate subscriptions
7. Verify tag subscription counts match expected (no leaked refs)
## Optional: Reduce scavenge timeout for faster cleanup
In `LmxProxyService.cs` where `SessionManager` is constructed, consider reducing `inactivityTimeoutMinutes` from 5 to 2, since the Subscribe RPC now has its own 30-second validity check. The 5-minute timeout was the only cleanup path before; now it's belt-and-suspenders.