Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
40
src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Ipc/PipeAcl.cs
Normal file
40
src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Ipc/PipeAcl.cs
Normal file
@@ -0,0 +1,40 @@
|
||||
using System;
|
||||
using System.IO.Pipes;
|
||||
using System.Security.AccessControl;
|
||||
using System.Security.Principal;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Ipc;
|
||||
|
||||
/// <summary>
|
||||
/// Builds the <see cref="PipeSecurity"/> required by <c>driver-stability.md §"IPC Security"</c>:
|
||||
/// only the configured OtOpcUa server principal SID gets <c>ReadWrite | Synchronize</c>;
|
||||
/// LocalSystem and Administrators are explicitly denied. Any other authenticated user falls
|
||||
/// through to the implicit deny.
|
||||
/// </summary>
|
||||
public static class PipeAcl
|
||||
{
|
||||
public static PipeSecurity Create(SecurityIdentifier allowedSid)
|
||||
{
|
||||
if (allowedSid is null) throw new ArgumentNullException(nameof(allowedSid));
|
||||
|
||||
var security = new PipeSecurity();
|
||||
|
||||
security.AddAccessRule(new PipeAccessRule(
|
||||
allowedSid,
|
||||
PipeAccessRights.ReadWrite | PipeAccessRights.Synchronize,
|
||||
AccessControlType.Allow));
|
||||
|
||||
var localSystem = new SecurityIdentifier(WellKnownSidType.LocalSystemSid, null);
|
||||
var admins = new SecurityIdentifier(WellKnownSidType.BuiltinAdministratorsSid, null);
|
||||
|
||||
if (allowedSid != localSystem)
|
||||
security.AddAccessRule(new PipeAccessRule(localSystem, PipeAccessRights.FullControl, AccessControlType.Deny));
|
||||
if (allowedSid != admins)
|
||||
security.AddAccessRule(new PipeAccessRule(admins, PipeAccessRights.FullControl, AccessControlType.Deny));
|
||||
|
||||
// Owner = allowed SID so the deny rules can't be removed without write-DACL rights.
|
||||
security.SetOwner(allowedSid);
|
||||
|
||||
return security;
|
||||
}
|
||||
}
|
||||
160
src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Ipc/PipeServer.cs
Normal file
160
src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Ipc/PipeServer.cs
Normal file
@@ -0,0 +1,160 @@
|
||||
using System;
|
||||
using System.IO.Pipes;
|
||||
using System.Security.Principal;
|
||||
using System.Threading;
|
||||
using System.Threading.Tasks;
|
||||
using MessagePack;
|
||||
using Serilog;
|
||||
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared;
|
||||
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared.Contracts;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Ipc;
|
||||
|
||||
/// <summary>
|
||||
/// Accepts one client connection at a time on a named pipe with the strict ACL from
|
||||
/// <see cref="PipeAcl"/>. Verifies the peer SID and the per-process shared secret before any
|
||||
/// RPC frame is accepted. Per <c>driver-stability.md §"IPC Security"</c>.
|
||||
/// </summary>
|
||||
public sealed class PipeServer : IDisposable
|
||||
{
|
||||
private readonly string _pipeName;
|
||||
private readonly SecurityIdentifier _allowedSid;
|
||||
private readonly string _sharedSecret;
|
||||
private readonly ILogger _logger;
|
||||
private readonly CancellationTokenSource _cts = new();
|
||||
private NamedPipeServerStream? _current;
|
||||
|
||||
public PipeServer(string pipeName, SecurityIdentifier allowedSid, string sharedSecret, ILogger logger)
|
||||
{
|
||||
_pipeName = pipeName ?? throw new ArgumentNullException(nameof(pipeName));
|
||||
_allowedSid = allowedSid ?? throw new ArgumentNullException(nameof(allowedSid));
|
||||
_sharedSecret = sharedSecret ?? throw new ArgumentNullException(nameof(sharedSecret));
|
||||
_logger = logger ?? throw new ArgumentNullException(nameof(logger));
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Accepts one connection, performs Hello handshake, then dispatches frames to
|
||||
/// <paramref name="handler"/> until EOF or cancel. Returns when the client disconnects.
|
||||
/// </summary>
|
||||
public async Task RunOneConnectionAsync(IFrameHandler handler, CancellationToken ct)
|
||||
{
|
||||
using var linked = CancellationTokenSource.CreateLinkedTokenSource(_cts.Token, ct);
|
||||
var acl = PipeAcl.Create(_allowedSid);
|
||||
|
||||
// .NET Framework 4.8 uses the legacy constructor overload that takes a PipeSecurity directly.
|
||||
_current = new NamedPipeServerStream(
|
||||
_pipeName,
|
||||
PipeDirection.InOut,
|
||||
maxNumberOfServerInstances: 1,
|
||||
PipeTransmissionMode.Byte,
|
||||
PipeOptions.Asynchronous,
|
||||
inBufferSize: 64 * 1024,
|
||||
outBufferSize: 64 * 1024,
|
||||
pipeSecurity: acl);
|
||||
|
||||
try
|
||||
{
|
||||
await _current.WaitForConnectionAsync(linked.Token).ConfigureAwait(false);
|
||||
|
||||
if (!VerifyCaller(_current, out var reason))
|
||||
{
|
||||
_logger.Warning("IPC caller rejected: {Reason}", reason);
|
||||
_current.Disconnect();
|
||||
return;
|
||||
}
|
||||
|
||||
using var reader = new FrameReader(_current, leaveOpen: true);
|
||||
using var writer = new FrameWriter(_current, leaveOpen: true);
|
||||
|
||||
// First frame must be a Hello with the correct shared secret.
|
||||
var first = await reader.ReadFrameAsync(linked.Token).ConfigureAwait(false);
|
||||
if (first is null || first.Value.Kind != MessageKind.Hello)
|
||||
{
|
||||
_logger.Warning("IPC first frame was not Hello; dropping");
|
||||
return;
|
||||
}
|
||||
|
||||
var hello = MessagePackSerializer.Deserialize<Hello>(first.Value.Body);
|
||||
if (!string.Equals(hello.SharedSecret, _sharedSecret, StringComparison.Ordinal))
|
||||
{
|
||||
await writer.WriteAsync(MessageKind.HelloAck,
|
||||
new HelloAck { Accepted = false, RejectReason = "shared-secret-mismatch" },
|
||||
linked.Token).ConfigureAwait(false);
|
||||
_logger.Warning("IPC Hello rejected: shared-secret-mismatch");
|
||||
return;
|
||||
}
|
||||
|
||||
if (hello.ProtocolMajor != Hello.CurrentMajor)
|
||||
{
|
||||
await writer.WriteAsync(MessageKind.HelloAck,
|
||||
new HelloAck { Accepted = false, RejectReason = $"major-version-mismatch-peer={hello.ProtocolMajor}-server={Hello.CurrentMajor}" },
|
||||
linked.Token).ConfigureAwait(false);
|
||||
_logger.Warning("IPC Hello rejected: major mismatch peer={Peer} server={Server}",
|
||||
hello.ProtocolMajor, Hello.CurrentMajor);
|
||||
return;
|
||||
}
|
||||
|
||||
await writer.WriteAsync(MessageKind.HelloAck,
|
||||
new HelloAck { Accepted = true, HostName = Environment.MachineName },
|
||||
linked.Token).ConfigureAwait(false);
|
||||
|
||||
while (!linked.Token.IsCancellationRequested)
|
||||
{
|
||||
var frame = await reader.ReadFrameAsync(linked.Token).ConfigureAwait(false);
|
||||
if (frame is null) break;
|
||||
|
||||
await handler.HandleAsync(frame.Value.Kind, frame.Value.Body, writer, linked.Token).ConfigureAwait(false);
|
||||
}
|
||||
}
|
||||
finally
|
||||
{
|
||||
_current.Dispose();
|
||||
_current = null;
|
||||
}
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Runs the server continuously, handling one connection at a time. When a connection ends
|
||||
/// (clean or error), accepts the next.
|
||||
/// </summary>
|
||||
public async Task RunAsync(IFrameHandler handler, CancellationToken ct)
|
||||
{
|
||||
while (!ct.IsCancellationRequested)
|
||||
{
|
||||
try { await RunOneConnectionAsync(handler, ct).ConfigureAwait(false); }
|
||||
catch (OperationCanceledException) { break; }
|
||||
catch (Exception ex) { _logger.Error(ex, "IPC connection loop error — accepting next"); }
|
||||
}
|
||||
}
|
||||
|
||||
private bool VerifyCaller(NamedPipeServerStream pipe, out string reason)
|
||||
{
|
||||
try
|
||||
{
|
||||
pipe.RunAsClient(() =>
|
||||
{
|
||||
using var wi = WindowsIdentity.GetCurrent();
|
||||
if (wi.User is null)
|
||||
throw new InvalidOperationException("GetCurrent().User is null — cannot verify caller");
|
||||
if (wi.User != _allowedSid)
|
||||
throw new UnauthorizedAccessException(
|
||||
$"caller SID {wi.User.Value} does not match allowed {_allowedSid.Value}");
|
||||
});
|
||||
reason = string.Empty;
|
||||
return true;
|
||||
}
|
||||
catch (Exception ex) { reason = ex.Message; return false; }
|
||||
}
|
||||
|
||||
public void Dispose()
|
||||
{
|
||||
_cts.Cancel();
|
||||
_current?.Dispose();
|
||||
_cts.Dispose();
|
||||
}
|
||||
}
|
||||
|
||||
public interface IFrameHandler
|
||||
{
|
||||
Task HandleAsync(MessageKind kind, byte[] body, FrameWriter writer, CancellationToken ct);
|
||||
}
|
||||
@@ -0,0 +1,30 @@
|
||||
using System.Threading;
|
||||
using System.Threading.Tasks;
|
||||
using MessagePack;
|
||||
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared;
|
||||
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared.Contracts;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Ipc;
|
||||
|
||||
/// <summary>
|
||||
/// Placeholder handler that responds to the framed IPC with error responses. Replaced by the
|
||||
/// real Galaxy-backed handler when the MXAccess code move (deferred) lands.
|
||||
/// </summary>
|
||||
public sealed class StubFrameHandler : IFrameHandler
|
||||
{
|
||||
public Task HandleAsync(MessageKind kind, byte[] body, FrameWriter writer, CancellationToken ct)
|
||||
{
|
||||
// Minimal lifecycle: heartbeat ack keeps the supervisor's liveness detector happy even
|
||||
// while the data-plane is stubbed, so integration tests of the supervisor can run end-to-end.
|
||||
if (kind == MessageKind.Heartbeat)
|
||||
{
|
||||
var hb = MessagePackSerializer.Deserialize<Heartbeat>(body);
|
||||
return writer.WriteAsync(MessageKind.HeartbeatAck,
|
||||
new HeartbeatAck { SequenceNumber = hb.SequenceNumber, UtcUnixMs = hb.UtcUnixMs }, ct);
|
||||
}
|
||||
|
||||
return writer.WriteAsync(MessageKind.ErrorResponse,
|
||||
new ErrorResponse { Code = "not-implemented", Message = $"Kind {kind} is stubbed — MXAccess lift deferred" },
|
||||
ct);
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,5 @@
|
||||
// Shim — .NET Framework 4.8 doesn't ship with IsExternalInit, required for init-only setters +
|
||||
// positional records. Safe to add in our own namespace; the compiler accepts any type with this name.
|
||||
namespace System.Runtime.CompilerServices;
|
||||
|
||||
internal static class IsExternalInit;
|
||||
54
src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Program.cs
Normal file
54
src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Program.cs
Normal file
@@ -0,0 +1,54 @@
|
||||
using System;
|
||||
using System.Security.Principal;
|
||||
using System.Threading;
|
||||
using Serilog;
|
||||
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Ipc;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host;
|
||||
|
||||
/// <summary>
|
||||
/// Entry point for the <c>OtOpcUaGalaxyHost</c> Windows service / console host. Reads the
|
||||
/// pipe name, allowed-SID, and shared secret from environment (passed by the supervisor at
|
||||
/// spawn time per <c>driver-stability.md</c>).
|
||||
/// </summary>
|
||||
public static class Program
|
||||
{
|
||||
public static int Main(string[] args)
|
||||
{
|
||||
Log.Logger = new LoggerConfiguration()
|
||||
.MinimumLevel.Information()
|
||||
.WriteTo.File(
|
||||
@"%ProgramData%\OtOpcUa\galaxy-host-.log".Replace("%ProgramData%", Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData)),
|
||||
rollingInterval: RollingInterval.Day)
|
||||
.CreateLogger();
|
||||
|
||||
try
|
||||
{
|
||||
var pipeName = Environment.GetEnvironmentVariable("OTOPCUA_GALAXY_PIPE") ?? "OtOpcUaGalaxy";
|
||||
var allowedSidValue = Environment.GetEnvironmentVariable("OTOPCUA_ALLOWED_SID")
|
||||
?? throw new InvalidOperationException("OTOPCUA_ALLOWED_SID not set — supervisor must pass the server principal SID");
|
||||
var sharedSecret = Environment.GetEnvironmentVariable("OTOPCUA_GALAXY_SECRET")
|
||||
?? throw new InvalidOperationException("OTOPCUA_GALAXY_SECRET not set — supervisor must pass the per-process secret at spawn time");
|
||||
|
||||
var allowedSid = new SecurityIdentifier(allowedSidValue);
|
||||
|
||||
using var server = new PipeServer(pipeName, allowedSid, sharedSecret, Log.Logger);
|
||||
using var cts = new CancellationTokenSource();
|
||||
Console.CancelKeyPress += (_, e) => { e.Cancel = true; cts.Cancel(); };
|
||||
|
||||
Log.Information("OtOpcUaGalaxyHost starting — pipe={Pipe} allowedSid={Sid}", pipeName, allowedSidValue);
|
||||
|
||||
var handler = new StubFrameHandler();
|
||||
server.RunAsync(handler, cts.Token).GetAwaiter().GetResult();
|
||||
|
||||
Log.Information("OtOpcUaGalaxyHost stopped cleanly");
|
||||
return 0;
|
||||
}
|
||||
catch (Exception ex)
|
||||
{
|
||||
Log.Fatal(ex, "OtOpcUaGalaxyHost fatal");
|
||||
return 2;
|
||||
}
|
||||
finally { Log.CloseAndFlush(); }
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,58 @@
|
||||
using System;
|
||||
using System.Runtime.ConstrainedExecution;
|
||||
using System.Runtime.InteropServices;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Sta;
|
||||
|
||||
/// <summary>
|
||||
/// SafeHandle-style lifetime wrapper for an <c>LMXProxyServer</c> COM connection. Per Task B.3
|
||||
/// + decision #65: <see cref="ReleaseHandle"/> must call <c>Marshal.ReleaseComObject</c> until
|
||||
/// refcount = 0, then <c>UnregisterProxy</c>. The finalizer runs as a
|
||||
/// <see cref="CriticalFinalizerObject"/> to honor AppDomain-unload ordering.
|
||||
/// </summary>
|
||||
/// <remarks>
|
||||
/// This scaffold accepts any RCW (tagged as <see cref="object"/>) so we can unit-test the
|
||||
/// release logic with a mock. The concrete wiring to <c>ArchestrA.MxAccess.LMXProxyServer</c>
|
||||
/// lands when the actual Galaxy code moves over (the part deferred to the parity gate).
|
||||
/// </remarks>
|
||||
public sealed class MxAccessHandle : SafeHandle
|
||||
{
|
||||
private object? _comObject;
|
||||
private readonly Action<object>? _unregister;
|
||||
|
||||
public MxAccessHandle(object comObject, Action<object>? unregister = null)
|
||||
: base(IntPtr.Zero, ownsHandle: true)
|
||||
{
|
||||
_comObject = comObject ?? throw new ArgumentNullException(nameof(comObject));
|
||||
_unregister = unregister;
|
||||
|
||||
// The pointer value itself doesn't matter — we're wrapping an RCW, not a native handle.
|
||||
SetHandle(new IntPtr(1));
|
||||
}
|
||||
|
||||
public override bool IsInvalid => handle == IntPtr.Zero;
|
||||
|
||||
public object? RawComObject => _comObject;
|
||||
|
||||
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
|
||||
protected override bool ReleaseHandle()
|
||||
{
|
||||
if (_comObject is null) return true;
|
||||
|
||||
try { _unregister?.Invoke(_comObject); }
|
||||
catch { /* swallow — we're in finalizer/cleanup; log elsewhere */ }
|
||||
|
||||
try
|
||||
{
|
||||
if (Marshal.IsComObject(_comObject))
|
||||
{
|
||||
while (Marshal.ReleaseComObject(_comObject) > 0) { /* loop until fully released */ }
|
||||
}
|
||||
}
|
||||
catch { /* swallow */ }
|
||||
|
||||
_comObject = null;
|
||||
SetHandle(IntPtr.Zero);
|
||||
return true;
|
||||
}
|
||||
}
|
||||
91
src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Sta/StaPump.cs
Normal file
91
src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/Sta/StaPump.cs
Normal file
@@ -0,0 +1,91 @@
|
||||
using System;
|
||||
using System.Collections.Concurrent;
|
||||
using System.Threading;
|
||||
using System.Threading.Tasks;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Sta;
|
||||
|
||||
/// <summary>
|
||||
/// Dedicated STA thread that owns all <c>LMXProxyServer</c> COM instances. Work items are
|
||||
/// posted from any thread and dispatched on the STA. Per <c>driver-stability.md</c> Galaxy
|
||||
/// deep dive §"STA thread + Win32 message pump".
|
||||
/// </summary>
|
||||
/// <remarks>
|
||||
/// Phase 2 scaffold: uses a <see cref="BlockingCollection{T}"/> dispatcher instead of the real
|
||||
/// Win32 <c>GetMessage/DispatchMessage</c> pump. Real pump arrives when the v1 <c>StaComThread</c>
|
||||
/// is lifted — that's part of the deferred Galaxy code move. The apartment state and work
|
||||
/// dispatch semantics are identical so production code can be swapped in without changes.
|
||||
/// </remarks>
|
||||
public sealed class StaPump : IDisposable
|
||||
{
|
||||
private readonly Thread _thread;
|
||||
private readonly BlockingCollection<Action> _workQueue = new(new ConcurrentQueue<Action>());
|
||||
private readonly TaskCompletionSource<bool> _started = new(TaskCreationOptions.RunContinuationsAsynchronously);
|
||||
private volatile bool _disposed;
|
||||
|
||||
public int ThreadId => _thread.ManagedThreadId;
|
||||
public DateTime LastDispatchedUtc { get; private set; } = DateTime.MinValue;
|
||||
public int QueueDepth => _workQueue.Count;
|
||||
|
||||
public StaPump(string name = "Galaxy.Sta")
|
||||
{
|
||||
_thread = new Thread(PumpLoop) { Name = name, IsBackground = true };
|
||||
_thread.SetApartmentState(ApartmentState.STA);
|
||||
_thread.Start();
|
||||
}
|
||||
|
||||
public Task WaitForStartedAsync() => _started.Task;
|
||||
|
||||
/// <summary>Posts a work item; resolves once it's executed on the STA thread.</summary>
|
||||
public Task<T> InvokeAsync<T>(Func<T> work)
|
||||
{
|
||||
if (_disposed) throw new ObjectDisposedException(nameof(StaPump));
|
||||
|
||||
var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
|
||||
_workQueue.Add(() =>
|
||||
{
|
||||
try { tcs.SetResult(work()); }
|
||||
catch (Exception ex) { tcs.SetException(ex); }
|
||||
});
|
||||
return tcs.Task;
|
||||
}
|
||||
|
||||
public Task InvokeAsync(Action work) => InvokeAsync(() => { work(); return 0; });
|
||||
|
||||
/// <summary>
|
||||
/// Health probe — returns true if a no-op work item round-trips within <paramref name="timeout"/>.
|
||||
/// Used by the supervisor; timeout means the pump is wedged and a recycle is warranted.
|
||||
/// </summary>
|
||||
public async Task<bool> IsResponsiveAsync(TimeSpan timeout)
|
||||
{
|
||||
var task = InvokeAsync(() => { });
|
||||
var completed = await Task.WhenAny(task, Task.Delay(timeout)).ConfigureAwait(false);
|
||||
return completed == task;
|
||||
}
|
||||
|
||||
private void PumpLoop()
|
||||
{
|
||||
_started.TrySetResult(true);
|
||||
try
|
||||
{
|
||||
while (!_disposed)
|
||||
{
|
||||
if (_workQueue.TryTake(out var work, Timeout.Infinite))
|
||||
{
|
||||
work();
|
||||
LastDispatchedUtc = DateTime.UtcNow;
|
||||
}
|
||||
}
|
||||
}
|
||||
catch (InvalidOperationException) { /* CompleteAdding called during dispose */ }
|
||||
}
|
||||
|
||||
public void Dispose()
|
||||
{
|
||||
if (_disposed) return;
|
||||
_disposed = true;
|
||||
_workQueue.CompleteAdding();
|
||||
_thread.Join(TimeSpan.FromSeconds(5));
|
||||
_workQueue.Dispose();
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,64 @@
|
||||
using System;
|
||||
using System.Collections.Generic;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Stability;
|
||||
|
||||
/// <summary>
|
||||
/// Galaxy-specific RSS watchdog per <c>driver-stability.md §"Memory Watchdog Thresholds"</c>.
|
||||
/// Baseline-relative + absolute caps. Sustained-slope detection uses a rolling 30-min window.
|
||||
/// Pluggable RSS source keeps it unit-testable.
|
||||
/// </summary>
|
||||
public sealed class MemoryWatchdog
|
||||
{
|
||||
/// <summary>Absolute hard ceiling — process is force-killed above this.</summary>
|
||||
public long HardCeilingBytes { get; init; } = 1_500L * 1024 * 1024;
|
||||
|
||||
/// <summary>Sustained slope (bytes/min) above which soft recycle is scheduled.</summary>
|
||||
public long SustainedSlopeBytesPerMinute { get; init; } = 5L * 1024 * 1024;
|
||||
|
||||
public TimeSpan SlopeWindow { get; init; } = TimeSpan.FromMinutes(30);
|
||||
|
||||
private readonly long _baselineBytes;
|
||||
private readonly Queue<RssSample> _samples = new();
|
||||
|
||||
public MemoryWatchdog(long baselineBytes)
|
||||
{
|
||||
_baselineBytes = baselineBytes;
|
||||
}
|
||||
|
||||
/// <summary>Called every 30s with the current RSS. Returns the action the supervisor should take.</summary>
|
||||
public WatchdogAction Sample(long rssBytes, DateTime utcNow)
|
||||
{
|
||||
_samples.Enqueue(new RssSample(utcNow, rssBytes));
|
||||
while (_samples.Count > 0 && utcNow - _samples.Peek().TimestampUtc > SlopeWindow)
|
||||
_samples.Dequeue();
|
||||
|
||||
if (rssBytes >= HardCeilingBytes)
|
||||
return WatchdogAction.HardKill;
|
||||
|
||||
var softThreshold = Math.Max(_baselineBytes * 2, _baselineBytes + 200L * 1024 * 1024);
|
||||
var warnThreshold = Math.Max((long)(_baselineBytes * 1.5), _baselineBytes + 200L * 1024 * 1024);
|
||||
|
||||
if (rssBytes >= softThreshold) return WatchdogAction.SoftRecycle;
|
||||
if (rssBytes >= warnThreshold) return WatchdogAction.Warn;
|
||||
|
||||
if (_samples.Count >= 2)
|
||||
{
|
||||
var oldest = _samples.Peek();
|
||||
var span = (utcNow - oldest.TimestampUtc).TotalMinutes;
|
||||
if (span >= SlopeWindow.TotalMinutes * 0.9) // need ~full window to trust the slope
|
||||
{
|
||||
var delta = rssBytes - oldest.RssBytes;
|
||||
var bytesPerMin = delta / span;
|
||||
if (bytesPerMin >= SustainedSlopeBytesPerMinute)
|
||||
return WatchdogAction.SoftRecycle;
|
||||
}
|
||||
}
|
||||
|
||||
return WatchdogAction.None;
|
||||
}
|
||||
|
||||
private readonly record struct RssSample(DateTime TimestampUtc, long RssBytes);
|
||||
}
|
||||
|
||||
public enum WatchdogAction { None, Warn, SoftRecycle, HardKill }
|
||||
@@ -0,0 +1,121 @@
|
||||
using System;
|
||||
using System.IO;
|
||||
using System.IO.MemoryMappedFiles;
|
||||
using System.Runtime.InteropServices;
|
||||
using System.Text;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Stability;
|
||||
|
||||
/// <summary>
|
||||
/// Ring-buffer of the last <see cref="Capacity"/> IPC operations, written into a
|
||||
/// memory-mapped file. On hard crash the supervisor reads the MMF after the corpse is gone
|
||||
/// to see what was in flight. Thread-safe for the single-writer, multi-reader pattern.
|
||||
/// </summary>
|
||||
/// <remarks>
|
||||
/// File layout:
|
||||
/// <code>
|
||||
/// [16-byte header: magic(4) | version(4) | capacity(4) | writeIndex(4)]
|
||||
/// [capacity × 256-byte entries: each is [8-byte utcUnixMs | 8-byte opKind | 240-byte UTF-8 message]]
|
||||
/// </code>
|
||||
/// </remarks>
|
||||
public sealed class PostMortemMmf : IDisposable
|
||||
{
|
||||
private const int Magic = 0x4F505043; // 'OPPC'
|
||||
private const int Version = 1;
|
||||
private const int HeaderBytes = 16;
|
||||
public const int EntryBytes = 256;
|
||||
private const int MessageOffset = 16;
|
||||
private const int MessageCapacity = EntryBytes - MessageOffset;
|
||||
|
||||
public int Capacity { get; }
|
||||
public string Path { get; }
|
||||
|
||||
private readonly MemoryMappedFile _mmf;
|
||||
private readonly MemoryMappedViewAccessor _accessor;
|
||||
private readonly object _writeGate = new();
|
||||
|
||||
public PostMortemMmf(string path, int capacity = 1000)
|
||||
{
|
||||
if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
|
||||
Capacity = capacity;
|
||||
Path = path;
|
||||
|
||||
var fileBytes = HeaderBytes + capacity * EntryBytes;
|
||||
Directory.CreateDirectory(System.IO.Path.GetDirectoryName(path)!);
|
||||
|
||||
var fs = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.Read);
|
||||
fs.SetLength(fileBytes);
|
||||
_mmf = MemoryMappedFile.CreateFromFile(fs, null, fileBytes,
|
||||
MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, leaveOpen: false);
|
||||
_accessor = _mmf.CreateViewAccessor(0, fileBytes, MemoryMappedFileAccess.ReadWrite);
|
||||
|
||||
// Initialize header if blank/garbage.
|
||||
if (_accessor.ReadInt32(0) != Magic)
|
||||
{
|
||||
_accessor.Write(0, Magic);
|
||||
_accessor.Write(4, Version);
|
||||
_accessor.Write(8, capacity);
|
||||
_accessor.Write(12, 0); // writeIndex
|
||||
}
|
||||
}
|
||||
|
||||
public void Write(long opKind, string message)
|
||||
{
|
||||
lock (_writeGate)
|
||||
{
|
||||
var idx = _accessor.ReadInt32(12);
|
||||
var offset = HeaderBytes + idx * EntryBytes;
|
||||
|
||||
_accessor.Write(offset + 0, DateTimeOffset.UtcNow.ToUnixTimeMilliseconds());
|
||||
_accessor.Write(offset + 8, opKind);
|
||||
|
||||
var msgBytes = Encoding.UTF8.GetBytes(message ?? string.Empty);
|
||||
var copy = Math.Min(msgBytes.Length, MessageCapacity - 1);
|
||||
_accessor.WriteArray(offset + MessageOffset, msgBytes, 0, copy);
|
||||
_accessor.Write(offset + MessageOffset + copy, (byte)0); // null terminator
|
||||
|
||||
var next = (idx + 1) % Capacity;
|
||||
_accessor.Write(12, next);
|
||||
}
|
||||
}
|
||||
|
||||
/// <summary>Reads all entries in order (oldest → newest). Safe to call from another process.</summary>
|
||||
public PostMortemEntry[] ReadAll()
|
||||
{
|
||||
var magic = _accessor.ReadInt32(0);
|
||||
if (magic != Magic) return [];
|
||||
|
||||
var capacity = _accessor.ReadInt32(8);
|
||||
var writeIndex = _accessor.ReadInt32(12);
|
||||
|
||||
var entries = new PostMortemEntry[capacity];
|
||||
var count = 0;
|
||||
for (var i = 0; i < capacity; i++)
|
||||
{
|
||||
var slot = (writeIndex + i) % capacity;
|
||||
var offset = HeaderBytes + slot * EntryBytes;
|
||||
|
||||
var ts = _accessor.ReadInt64(offset + 0);
|
||||
if (ts == 0) continue; // unwritten
|
||||
|
||||
var op = _accessor.ReadInt64(offset + 8);
|
||||
var msgBuf = new byte[MessageCapacity];
|
||||
_accessor.ReadArray(offset + MessageOffset, msgBuf, 0, MessageCapacity);
|
||||
var nulTerm = Array.IndexOf<byte>(msgBuf, 0);
|
||||
var msg = Encoding.UTF8.GetString(msgBuf, 0, nulTerm < 0 ? MessageCapacity : nulTerm);
|
||||
|
||||
entries[count++] = new PostMortemEntry(ts, op, msg);
|
||||
}
|
||||
|
||||
Array.Resize(ref entries, count);
|
||||
return entries;
|
||||
}
|
||||
|
||||
public void Dispose()
|
||||
{
|
||||
_accessor.Dispose();
|
||||
_mmf.Dispose();
|
||||
}
|
||||
}
|
||||
|
||||
public readonly record struct PostMortemEntry(long UtcUnixMs, long OpKind, string Message);
|
||||
@@ -0,0 +1,40 @@
|
||||
using System;
|
||||
using System.Collections.Generic;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Stability;
|
||||
|
||||
/// <summary>
|
||||
/// Frequency-capped soft-recycle decision per <c>driver-stability.md §"Recycle Policy"</c>.
|
||||
/// Default cap: 1 soft recycle per hour. Scheduled recycle at 03:00 local; supervisor reads
|
||||
/// <see cref="ShouldSoftRecycleScheduled"/> to decide.
|
||||
/// </summary>
|
||||
public sealed class RecyclePolicy
|
||||
{
|
||||
public TimeSpan SoftRecycleCap { get; init; } = TimeSpan.FromHours(1);
|
||||
public int DailyRecycleHourLocal { get; init; } = 3;
|
||||
|
||||
private readonly List<DateTime> _recentRecyclesUtc = new();
|
||||
|
||||
/// <summary>Returns true if a soft recycle would be allowed under the frequency cap.</summary>
|
||||
public bool TryRequestSoftRecycle(DateTime utcNow, out string? reason)
|
||||
{
|
||||
_recentRecyclesUtc.RemoveAll(t => utcNow - t > SoftRecycleCap);
|
||||
if (_recentRecyclesUtc.Count > 0)
|
||||
{
|
||||
reason = $"soft-recycle frequency cap: last recycle was {(utcNow - _recentRecyclesUtc[_recentRecyclesUtc.Count - 1]).TotalMinutes:F1} min ago";
|
||||
return false;
|
||||
}
|
||||
_recentRecyclesUtc.Add(utcNow);
|
||||
reason = null;
|
||||
return true;
|
||||
}
|
||||
|
||||
public bool ShouldSoftRecycleScheduled(DateTime localNow, ref DateTime lastScheduledDateLocal)
|
||||
{
|
||||
if (localNow.Hour != DailyRecycleHourLocal) return false;
|
||||
if (localNow.Date <= lastScheduledDateLocal.Date) return false;
|
||||
|
||||
lastScheduledDateLocal = localNow.Date;
|
||||
return true;
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,36 @@
|
||||
<Project Sdk="Microsoft.NET.Sdk">
|
||||
|
||||
<PropertyGroup>
|
||||
<OutputType>Exe</OutputType>
|
||||
<TargetFramework>net48</TargetFramework>
|
||||
<!-- Decision #23: x86 required for MXAccess COM interop. Currently AnyCPU is OK because
|
||||
the actual MXAccess code lift is deferred (it stays in the v1 Host until the Phase 2
|
||||
parity gate); flip to x86 when Task B.1 "move Galaxy code" actually executes. -->
|
||||
<PlatformTarget>AnyCPU</PlatformTarget>
|
||||
<Nullable>enable</Nullable>
|
||||
<LangVersion>latest</LangVersion>
|
||||
<TreatWarningsAsErrors>true</TreatWarningsAsErrors>
|
||||
<GenerateDocumentationFile>true</GenerateDocumentationFile>
|
||||
<NoWarn>$(NoWarn);CS1591</NoWarn>
|
||||
<RootNamespace>ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host</RootNamespace>
|
||||
<AssemblyName>OtOpcUa.Driver.Galaxy.Host</AssemblyName>
|
||||
</PropertyGroup>
|
||||
|
||||
<ItemGroup>
|
||||
<PackageReference Include="System.IO.Pipes.AccessControl" Version="5.0.0"/>
|
||||
<PackageReference Include="System.Memory" Version="4.5.5"/>
|
||||
<PackageReference Include="System.Threading.Tasks.Extensions" Version="4.5.4"/>
|
||||
<PackageReference Include="Serilog" Version="4.2.0"/>
|
||||
<PackageReference Include="Serilog.Sinks.File" Version="7.0.0"/>
|
||||
</ItemGroup>
|
||||
|
||||
<ItemGroup>
|
||||
<ProjectReference Include="..\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Shared.csproj"/>
|
||||
</ItemGroup>
|
||||
|
||||
<ItemGroup>
|
||||
<NuGetAuditSuppress Include="https://github.com/advisories/GHSA-37gx-xxp4-5rgx"/>
|
||||
<NuGetAuditSuppress Include="https://github.com/advisories/GHSA-w3x6-4m5h-cxqf"/>
|
||||
</ItemGroup>
|
||||
|
||||
</Project>
|
||||
Reference in New Issue
Block a user