Commit Graph

7 Commits

Author SHA1 Message Date
Joseph Doherty
ad131932d3 Phase 6.4 Stream B.2-B.4 server-side — EquipmentImportBatch staging + FinaliseBatch transaction
Closes the server-side/data-layer piece of Phase 6.4 Stream B.2-B.4. The
CSV-import preview + modal UI (Stream B.3/B.5) still belongs to the Admin
UI follow-up — this PR owns the staging tables + atomic finalise alone.

Configuration:
- New EquipmentImportBatch entity (Id, ClusterId, CreatedBy, CreatedAtUtc,
  RowsStaged/Accepted/Rejected, FinalisedAtUtc?). Composite index on
  (CreatedBy, FinalisedAtUtc) powers the Admin preview modal's "my open
  batches" query.
- New EquipmentImportRow entity — one row per CSV row, 8 required columns
  from decision #117 + 9 optional from decision #139 + IsAccepted flag +
  RejectReason. FK to EquipmentImportBatch with cascade delete so
  DropBatch collapses the whole tree.
- EF migration 20260419_..._AddEquipmentImportBatch.
- SchemaComplianceTests expected tables list gains the two new tables.

Admin.Services.EquipmentImportBatchService:
- CreateBatchAsync — new header row, caller-supplied ClusterId + CreatedBy.
- StageRowsAsync(batchId, acceptedRows, rejectedRows) — bulk-inserts the
  parsed CSV rows into staging. Rejected rows carry LineNumberInFile +
  RejectReason for the preview modal. Throws when the batch is finalised.
- DropBatchAsync — removes batch + cascaded rows. Throws when the batch
  was already finalised (rollback via staging is not a time machine).
- FinaliseBatchAsync(batchId, generationId, driverInstanceId, unsLineId) —
  atomic apply. Opens an EF transaction when the provider supports it
  (SQL Server in prod; InMemory in tests skips the tx), bulk-inserts
  every accepted staging row into Equipment, stamps
  EquipmentImportBatch.FinalisedAtUtc, commits. Failure rolls back so
  Equipment never partially mutates. Idempotent-under-double-call:
  second finalise throws ImportBatchAlreadyFinalisedException.
- ListByUserAsync(createdBy, includeFinalised) — the Admin preview modal's
  backing query. OrderByDescending on CreatedAtUtc so the most-recent
  batch shows first.
- Two exception types: ImportBatchNotFoundException +
  ImportBatchAlreadyFinalisedException.

ExternalIdReservation merging (ZTag + SAPID fleet-wide uniqueness) is NOT
done here — a narrower follow-up wires it once the concurrent-insert test
matrix is green.

Tests (10 new EquipmentImportBatchServiceTests, all pass):
- CreateBatch populates Id + CreatedAtUtc + zero-ed counters.
- StageRows accepted + rejected both persist; counters advance.
- DropBatch cascades row delete.
- DropBatch after finalise throws.
- Finalise translates accepted staging rows → Equipment under the target
  GenerationId + DriverInstanceId + UnsLineId.
- Finalise twice throws.
- Finalise of unknown batch throws.
- Stage after finalise throws.
- ListByUserAsync filters by creator + finalised flag.
- Drop of unknown batch is a no-op (idempotent rollback).

Full solution dotnet test: 1235 passing (was 1225, +10). Pre-existing
Client.CLI Subscribe flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:55:39 -04:00
Joseph Doherty
7b50118b68 Phase 6.1 Stream A follow-up — DriverInstance.ResilienceConfig JSON column + parser + OtOpcUaServer wire-in
Closes the Phase 6.1 Stream A.2 "per-instance overrides bound from
DriverInstance.ResilienceConfig JSON column" work flagged as a follow-up
when Stream A.1 shipped in PR #78. Every driver can now override its Polly
pipeline policy per instance instead of inheriting pure tier defaults.

Configuration:
- DriverInstance entity gains a nullable `ResilienceConfig` string column
  (nvarchar(max)) + SQL check constraint `CK_DriverInstance_ResilienceConfig_IsJson`
  that enforces ISJSON when not null. Null = use tier defaults (decision
  #143 / unchanged from pre-Phase-6.1).
- EF migration `20260419161008_AddDriverInstanceResilienceConfig`.
- SchemaComplianceTests expected-constraint list gains the new CK name.

Core.Resilience.DriverResilienceOptionsParser:
- Pure-function parser. ParseOrDefaults(tier, json, out diag) returns the
  effective DriverResilienceOptions — tier defaults with per-capability /
  bulkhead overrides layered on top when the JSON payload supplies them.
  Partial policies (e.g. Read { retryCount: 10 }) fill missing fields from
  the tier default for that capability.
- Malformed JSON falls back to pure tier defaults + surfaces a human-readable
  diagnostic via the out parameter. Callers log the diag but don't fail
  startup — a misconfigured ResilienceConfig must not brick a working
  driver.
- Property names + capability keys are case-insensitive; unrecognised
  capability names are logged-and-skipped; unrecognised shape-level keys
  are ignored so future shapes land without a migration.

Server wire-in:
- OtOpcUaServer gains two optional ctor params: `tierLookup` (driverType →
  DriverTier) + `resilienceConfigLookup` (driverInstanceId → JSON string).
  CreateMasterNodeManager now resolves tier + JSON for each driver, parses
  via DriverResilienceOptionsParser, logs the diagnostic if any, and
  constructs CapabilityInvoker with the merged options instead of pure
  Tier A defaults.
- OpcUaApplicationHost threads both lookups through. Default null keeps
  existing tests constructing without either Func unchanged (falls back
  to Tier A + tier defaults exactly as before).

Tests (13 new DriverResilienceOptionsParserTests):
- null / whitespace / empty-object JSON returns pure tier defaults.
- Malformed JSON falls back + surfaces diagnostic.
- Read override merged into tier defaults; other capabilities untouched.
- Partial policy fills missing fields from tier default.
- Bulkhead overrides honored.
- Unknown capability skipped + surfaced in diagnostic.
- Property names + capability keys are case-insensitive.
- Every tier × every capability × empty-JSON round-trips tier defaults
  exactly (theory).

Full solution dotnet test: 1215 passing (was 1202, +13). Pre-existing
Client.CLI Subscribe flake unchanged.

Production wiring (Program.cs) example:
  Func<string, DriverTier> tierLookup = type => type switch
  {
      "Galaxy" => DriverTier.C,
      "Modbus" or "S7" => DriverTier.B,
      "OpcUaClient" => DriverTier.A,
      _ => DriverTier.A,
  };
  Func<string, string?> cfgLookup = id =>
      db.DriverInstances.AsNoTracking().FirstOrDefault(x => x.DriverInstanceId == id)?.ResilienceConfig;
  var host = new OpcUaApplicationHost(..., tierLookup: tierLookup, resilienceConfigLookup: cfgLookup);

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:21:42 -04:00
Joseph Doherty
0fcdfc7546 Phase 6.2 Stream A — LdapGroupRoleMapping entity + EF migration + CRUD service
Stream A.1-A.2 per docs/v2/implementation/phase-6-2-authorization-runtime.md.
Seed-data migration (A.3) is a separate follow-up once production LDAP group
DNs are finalised; until then CRUD via the Admin UI handles the fleet set up.

Configuration:
- New AdminRole enum {ConfigViewer, ConfigEditor, FleetAdmin} — string-stored.
- New LdapGroupRoleMapping entity with Id (surrogate PK), LdapGroup (512 chars),
  Role (AdminRole enum), ClusterId (nullable, FK to ServerCluster), IsSystemWide,
  CreatedAtUtc, Notes.
- EF config: UX_LdapGroupRoleMapping_Group_Cluster unique index on
  (LdapGroup, ClusterId) + IX_LdapGroupRoleMapping_Group hot-path index on
  LdapGroup for sign-in lookups. Cluster FK cascades on cluster delete.
- Migration 20260419_..._AddLdapGroupRoleMapping generated via `dotnet ef`.

Configuration.Services:
- ILdapGroupRoleMappingService — CRUD surface. Declared as control-plane only
  per decision #150; the OPC UA data-path evaluator must NOT depend on this
  interface (Phase 6.2 compliance check on control/data-plane separation).
  GetByGroupsAsync is the hot-path sign-in lookup.
- LdapGroupRoleMappingService (EF Core impl) enforces the write-time invariant
  "exactly one of (ClusterId populated, IsSystemWide=true)" and surfaces
  InvalidLdapGroupRoleMappingException on violation. Create auto-populates Id
  + CreatedAtUtc when omitted.

Tests (9 new, all pass) in Configuration.Tests:
- Create sets Id + CreatedAtUtc.
- Create rejects empty LdapGroup.
- Create rejects IsSystemWide=true with populated ClusterId.
- Create rejects IsSystemWide=false with null ClusterId.
- GetByGroupsAsync returns matching rows only.
- GetByGroupsAsync with empty input returns empty (no full-table scan).
- ListAllAsync orders by group then cluster.
- Delete removes the target row.
- Delete of unknown id is a no-op.

Microsoft.EntityFrameworkCore.InMemory 10.0.0 added to Configuration.Tests for
the service-level tests (schema-compliance tests still use the live SQL
fixture).

SchemaComplianceTests updated to expect the new LdapGroupRoleMapping table.

Full solution dotnet test: 1051 passing (baseline 906, Phase 6.1 shipped at
1042, Phase 6.2 Stream A adds 9 = 1051). Pre-existing Client.CLI Subscribe
flake unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 09:18:06 -04:00
Joseph Doherty
cbcaf6593a Phase 6.1 Stream E (data layer) — DriverInstanceResilienceStatus entity + DriverResilienceStatusTracker + EF migration
Ships the data + runtime layer of Stream E. The SignalR hub and Blazor /hosts
page refresh (E.2-E.3) are follow-up work paired with the visual-compliance
review per Phase 6.4 patterns — documented as a deferred follow-up below.

Configuration:
- New entity DriverInstanceResilienceStatus with:
  DriverInstanceId, HostName (composite PK),
  LastCircuitBreakerOpenUtc, ConsecutiveFailures, CurrentBulkheadDepth,
  LastRecycleUtc, BaselineFootprintBytes, CurrentFootprintBytes,
  LastSampledUtc.
- Separate from DriverHostStatus (per-host connectivity view) so a Running
  host that has tripped its breaker or is nearing its memory ceiling shows up
  distinctly on Admin /hosts. Admin page left-joins both for display.
- OtOpcUaConfigDbContext + Fluent-API config + IX_DriverResilience_LastSampled
  index for the stale-sample filter query.
- EF migration: 20260419124034_AddDriverInstanceResilienceStatus.

Core.Resilience:
- DriverResilienceStatusTracker — process-singleton in-memory tracker keyed on
  (DriverInstanceId, HostName). CapabilityInvoker + MemoryTracking +
  MemoryRecycle callers record failure/success/breaker-open/recycle/footprint
  events; a HostedService (Stream E.2 follow-up) samples this tracker every
  5 s and persists to the DB. Pure in-memory keeps tests fast + the core
  free of EF/SQL dependencies.

Tests:
- DriverResilienceStatusTrackerTests (9 new, all pass): tryget-before-write
  returns null; failures accumulate; success resets; breaker/recycle/footprint
  fields populate; per-host isolation; snapshot returns all pairs; concurrent
  writes don't lose counts.
- SchemaComplianceTests: expected-tables list updated to include the new
  DriverInstanceResilienceStatus table.

Full solution dotnet test: 1042 passing (baseline 906, +136 for Phase 6.1 so
far across Streams A/B/C/D/E.1). Pre-existing Client.CLI Subscribe flake
unchanged.

Deferred to follow-up PR (E.2/E.3):
- ResilienceStatusPublisher HostedService that samples DriverResilienceStatusTracker
  every 5 s + upserts DriverInstanceResilienceStatus rows.
- Admin FleetStatusHub SignalR hub pushing LastCircuitBreakerOpenUtc /
  CurrentBulkheadDepth / LastRecycleUtc on change.
- Admin /hosts Blazor column additions (red badge when
  ConsecutiveFailures > breakerThreshold / 2). Visual-compliance reviewer
  signoff alongside Phase 6.4 admin-ui patterns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:47:43 -04:00
Joseph Doherty
854c3bcfec Phase 6.1 Stream D — LiteDB generation-sealed config cache + ResilientConfigReader + UsingStaleConfig flag
Closes Stream D per docs/v2/implementation/phase-6-1-resilience-and-observability.md.

New Configuration.LocalCache types (alongside the existing single-file
LiteDbConfigCache):

- GenerationSealedCache — file-per-generation sealed snapshots per decision
  #148. Each SealAsync writes <cache-root>/<clusterId>/<generationId>.db as a
  read-only LiteDB file, then atomically publishes the CURRENT pointer via
  temp-file + File.Replace. Prior-generation files stay on disk for audit.
  Mixed-generation reads are structurally impossible: ReadCurrentAsync opens
  the single file named by CURRENT. Corruption of the pointer or the sealed
  file raises GenerationCacheUnavailableException — fails closed, never falls
  back silently to an older generation. TryGetCurrentGenerationId returns the
  pointer value or null for diagnostics.

- StaleConfigFlag — thread-safe (Volatile.Read/Write) bool. MarkStale when a
  read fell back to the cache; MarkFresh when a central-DB read succeeded.
  Surfaced on /healthz body and Admin /hosts (Stream C wiring already in
  place).

- ResilientConfigReader — wraps a central-DB fetch function with the Stream
  D.2 pipeline: timeout 2 s → retry N× jittered (skipped when retryCount=0) →
  fallback to the sealed cache. Toggles StaleConfigFlag per outcome. Read path
  only — the write path is expected to bypass this wrapper and fail hard on
  DB outage so inconsistent writes never land. Cancellation passes through
  and is NOT retried.

Configuration.csproj:
- Polly.Core 8.6.6 + Microsoft.Extensions.Logging.Abstractions added.

Tests (17 new, all pass):
- GenerationSealedCacheTests (10): first-boot-no-snapshot throws
  GenerationCacheUnavailableException (D.4 scenario C), seal-then-read round
  trip, sealed file is ReadOnly on disk, pointer advances to latest, prior
  generation file preserved, corrupt sealed file fails closed, missing sealed
  file fails closed, corrupt pointer fails closed (D.4 scenario B), same
  generation sealed twice is idempotent, independent clusters don't
  interfere.
- ResilientConfigReaderTests (4): central-DB success returns value + marks
  fresh; central-DB failure exhausts retries + falls back to cache + marks
  stale (D.4 scenario A); central-DB + cache both unavailable throws;
  cancellation not retried.
- StaleConfigFlagTests (3): default is fresh; toggles; concurrent writes
  converge.

Full solution dotnet test: 1033 passing (baseline 906, +127 net across Phase
6.1 Streams A/B/C/D). Pre-existing Client.CLI Subscribe flake unchanged.

Integration into Configuration read paths (DriverInstance enumeration,
LdapGroupRoleMapping fetches, etc.) + the sp_PublishGeneration hook that
writes sealed files lands in the Phase 6.1 Stream E / Admin-refresh PR where
the DB integration surfaces are already touched. Existing LiteDbConfigCache
continues serving its single-file role for the NodeBootstrap path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 08:33:32 -04:00
Joseph Doherty
8464e3f376 Phase 3 PR 33 — DriverHostStatus entity + EF migration (data-layer for LMX #7). New DriverHostStatus entity with composite key (NodeId, DriverInstanceId, HostName) persists each server node's per-host connectivity view — one row per (server node, driver instance, probe-reported host), which means a redundant 2-node cluster with one Galaxy driver reporting 3 platforms produces 6 rows because each server node owns its own runtime view of the shared host topology, not 3. Fields: NodeId (64), DriverInstanceId (64), HostName (256 — fits Galaxy FQDNs and Modbus host:port strings), State (DriverHostState enum — Unknown/Running/Stopped/Faulted, persisted as nvarchar(16) via HasConversion<string> so DBAs inspecting the table see readable state names not ordinals), StateChangedUtc + LastSeenUtc (datetime2(3) — StateChangedUtc tracks actual transitions while LastSeenUtc advances on every publisher heartbeat so the Admin UI can flag stale rows from a crashed Server independent of State), Detail (nullable 1024 — exception message from the driver's probe when Faulted, null otherwise).
DriverHostState enum lives in Configuration.Enums/ rather than reusing Core.Abstractions.HostState so the Configuration project stays free of driver-runtime dependencies (it's referenced by both the Admin process and the Server process, so pulling in the driver-abstractions assembly to every Admin build would be unnecessary weight). The server-side publisher hosted service (follow-up PR 34) will translate HostStatusChangedEventArgs.NewState to this enum on every transition.
No foreign key to ClusterNode — a Server may start reporting host status before its ClusterNode row exists (first-boot bootstrap), and we'd rather keep the status row than drop it. The Admin-side service that renders the dashboard will left-join on NodeId when presenting. Two indexes declared: IX_DriverHostStatus_Node drives the per-cluster drill-down (Admin UI joins ClusterNode on ClusterId to pick which NodeIds to fetch), IX_DriverHostStatus_LastSeen drives the stale-row query (now - LastSeen > threshold).
EF migration AddDriverHostStatus creates the table + PK + both indexes. Model snapshot updated. SchemaComplianceTests expected-tables list extended. DriverHostStatusTests (3 new cases, category SchemaCompliance, uses the shared fixture DB): composite key allows same (host, driver) across different nodes AND same (node, host) across different drivers — both real-world cases the publisher needs to support; upsert-in-place pattern (fetch-by-composite-PK, mutate, save) produces one row not two — the pattern the publisher will use; State enum persists as string not int — reading the DB via ADO.NET returns 'Faulted' not '3'.
Configuration.Tests SchemaCompliance suite: 10 pass / 0 fail (7 prior + 3 new). Configuration build clean. No Server or Admin code changes yet — publisher + /hosts page are PR 34.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:38:41 -04:00
Joseph Doherty
01fd90c178 Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:35:25 -04:00