Commit Graph

40 Commits

Author SHA1 Message Date
Joseph Doherty
4695a5c88e Phase 6 — Draft 4 implementation plans covering v2 unimplemented features + adversarial review + adjustments. After drivers were paused per user direction, audited the v2 plan for features documented-but-unshipped and identified four coherent tracks that had no implementation plan at all. Each plan follows the docs/v2/implementation/phase-*.md template (DRAFT status, branch name, Stream A-E task breakdown, Compliance Checks, Risks, Completion Checklist). docs/v2/implementation/phase-6-1-resilience-and-observability.md (243 lines) covers Polly resilience pipelines wired to every capability interface, Tier A/B/C runtime enforcement (memory watchdog generalized beyond Galaxy, scheduled recycle per decision #67, wedge detection), health endpoints on :4841, structured Serilog with correlation IDs, LiteDB local-cache fallback per decision #36. phase-6-2-authorization-runtime.md (145 lines) wires ACL enforcement on every OPC UA Read/Write/Subscribe/Call path + LDAP-group-to-admin-role grants per decisions #105 and #129 -- runtime permission-trie evaluator over the 6-level Cluster/Namespace/UnsArea/UnsLine/Equipment/Tag hierarchy, per-session cache invalidated on generation-apply + LDAP-cache expiry. phase-6-3-redundancy-runtime.md (165 lines) lands the non-transparent warm/hot redundancy runtime per decisions #79-85: dynamic ServiceLevel node, ServerUriArray peer broadcast, mid-apply dip via sp_PublishGeneration hook, operator-driven role transition (no auto-election -- plan remains explicit about what's out of scope). phase-6-4-admin-ui-completion.md (178 lines) closes Phase 1 Stream E completion-checklist items that never landed: UNS drag-reorder + impact preview, Equipment CSV import, 5-identifier search, draft-diff viewer enhancements, OPC 40010 _base Identification field exposure per decisions #138-139. Each plan then got a Codex adversarial-review pass (codex mcp tool, read-only sandbox, synchronous). Reviews explicitly targeted decision-log conflicts, API-shape assumptions, unbounded blast radius, under-specified state transitions, and testing holes. Appended 'Adversarial Review — 2026-04-19' section to each plan with numbered findings (severity / finding / why-it-matters / adjustment accepted). Review surfaced real substantive issues that the initial drafts glossed over: Phase 6.1 auto-retry conflicting with decisions #44-45 no-auto-write-retry rule; Phase 6.1 per-driver-instance pipeline breaking decision #35's per-device isolation; Phase 6.1 recycle/watchdog at Tier A/B breaching decisions #73-74 Tier-C-only constraint; Phase 6.2 conflating control-plane LdapGroupRoleMapping with data-plane ACL grants; Phase 6.2 missing Browse enforcement entirely; Phase 6.2 subscription re-authorization policy unresolved between create-time-only and per-publish; Phase 6.3 ServiceLevel=0 colliding with OPC UA Part 5 Maintenance semantics; Phase 6.3 ServerUriArray excluding self (spec-bug); Phase 6.3 apply-window counter race on cancellation; Phase 6.3 client cutover for Kepware/Aveva OI Gateway is unverified hearsay; Phase 6.4 stale UNS impact preview overwriting concurrent draft edits; Phase 6.4 identifier contract drifting from admin-ui.md canonical set (ZTag/MachineCode/SAPID/EquipmentId/EquipmentUuid, not ZTag/SAPID/UniqueId/Alias1/Alias2); Phase 6.4 CSV import atomicity internally contradictory (single txn vs chunked inserts); Phase 6.4 OPC 40010 field list not matching decision #139. Every finding has an adjustment in the plan doc -- plans are meant to be executable from the next session with the critique already baked in rather than a clean draft that would run into the same issues at implementation time. Codex thread IDs cited in each plan's review section for reproducibility. Pure documentation PR -- no code changes. Plans are DRAFT status; each becomes its own implementation phase with its own entry-gate + exit-gate when business prioritizes. 2026-04-19 03:15:00 -04:00
Joseph Doherty
56d8af8bdb Phase 2 PR 61 -- Close V1_ARCHIVE_STATUS.md; Phase 2 Streams D + E done. Purely a documentation-closure PR. The v1 archive deletion itself happened across earlier PRs: PR 2 on phase-2-stream-d archive-marked the four v1 projects (IsTestProject=false so dotnet test slnx bypassed them); Phase 3 PR 18 deleted the archived project source trees. What remained on disk was stale bin/obj residue from pre-deletion builds -- git never tracked those, so removing them from the working tree is cosmetic only (no source-file diff in this PR). What this PR actually changes: V1_ARCHIVE_STATUS.md is rewritten from 'Deletion plan (Phase 2 PR 3)' pre-work prose to a CLOSED retrospective that (a) lists all five v1 directories as deleted with check-marks (src/OtOpcUa.Host, src/Historian.Aveva, tests/Historian.Aveva.Tests, tests/Tests.v1Archive, tests/IntegrationTests), (b) names the parity-bar tests that now fill the role the 494 v1 tests originally held (Driver.Galaxy.E2E cross-FX subprocess parity + stability-findings regression, per-component *.Tests projects, Driver.Modbus.IntegrationTests, LiveStack/ smoke tests), and (c) gives the closure timeline connecting PR 2 -> Phase 3 PR 18 -> this PR 61. Also added the Modbus TCP driver family as parity coverage that didn't exist in v1 (DL205 + S7-1500 + Mitsubishi MELSEC via pymodbus sim). Stream D (retire legacy Host) has been effectively done since Phase 3 PR 18; Stream E (parity validation) is done since PR 2 landed the Driver.Galaxy.E2E project with HostSubprocessParityTests + HierarchyParityTests + StabilityFindingsRegressionTests. This PR exists to definitively close the two pending Phase 2 tasks on the task list and give future-me (or anyone picking up Phase 2 retrospectives) a single 'what actually happened' doc instead of a 'what we plan to do' prose that didn't match reality. dotnet build ZB.MOM.WW.OtOpcUa.slnx: 0 errors, 200 warnings (all xunit1051 cancellation-token analyzer advisories, unchanged from v2 tip). No test regressions -- no source code changed. 2026-04-18 23:20:54 -04:00
8c89d603e8 Merge pull request 'Phase 3 PR 55 -- Mitsubishi MELSEC Modbus TCP quirks research doc' (#54) from phase-3-pr55-mitsubishi-research-doc into v2 2026-04-18 22:54:09 -04:00
Joseph Doherty
c506ea298a Phase 3 PR 55 -- Mitsubishi MELSEC Modbus TCP quirks research document. 451-line doc at docs/v2/mitsubishi.md mirroring the docs/v2/dl205.md template for the MELSEC family (Q-series + QJ71MT91, L-series + LJ71MT91, iQ-R + RJ71EN71, iQ-R built-in Ethernet, iQ-F FX5U built-in, FX3U + FX3U-ENET / FX3U-ENET-P502, FX3GE built-in). Like Siemens S7, MELSEC Modbus is a patchwork of per-site-configured add-on modules rather than a fixed firmware stack, but the MELSEC-specific traps are different enough to warrant their own document. Key findings worth flagging for the PR 58+ implementation track: (1) MODULE NAMING TRAP -- QJ71MB91 is SERIAL RTU, not TCP. The Q-series TCP module is QJ71MT91. Driver docs + config UI should surface this clearly because the confusion costs operators hours when they try to connect to an RS-232 module via Ethernet. (2) NO CANONICAL MAPPING -- every MELSEC Modbus site has a unique 'Modbus Device Assignment Parameter' block of up to 16 assignments (each binding a MELSEC device range like D0..D1023 to a Modbus-address range); the driver must treat the mapping as runtime config, not device-family profile. (3) X/Y BASE DEPENDS ON FAMILY -- Q/L/iQ-R use HEX notation for X/Y (X20 = decimal 32), FX/iQ-F use OCTAL (X20 = decimal 16, same as DL260); iQ-F has a GX Works3 project toggle that can flip this. Single biggest off-by-N source in MELSEC driver code -- driver address helper must take a family selector. (4) Word order CDAB across Q/L/iQ-R/iQ-F by default (CPU-level, not module-level) -- no user-configurable swap on the server side. FX5U's SWAP instruction is for CLIENT mode only. Driver Mitsubishi profile default must be ByteOrder.WordSwap, matching DL260 but OPPOSITE of Siemens S7. (5) D-registers are BINARY by default (opposite of DL205's BCD-by-default). FNC 18 BCD / FNC 19 BIN instructions confirm binary-by-default in the ladder. Caller must explicitly opt-in to Bcd16/Bcd32 tags when the ladder stores BCD, same pattern as DL205 but the default is inverted. (6) FX5U FIRMWARE GATE -- needs firmware >= 1.060 for native Modbus TCP server; older firmware is client-only. Surface a clear capability error on connect. (7) FX3U PORT 502 SPLIT -- the standard FX3U-ENET cannot bind port 502 (lower port range restricted on the firmware); only FX3U-ENET-P502 can. FX3U-ENET-ADP has no Modbus at all and is a common operator mis-purchase -- driver should surface 'module does not support Modbus' as a distinct error, not 'connection refused'. (8) QJ71MT91 does NOT support FC22 (Mask Write) or FC23 (Read-Write Multiple). iQ-R and iQ-F do. Driver bulk-read optimization must gate on module capability. (9) MAX CONNECTIONS -- 16 simultaneous on Q/L/iQ-R, 8 on FX5U and FX3U-ENET. (10) STOP-mode writes -- configurable on Q/L/iQ-R/iQ-F (default = accept writes even in STOP), always rejected with exception 04 on FX3U-ENET. Per-model test differentiation section names the tests Mitsubishi_QJ71MT91_*, Mitsubishi_FX5U_*, Mitsubishi_FX3U_ENET_*, with a shared Mitsubishi_Common_* fixture for CDAB-word-order + binary-not-BCD + standard-exception-codes tests. 17 cited references including primary Mitsubishi manuals (SH-080446 for QJ71MT91, JY997D56101 for FX5, SH-081259 for iQ-R Ethernet, JY997D18101 for FX3U-ENET) plus Ignition / Kepware / Fernhill / HMS third-party driver release notes. Three unconfirmed rumours flagged explicitly: iQ-R RJ71EN71 early firmware rumoured ABCD word order (no primary source), QJ71MT91 firmware < 2010-05 FC15 odd-byte-count truncation (forum report only), FX3U-ENET firmware < 1.14 out-of-order TxId echoes under load (unreproducible on bench). Pure documentation PR -- no code, no tests. Per-quirk implementation lands in PRs 58+. Research conducted 2026-04-18. 2026-04-18 22:51:28 -04:00
Joseph Doherty
9e2b5b330f Phase 3 PR 54 -- Siemens S7 Modbus TCP quirks research document. 485-line doc at docs/v2/s7.md mirroring the docs/v2/dl205.md template for the Siemens SIMATIC S7 family (S7-1200 / S7-1500 / S7-300 / S7-400 / ET 200SP / CP 343-1 / CP 443-1 / CP 343-1 Lean / MODBUSPN). Siemens S7 is fundamentally different from DL260: there is no fixed Modbus memory map baked into firmware -- every deployment runs MB_SERVER (S7-1200/1500/ET 200SP), MODBUSCP (S7-300/400 + CP), or MODBUSPN (S7-300/400 PN) library blocks wired up to user DBs via the MB_HOLD_REG / ADDR parameters. The driver's job is therefore to handle per-site CONFIG rather than per-family QUIRKS, and the doc makes that explicit. Key findings worth flagging for the PR 56+ implementation track: (1) S7 has no fixed memory map -- must accept per-site DriverConfig, cannot assume vendor-standard layout. (2) MB_SERVER requires NON-optimized DBs in TIA Portal; optimized DBs cause the library to return STATUS 0x8383 on every access -- the single most common S7 Modbus deployment bug in the field. (3) Word order is ABCD by default (big-endian bytes + big-endian words) across all Siemens S7 Modbus paths, which is the OPPOSITE of DL260 CDAB -- the Modbus driver's S7 profile default must be ByteOrder.BigEndian, not WordSwap. (4) MB_SERVER listens on ONE port per FB instance; multi-client support requires running MB_SERVER on 502 / 503 / 504 / ... simultaneously -- most clients assume port 502 multiplexes, which is wrong on S7. (5) CP 343-1 Lean is SERVER-ONLY and requires the separate 2XV9450-1MB00 MODBUS TCP CP library license; client mode calls return immediate error on Lean. (6) MB_SERVER does NOT filter Unit ID, accepts any value. Means the driver can't use Unit ID to detect 'direct vs gateway' topology. (7) FC23 Read-Write Multiple, FC22 Mask Write, FC20/21 File Records, FC43 Device Identification all return exception 01 Illegal Function on every S7 variant -- the driver MUST NOT attempt bulk-read optimisation via FC23 when talking to S7. (8) STOP-mode read/write behaviour is non-deterministic across firmware bands: reads may return cached data (library internal buffer), writes may succeed-silently or return exception 04 depending on CPU firmware version -- flagged as 'driver treats both as unavailable, do not distinguish'. Unconfirmed rumours flagged separately: 'V2.0+ reverses float byte order' claim (cited but not reproduced), STOP-mode caching location (folklore, no primary source). Per-model test differentiation section names the tests as S7_<model>_<behavior> matching the DL205 template convention (e.g. S7_1200_MB_SERVER_requires_non_optimized_DB, S7_343_1_Lean_rejects_client_mode, S7_FC23_returns_IllegalFunction). 31 cited references across the Siemens Industry Online Support entry-ID system (68011496 for MB_SERVER FAQ, etc.), TIA Portal library manuals, and three third-party driver vendor release notes (Kepware, Ignition, FactoryTalk). This is a pure documentation PR -- no code, no tests, no csproj changes. Per-quirk implementation lands in PRs 56+. Research conducted 2026-04-18 against latest publicly-available Siemens documentation; STOP-mode behaviour and MB_SERVER versioning specifically cross-checked against Siemens forum answers from 2024-2025. 2026-04-18 22:50:51 -04:00
Joseph Doherty
a05b84858d Phase 3 PR 43 — Swap ModbusPal to pymodbus for the integration-test simulator. Replaces the .xmpp profiles shipped in PR 42 with pymodbus 3.13.0 ModbusSimulatorServer JSON configs in tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests/Pymodbus/. Substantive reasons for the swap (rationale block in the test-plan doc): ModbusPal 1.6b is abandoned (last release ~2019), Java GUI-only with no headless mode in the official JAR, and only exposes 2 of the 4 standard Modbus tables (holding_registers + coils — no input_registers, no discrete_inputs). pymodbus is current stable, pure Python CLI (pip install pymodbus[simulator]==3.13.0), exposes all four tables, has built-in declarative actions (increment / random / timestamp / uptime) for dynamic registers, supports custom Python actions for anything more complex, and ships an optional aiohttp-based web UI / REST API for live inspection. Pip-installable on Windows; sidesteps the privileged-port admin requirement by defaulting to TCP 5020.
ModbusSimulatorFixture default port bumped from 502 to 5020 to match the pymodbus convention. Override via MODBUS_SIM_ENDPOINT for a real PLC on its native 502. Skip-message updated to point at the new Pymodbus\serve.ps1 wrapper instead of 'start ModbusPal'. csproj <None Update> rule swapped from ModbusPal/** to Pymodbus/** so the new JSON profiles + serve.ps1 + README copy to test-output as PreserveNewest.
standard.json — generic Modbus TCP server, slave id 1, port 5020, shared blocks=false (independent coils + HR address spaces, more textbook-PLC-like). HR[0..31] seeded with address-as-value via per-register uint16 entries, HR[100] auto-increments via the built-in increment action with parameters minval=0/maxval=65535 (drives subscribe-and-receive integration tests so they have a register that ticks without a write — pymodbus's increment ticks per-access not wall-clock, which is good enough for a 250ms-poll test), HR[200..209] scratch range left at 0 for write tests, coils 0..31 alternating, coils 100..109 scratch. write list covers 0..1023 so any test address is mutable.
dl205.json — AutomationDirect DirectLOGIC DL205/DL260 quirk simulator, slave id 1, port 5020, shared blocks=true (matches DL series memory model where coils/DI/HR overlay the same word address space). Each quirky register seeded with the pre-computed raw uint16 value documented in docs/v2/dl205.md, with an inline _quirk JSON-comment naming the behavior so future-me reading the file knows why HR[1040]=25928 means 'H' lo / 'e' hi (the user's headline string-byte-order finding). Encoded quirks: V0 marker at HR[0]=0xCAFE; V2000 at HR[1024]=0x2000; V40400 at HR[8448]=0x4040; 'Hello' string at HR[1040..1042] first-char-low-byte; Float32 1.5f at HR[1056..1057] in CDAB word order (low word first); BCD register at HR[1072]=0x1234; FC03-128-cap block at HR[1280..1407]; Y0/C0 coil markers at 2048/3072; scratch C-relays at 4000..4007.
serve.ps1 wrapper — pwsh script with a -Profile {standard|dl205} parameter switch. Validates pymodbus.simulator is on PATH (clearer message than the raw CommandNotFoundException), validates the profile JSON exists, builds the right --modbus_server/--modbus_device/--json_file/--http_port arg list, and execs pymodbus.simulator in the foreground. -HttpPort 0 disables the web UI. Foreground exec lets the operator Ctrl+C to stop without an extra control script.
README.md fully rewritten for pymodbus: install command (pip install 'pymodbus[simulator]==3.13.0' — pinned for reproducibility, [simulator] extra pulls aiohttp), per-profile reference tables, the same DL205 quirk → register table from PR 42 but adjusted for pymodbus paths, what's-NEW-vs-ModbusPal section (all four tables, raw uint16 seeding, declarative actions, custom Python action modules, headless, web UI, maintained), trade-offs section (float32-as-two-uint16s for explicit CDAB control, increment ticks per-access not wall-clock, shared-blocks mode for DL205 vs separate for Standard), file-format quick reference for hand-authoring more profiles. References pinned to the pymodbus readthedocs simulator/config + REST API pages.
docs/v2/modbus-test-plan.md harness section rewritten with the swap rationale; PR-history list updated to mark PR 42 SUPERSEDED by PR 43 and call out PR 44+ as the per-quirk implementation track. Test-conventions bullet about 'don't depend on ModbusPal state between tests' generalized to 'don't depend on simulator state' and a note added that pymodbus's REST API can reset state between facts if a test ever needs it.
DL205Profile.cs and DL205SmokeTests.cs xml-doc updated to reference pymodbus / dl205.json instead of ModbusPal / DL205.xmpp.
Functional validation deferred — Python isn't installed on this dev box (winget search returned no matches for Python.Python.3 exact). JSON parses structurally (PowerShell ConvertFrom-Json clean on both files), build clean, .json + serve.ps1 + README all copy to test-output as expected. User installs pymodbus when they want to actually run the simulator end-to-end; if pymodbus rejects the config the README's reference link to pymodbus's simulator/config schema doc is the right next stop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 20:35:26 -04:00
Joseph Doherty
9de96554dc Phase 3 PR 41 — Document AutomationDirect DL205 / DL260 Modbus quirks. Adds docs/v2/dl205.md (~300 lines, 8 H2 sections, primary-source citations) covering every place the DL205/DL260 family diverges from textbook Modbus or has non-obvious behavior a generic client gets wrong. Replaces the placeholder _pending_ list in modbus-test-plan.md with a confirmed-behaviors table that doubles as the integration-test roadmap.
The user explicitly flagged that DL205/DL260 strings don't follow Modbus convention; research turned up that and a lot more. Headline findings:
String packing — TWO chars per V-memory register but the FIRST char is in the LOW byte (opposite of the big-endian Modbus convention generic drivers default to). 'Hello' in V2000 reads back as 'eHll o\0' on a textbook decoder. Kepware's DirectLogic driver exposes a per-tag 'String Byte Order = Low/High' toggle specifically for this; we'll need the same. Null-terminated, no length prefix, no dedicated KSTR address space — strings live wherever ladder allocates them in V-memory.
V-memory addressing — DirectLOGIC's native V-memory is OCTAL (V2000, V40400) but Modbus is decimal. The CPU translates: V2000 octal = decimal 1024 = Modbus PDU 0x0400. The widespread 'V40400 = register 0' shorthand is wrong on modern firmware (that was DL05/DL06 relative mode); on H2-ECOM100 absolute mode (factory default) V40400 = PDU 0x2100. We'd surface this with an address-format helper in the device profile so operators write V2000 instead of computing 1024 by hand.
Word order CDAB for all 32-bit values — DL205 and DL260 agree, ECOM modules don't re-swap. Already supported via ModbusByteOrder.WordSwap; just needs to be the default in the DL205 profile.
BCD-as-default numeric storage — bit one I didn't expect. DirectLOGIC stores 'V2000 = 1234' as 0x1234 on the wire (BCD nibbles), not as 0x04D2 (decimal 1234). IEEE 754 Float32 only works when ladder used the explicit R type (LDR/OUTR instructions). We need a new decoder mode for BCD-encoded registers — current code assumes binary integers.
FC quantity caps — FC03/04 cap at 128 (above spec's 125 — Bonus territory, current code already respects 125), FC16 caps at 100 (BELOW spec's 123 — important bulk-write batching gotcha). Quantity overrun returns exception 03 IllegalDataValue.
Coil/discrete mappings — DL260: X0->discrete input 0, Y0->coil 2048, C0->coil 3072. SP specials at discrete input 1024-1535 RO. These are CPU-wired constants and cannot be remapped; need to be hardcoded in the DL205/DL260 device profile.
Register 0 — accepted on DL205/DL260 with ECOM in absolute mode, contrary to the widespread internet claim that 'DirectLOGIC rejects register 0'. That rumour was an older DL05/DL06 relative-mode artefact. Our ModbusProbeOptions.ProbeAddress default of 0 is therefore safe for DL205/DL260.
Exception codes — only the standard 01-04. Write-to-protected-bit returns 02 on newer firmware, 04 on older (firmware-transition revision unconfirmed); driver should map both to BadNotWritable. No proprietary exception codes.
Behavioral oddities — H2-ECOM100 accepts MAX 4 simultaneous TCP connections (5th refused at TCP accept). No TCP keepalive (intermediate NAT/firewall drops idle sockets after 2-5 min — periodic probe required). No mid-stream resync on malformed MBAP — driver must reconnect + replay. TxId-drop-under-load forum rumour is unconfirmed; our single-flight + TxId-match guard handles it either way.
Each H2 section ends with the integration-test names we'd ship per the modbus-test-plan.md DL205_<behavior> convention — twelve named test slots ready for PR 42+ to fill in one at a time. References (8) cited inline, primarily D2-USER-M, HA-ECOM-M, and the Kepware DirectLogic Ethernet driver manual which documents these vendor quirks explicitly because they have to cope with them.
modbus-test-plan.md DL205 section rewritten as a priority-ordered table with three columns (quirk / driver impact / test name), pointing the reader at dl205.md for the full reference. Operator-reported items separated into a tail subsection so future-me knows which behaviors are documented vs reproduced-on-hardware.
Pure documentation PR — no code changes. The actual driver work (string-byte-order option, BCD decoder mode, V-memory address helper, FC16 cap-per-device-family, multi-client TCP handling) lands one PR per quirk in PR 42+ as ModbusPal validation completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:49:35 -04:00
Joseph Doherty
aa8834a231 Phase 3 PR 40 — LiveStackSmokeTests: write-roundtrip + subscribe-receives-OnDataChange against the live Galaxy. Finishes LMX #5 by exercising the IWritable + ISubscribable capability paths end-to-end through the Proxy → OtOpcUaGalaxyHost service → MXAccess → real Galaxy.
Two new facts target DelmiaReceiver_001.TestAttribute — the writable Boolean UDA on the TestMachine_001 hierarchy in this dev Galaxy. The user nominated TestMachine_001 (the deployed test-target object) as a scratch surface for live testing; ZB query showed DelmiaReceiver_001 carries one dynamic_attribute named TestAttribute (mx_data_type=1=Boolean, lock_type=0=writable, security_classification=1=Operate). Naming makes the intent obvious — the attribute exists for exactly this kind of integration testing — and Boolean keeps the assertions simple (invert, write, read back).
Write_then_read_roundtrips_a_writable_Boolean_attribute_on_TestMachine_001: reads the current value as the baseline (Galaxy may return Uncertain quality until the Engine has scanned the attribute at least once — we don't read into a typed bool until Status is Good), inverts it, writes via IWritable, then polls reads in a 5s loop until either the new value comes back or the budget expires. The scan-window poll (rather than a single read after a fixed delay) accommodates Galaxy's variable scan latency on a fresh service start. Restore-on-finally writes the original value back so re-running the test doesn't accumulate a flipped TestAttribute on the dev box (Galaxy holds UDA values across runs since they're deployed). Best-effort restore — swallows exceptions so a failure in restore doesn't mask the primary assertion.
Subscribe_fires_OnDataChange_with_initial_value_then_again_after_a_write: subscribes to the same attribute with a 250ms publishing interval, captures every OnDataChange notification onto a thread-safe ConcurrentQueue (MXAccess advisory fires on its own thread per Galaxy's COM apartment model — must not block it), waits up to 5s for the initial-value callback (per ISubscribable's contract: 'driver MAY fire OnDataChange immediately with the current value'), records the queue depth as a baseline, writes the toggled value, waits up to 8s for at least one MORE notification, then searches the queue tail for the notification carrying the toggled value (initial value may appear multiple times before the write commits — looking at the tail finds the post-write delta even if the queue grew during the wait window). Unsubscribes on finally + restores baseline.
Both tests use Convert.ToBoolean(value ?? false) to defensively handle the Boxed-vs-typed quirk in MessagePack-deserialized Galaxy values — depending on the wire encoding the Boolean might come back as System.Boolean or System.Object boxing one. Convert.ToBoolean handles both. Same pattern in OnReadValue's existing usage.
WaitForAsync helper does the loop+budget pattern shared by both tests.
PR 40 is the code side of LMX #5's final two deferred facts. To actually run them green requires re-executing from a normal (non-admin) PowerShell — the elevated-shell skip from PR 39 fires correctly under bash + sc.exe-context (verified). lmx-followups.md #5 updated to note the new facts + the run command + the one remaining genuine follow-up (alarm-condition fact when an alarm-flagged attribute is deployed on TestMachine_001).
Test posture from elevated bash: 7 LiveStackSmokeTests facts discovered (was 5; +2 new), all skip cleanly with the elevation message. Build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 19:38:34 -04:00
Joseph Doherty
52a29100b1 Phase 3 PR 38 — DriverNodeManager HistoryRead override (LMX #1 finish). Wires the OPC UA HistoryRead service through CustomNodeManager2's four protected per-kind hooks — HistoryReadRawModified / HistoryReadProcessed / HistoryReadAtTime / HistoryReadEvents — each dispatching to the driver's IHistoryProvider capability (PR 35 for ReadAtTime + ReadEvents on top of PR 19-era ReadRaw + ReadProcessed). Was the last missing piece of the end-to-end HistoryRead path: PR 10 + PR 11 shipped the Galaxy.Host IPC contracts, PR 35 surfaced them on IHistoryProvider + GalaxyProxyDriver, but no server-side handler bridged OPC UA HistoryRead service requests onto the capability interface. Now it does.
Per-kind override shape: each hook receives the pre-filtered nodesToProcess list (NodeHandles for nodes this manager claimed), iterates them, resolves handle.NodeId.Identifier to the driver-side full reference string, and dispatches to the right IHistoryProvider method. Write back into the outer results + errors slots at handle.Index (not the local loop counter — nodesToProcess is a filtered subset of nodesToRead, so indexing by the loop counter lands in the wrong slot for mixed-manager batches). WriteResult helper sets both results[i] AND errors[i]; this matters because MasterNodeManager merges them and leaving errors[i] at its default (BadHistoryOperationUnsupported) overrides a Good result with Unsupported on the wire — this was the subtle failure mode that masked a correctly-constructed HistoryData response during debugging. Failure-isolation per node: NotSupportedException from a driver that doesn't implement a particular HistoryProvider method translates to BadHistoryOperationUnsupported in that slot; generic exceptions log and surface BadInternalError; unresolvable NodeIds get BadNodeIdUnknown. The batch continues unconditionally.
Aggregate mapping: MapAggregate translates ObjectIds.AggregateFunction_Average / Minimum / Maximum / Total / Count to the driver's HistoryAggregateType enum. Null for anything else (e.g. TimeAverage, Interpolative) so the handler surfaces BadAggregateNotSupported at the batch level — per Part 13, one unsupported aggregate means the whole request fails since ReadProcessedDetails carries one aggregate list for all nodes. BuildHistoryData wraps driver DataValueSnapshots as Opc.Ua.HistoryData in an ExtensionObject; BuildHistoryEvent wraps HistoricalEvents as Opc.Ua.HistoryEvent with the canonical BaseEventType field list (EventId, SourceName, Message, Severity, Time, ReceiveTime — the order OPC UA clients that didn't customize the SelectClause expect). ToDataValue preserves null SourceTimestamp (Galaxy historian rows often carry only ServerTimestamp) — synthesizing a SourceTimestamp would lie about actual sample time.
Two address-space changes were required to make the stack dispatch reach the per-kind hooks at all: (1) historized variables get AccessLevels.HistoryRead added to their AccessLevel byte — the base's early-gate check on (variable.AccessLevel & HistoryRead != 0) was rejecting requests before our override ever ran; (2) the driver-root folder gets EventNotifiers.HistoryRead | SubscribeToEvents so HistoryReadEvents can target it (the conventional pattern for alarm-history browse against a driver-owned object). Document the 'set both bits' requirement inline since it's not obvious from the surface API.
OpcHistoryReadResult alias: Opc.Ua.HistoryReadResult (service-layer per-node result) collides with Core.Abstractions.HistoryReadResult (driver-side samples + continuation point) by type name; the alias 'using OpcHistoryReadResult = Opc.Ua.HistoryReadResult' keeps the override signatures unambiguous and the test project applies the mirror pattern for its stub driver impl.
Tests — DriverNodeManagerHistoryMappingTests (12 new Category=Unit cases): MapAggregate translates each supported aggregate NodeId via reflection-backed theory (guards against the stack renaming AggregateFunction_* constants); returns null for unsupported NodeIds (TimeAverage) and null input; BuildHistoryData wraps samples with correct DataValues + SourceTimestamp preservation; BuildHistoryEvent emits the 6-element BaseEventType field list in canonical order (regression guard for a future 'respect the client's SelectClauses' change); null SourceName / Message translate to empty-string Variants (nullable-Variant refactor trap); ToDataValue preserves StatusCode + both timestamps; ToDataValue leaves SourceTimestamp at default when the snapshot omits it. HistoryReadIntegrationTests (5 new Category=Integration): drives a real OPC UA client Session.HistoryRead against a fake HistoryDriver through the running server. Covers raw round-trip (verifies per-node DataValue ordering + values); processed with Average aggregate (captures the driver's received aggregate + interval, asserting MapAggregate routed correctly); unsupported aggregate (TimeAverage → BadAggregateNotSupported); at-time (forwards the per-timestamp list); events (BaseEventType field list shape, SelectClauses populated to satisfy the stack's filter validator). Server.Tests Unit: 55 pass / 0 fail (43 prior + 12 new mapping). Server.Tests Integration: 14 pass / 0 fail (9 prior + 5 new history). Full solution build clean, 0 errors.
lmx-followups.md #1 updated to 'DONE (PRs 35 + 38)' with two explicit deferred items: continuation-point plumbing (driver returns null today so pass-through is fine) and per-SelectClause evaluation in HistoryReadEvents (clients with custom field selections get the canonical BaseEventType layout today).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 17:50:23 -04:00
Joseph Doherty
8adc8f5ab8 Phase 3 PR 37 — End-to-end live-stack Galaxy smoke test. Closes the code side of LMX follow-up #5; once OtOpcUaGalaxyHost is installed + started on the dev box, the suite exercises the full topology GalaxyProxyDriver in-process → named-pipe IPC → running OtOpcUaGalaxyHost Windows service → MxAccessGalaxyBackend → live MXAccess runtime → real deployed Galaxy objects. Never spawns the Host process itself — connects to the already-running service per project_galaxy_host_service.md, which is the only way to exercise the production COM-apartment + service-account + pipe-ACL configuration.
LiveStackConfig resolves the pipe name + per-install shared secret from two sources in order: OTOPCUA_GALAXY_PIPE + OTOPCUA_GALAXY_SECRET env vars first (for CI / benchwork overrides), then the service's per-process Environment registry values under HKLM\SYSTEM\CurrentControlSet\Services\OtOpcUaGalaxyHost (what Install-Services.ps1 writes at install time). Registry read requires the test host to run elevated on most boxes — the skip message says so explicitly so operators see the right remediation. Hard-coded secrets are deliberately avoided: the installer generates 32 fresh random bytes per install, a committed secret would diverge from production the moment the service is re-installed.
LiveStackFixture is an IAsyncLifetime that (1) runs AvevaPrerequisites.CheckAllAsync with CheckGalaxyHostPipe=true + CheckHistorian=false — produces a structured PrerequisiteReport whose SkipReason is the exact operator-facing 'here's what you need to fix' text, (2) resolves LiveStackConfig and surfaces a clear skip when the secret isn't discoverable, (3) instantiates GalaxyProxyDriver + calls InitializeAsync (the IPC handshake), capturing a skip with the exception detail + common-cause hints (secret mismatch, SID not in pipe ACL, Host's backend couldn't connect to ZB) rather than letting a NullRef cascade through every subsequent test. SkipIfUnavailable() translates the captured SkipReason into Assert.Skip at the top of every fact so tests read as cleanly-skipped with a visible reason, not silently-passed or crashed.
LiveStackSmokeTests (5 facts, Collection=LiveStack, Category=LiveGalaxy): Fixture_initialized_successfully (cheapest possible end-to-end assertion — if this passes, the IPC handshake worked); Driver_reports_Healthy_after_IPC_handshake (DriverHealth.State post-connect); DiscoverAsync_returns_at_least_one_variable_from_live_galaxy (captures every Variable() call from DiscoverAsync via CapturingAddressSpaceBuilder and asserts > 0 — zero here usually means the Host couldn't read ZB, the skip message names OTOPCUA_GALAXY_ZB_CONN to check); GetHostStatuses_reports_at_least_one_platform (IHostConnectivityProbe surface — zero means the probe loop hasn't fired or no Platform is deployed locally); Can_read_a_discovered_variable_from_live_galaxy (reads the first discovered attribute's full reference, asserts status != BadInternalError — Galaxy's Uncertain-quality-until-first-Engine-scan is intentionally NOT treated as failure since it depends on runtime state that varies across test runs). Read-only by design; writes need an agreed scratch tag to avoid mutating a process-critical attribute — deferred to a follow-up PR that reuses this fixture.
CapturingAddressSpaceBuilder is a minimal IAddressSpaceBuilder that flattens every Variable() call into a list so tests can inspect what discovery produced without booting the full OPC UA node-manager stack; alarm annotation + property calls are no-ops. Scoped private to the test class.
Galaxy.Proxy.Tests csproj gains a ProjectReference to Driver.Galaxy.TestSupport (PR 36) for AvevaPrerequisites. The NU1702 warning about the Host project being net48-referenced-by-net10 is pre-existing from the HostSubprocessParityTests — Proxy.Tests only needs the Host EXE path for that parity scenario, not type surface.
Test run on THIS machine (OtOpcUaGalaxyHost not yet installed): Skipped! Failed 0, Passed 0, Skipped 5 — each skip message includes the full prerequisites report pointing at the missing service. Once the service is installed + started (scripts\install\Install-Services.ps1), the 5 facts will execute against live Galaxy. Proxy.Tests Unit: 17 pass / 0 fail (unchanged — new tests are Category=LiveGalaxy, separate suite). Full Proxy build clean. Memory already captures the 'live tests run via already-running service, don't spawn' convention (project_galaxy_host_service.md).
lmx-followups.md #5 updated: status is 'IN PROGRESS' across PRs 36 + 37 with the explicit remaining work (install + start services, subscribe-and-receive, write round-trip).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:49:51 -04:00
Joseph Doherty
bf329b05d8 Phase 3 PR 35 — IHistoryProvider gains ReadAtTimeAsync + ReadEventsAsync; GalaxyProxyDriver implements both. Extends Core.Abstractions.IHistoryProvider with two new methods that round out the OPC UA Part 11 HistoryRead surface (HistoryReadAtTime + HistoryReadEvents are the last two modes not covered by the PR 19-era ReadRawAsync + ReadProcessedAsync) and wires GalaxyProxyDriver to call the existing PR-10/PR-11 IPC contracts the Host already implements.
Interface additions use C# default interface implementations that throw NotSupportedException — existing IHistoryProvider implementations keep compiling, only drivers whose backend carries the relevant capability override. This matches the 'capabilities are optional per driver' design already used by IHistoryProvider.ReadProcessedAsync's docs (Modbus / OPC UA Client drivers never had an event historian and the default-throw path lets callers see BadHistoryOperationUnsupported naturally). New HistoricalEvent record models one historian row (EventId, SourceName, EventTimeUtc + ReceivedTimeUtc — process vs historian-persist timestamps, Message, Severity mapped to OPC UA's 1-1000 range); HistoricalEventsResult pairs the event list with a continuation-point token for future batching. Both live in Core.Abstractions so downstream (Proxy, Host, Server) reference a single domain shape — no Shared-contract leak into the driver-facing interface.
GalaxyProxyDriver.ReadAtTimeAsync maps the domain DateTime[] to Unix-ms longs, calls CallAsync on the existing MessageKind.HistoryReadAtTimeRequest, and trusts the Host's one-sample-per-requested-timestamp contract (the Host pads with bad-quality snapshots for timestamps it can't interpolate; re-aligning on the Proxy side would duplicate the Host's interpolation policy logic). ReadEventsAsync does the same for HistoryReadEventsRequest; ToHistoricalEvent translates GalaxyHistoricalEvent (MessagePack-annotated, Unix-ms) to the domain record, explicitly tagging DateTimeKind.Utc on both timestamp fields so downstream serializers (JSON, OPC UA types) don't apply an unexpected local-time offset.
Tests — HistoricalEventMappingTests (3 new Proxy.Tests unit cases): every field maps correctly from wire to domain; null SourceName and null DisplayText preserve through the mapping (system events without a source come out with null so callers can distinguish them from alarm events); both timestamps come out as DateTimeKind.Utc (regression guard against a future refactor using DateTime.FromFileTimeUtc or similar that defaults to Unspecified). Driver.Galaxy.Proxy.Tests Unit suite: 17 pass / 0 fail (14 prior + 3 new). Full solution build clean, 0 errors.
Scope exclusions — DriverNodeManager HistoryRead service-handler wiring (on the OPC UA Server side, where HistoryReadAtTime and HistoryReadEvents service requests land) and the full-loop integration test (OPC UA client → server → IPC → Host → HistorianDataSource → back) are deferred to a focused follow-up PR. The capability surface is the load-bearing change; wiring the service handlers is mechanical in comparison and worth its own PR for reviewability. docs/v2/lmx-followups.md #1 updated with the split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:08:27 -04:00
Joseph Doherty
ef2a810b2d Phase 3 PR 34 — Host-status publisher (Server) + /hosts drill-down page (Admin). Closes LMX follow-up #7 by wiring together the data layer from PR 33. Server.HostStatusPublisher is a BackgroundService that walks every driver registered in DriverHost every 10 seconds, skips drivers that don't implement IHostConnectivityProbe, calls GetHostStatuses() on each probe-capable driver, and upserts one DriverHostStatus row per (NodeId, DriverInstanceId, HostName) into the central config DB. Upsert path: SingleOrDefaultAsync on the composite PK; if no row exists, Add a new one; if a row exists, LastSeenUtc advances unconditionally (heartbeat) and State + StateChangedUtc update only on transitions so Admin UI can distinguish 'still reporting, still Running' from 'freshly transitioned to Running'. MapState translates Core.Abstractions.HostState to Configuration.Enums.DriverHostState (intentional duplicate enum — Configuration project stays free of driver-runtime deps per PR 33's choice). If a driver's GetHostStatuses throws, log warning and skip that driver this tick — never take down the Server on a publisher failure. If the DB is unreachable, log warning + retry next heartbeat (no buffering — next tick's current-state snapshot is more useful than replaying stale transitions after a long outage). 2-second startup delay so NodeBootstrap's RegisterAsync calls land before the first publish tick, then tick runs immediately so a freshly-started Server surfaces its host topology in the Admin UI without waiting a full interval.
Polling chosen over event-driven for initial scope: simpler, matches Admin UI consumer cadence, avoids DriverHost lifecycle-event plumbing that doesn't exist today. Event-driven push for sub-heartbeat latency is a straightforward follow-up.
Admin.Services.HostStatusService left-joins DriverHostStatus against ClusterNode on NodeId so rows persist even when the ClusterNode entry doesn't exist yet (first-boot bootstrap case). StaleThreshold = 30s — covers one missed publisher heartbeat plus a generous buffer for clock skew and GC pauses. Admin Components/Pages/Hosts.razor — FleetAdmin-visible page grouped by cluster (handles the '(unassigned)' case for rows without a matching ClusterNode). Four summary cards (Hosts / Running / Stale / Faulted); per-cluster table with Node / Driver / Host / State + Stale-badge / Last-transition / Last-seen / Detail columns; 10s auto-refresh via IServiceScopeFactory timer pattern matching FleetStatusPoller + Fleet dashboard (PR 27). Row-class highlighting: Faulted → table-danger, Stale → table-warning, else default. State badge maps DriverHostState enum to bootstrap color classes. Sidebar link added between 'Fleet status' and 'Clusters'.
Server csproj adds Microsoft.EntityFrameworkCore.SqlServer 10.0.0 + registers OtOpcUaConfigDbContext in Program.cs scoped via NodeOptions.ConfigDbConnectionString (no Admin-style manual SQL raw — the DbContext is the only access path, keeps migrations owner-of-record).
Tests — HostStatusPublisherTests (4 new Integration cases, uses per-run throwaway DB matching the FleetStatusPollerTests pattern): publisher upserts one row per host from each probe-capable driver and skips non-probe drivers; second tick advances LastSeenUtc without creating duplicate rows (upsert pattern verified end-to-end); state change between ticks updates State AND StateChangedUtc (datetime2(3) rounds to millisecond precision so comparison uses 1ms tolerance — documented inline); MapState translates every HostState enum member. Server.Tests Integration: 4 new tests pass. Admin build clean, Admin.Tests Unit still 23 / 0. docs/v2/lmx-followups.md item #7 marked DONE with three explicit deferred items (event-driven push, failure-count column, SignalR fan-out).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:51:55 -04:00
Joseph Doherty
2f00c74bbb Phase 3 PR 32 — Multi-driver integration test. Closes LMX follow-up #6 with Server.Tests/MultipleDriverInstancesIntegrationTests.cs: registers two StubDriver instances (alpha + beta) with distinct DriverInstanceIds on one DriverHost, boots the full OpcUaApplicationHost, and exercises three behaviors end-to-end via a real OPC UA client session. (1) Each driver's namespace URI resolves to a distinct index in the client's NamespaceUris (alpha → urn:OtOpcUa:alpha, beta → urn:OtOpcUa:beta) — proves DriverNodeManager's namespaceUris-per-driver base-ctor wiring actually lands two separate INodeManager registrations. (2) Browsing one subtree returns only that driver's folder; the other driver's folder does NOT leak into the wrong subtree. This is the test that catches a cross-driver routing regression the v1 single-driver code path couldn't surface — if a future refactor flattens both drivers into a shared namespace, the 'shouldNotContain' assertion fails cleanly. (3) Reads route to the owning driver by namespace — alpha's ReadAsync returns 42 while beta's returns 99; a misroute would surface as 99 showing up on an alpha node id or vice versa. StubDriver is parameterized on (DriverInstanceId, folderName, readValue) so the same class constructs both instances without copy-paste.
No production code changes — pure additive test. Server.Tests Integration: 3 new tests pass; existing OpcUaServerIntegrationTests stays green (single-driver case still exercised there). Full Server.Tests Unit still 43 / 0. Deferred: multi-driver alarm-event case (two drivers each raising a GalaxyAlarmEvent, assert each condition lands on its owning instance's condition node) — needs a stub IAlarmSource and is worth its own focused PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:29:49 -04:00
Joseph Doherty
4886a5783f Phase 3 PR 31 — Live-LDAP integration test + Active Directory compatibility. Closes LMX follow-up #4 with 6 live-bind tests in Server.Tests/LdapUserAuthenticatorLiveTests.cs against the dev GLAuth instance at localhost:3893 (skipped cleanly when unreachable via Assert.Skip + a clear SkipReason — matches the GalaxyRepositoryLiveSmokeTests pattern). Coverage: valid credentials bind + surface DisplayName; wrong password fails; unknown user fails; empty credentials fail pre-flight without touching the directory; writeop user's memberOf maps through GroupToRole to WriteOperate (the exact string WriteAuthzPolicy.IsAllowed expects); admin user surfaces all four mapped roles (WriteOperate + WriteTune + WriteConfigure + AlarmAck) proving memberOf parsing doesn't stop after the first match. While wiring this up, the authenticator's hard-coded user-lookup filter 'uid=<name>' didn't match GLAuth (which keys users by cn and doesn't populate uid) — AND it doesn't match Active Directory either, which uses sAMAccountName. Added UserNameAttribute to LdapOptions (default 'uid' for RFC 2307 backcompat) so deployments override to 'cn' / 'sAMAccountName' / 'userPrincipalName' as the directory requires; authenticator filter now interpolates the configured attribute. The default stays 'uid' so existing test fixtures and OpenLDAP installs keep working without a config change — a regression guard in LdapUserAuthenticatorAdCompatTests.LdapOptions_default_UserNameAttribute_is_uid_for_rfc2307_compat pins this so a future 'helpful' default change can't silently break anyone.
Active Directory compatibility. LdapOptions xml-doc expanded with a cheat-sheet covering Server (DC FQDN), Port 389 vs 636, UseTls=true under AD LDAP-signing enforcement, dedicated read-only service account DN, sAMAccountName vs userPrincipalName vs cn trade-offs, memberOf DN shape (CN=Group,OU=...,DC=... with the CN= RDN stripped to become the GroupToRole key), and the explicit 'nested groups NOT expanded' call-out (LDAP_MATCHING_RULE_IN_CHAIN / tokenGroups is a future authenticator enhancement, not a config change). docs/security.md §'Active Directory configuration' adds a complete appsettings.json snippet with realistic AD group names (OPCUA-Operators → WriteOperate, OPCUA-Engineers → WriteConfigure, OPCUA-AlarmAck → AlarmAck, OPCUA-Tuners → WriteTune), LDAPS port 636, TLS on, insecure-LDAP off, and operator-facing notes on each field. LdapUserAuthenticatorAdCompatTests (5 unit guards): ExtractFirstRdnValue parses AD-style 'CN=OPCUA-Operators,OU=...,DC=...' DNs correctly (case-preserving — operators' GroupToRole keys stay readable); also handles mixed case and spaces in group names ('Domain Users'); also works against the OpenLDAP ou=<group>,ou=groups shape (GLAuth) so one extractor tolerates both memberOf formats common in the field; EscapeLdapFilter escapes the RFC 4515 injection set (\, *, (, ), \0) so a malicious login like 'admin)(cn=*' can't break out of the filter; default UserNameAttribute regression guard.
Test posture — Server.Tests Unit: 43 pass / 0 fail (38 prior + 5 new AD-compat guards). Server.Tests LiveLdap category: 6 pass / 0 fail against running GLAuth (would skip cleanly without). Server build clean, 0 errors, 0 warnings.
Deferred: the session-identity end-to-end check (drive a full OPC UA UserName session, then read a 'whoami' node to verify the role landed on RoleBasedIdentity). That needs a test-only address-space node and is scoped for a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:23:22 -04:00
Joseph Doherty
cb7b81a87a Phase 3 PR 30 — Modbus integration-test project scaffold. New tests/ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests project is the harness modbus-test-plan.md called for: a skip-when-unreachable fixture that TCP-probes a Modbus simulator endpoint (MODBUS_SIM_ENDPOINT, default localhost:502) once per test session, a DL205 device profile stub (single writable holding register at address 100, probe disabled to avoid racing with assertions), and one happy-path smoke test that initializes the real ModbusDriver + real ModbusTcpTransport, writes a known Int16 value, reads it back, and asserts status=0 + value round-trip. No DL205 quirk assertions yet — those land one-per-PR as the user validates each behavior in ModbusPal (word order for 32-bit, register-zero access, coil addressing base, max registers per FC03, response framing under load, exception code on protected-bit coil write).
ModbusSimulatorFixture is a collection fixture so the 2s TCP probe runs once per run, not per test; SkipReason gets a clear operator-facing message ('start ModbusPal or override MODBUS_SIM_ENDPOINT'). Tests call Assert.Skip(sim.SkipReason) rather than silently returning — matches the test-plan convention and reads cleanly in CI logs. DL205Profile.BuildOptions deliberately disables the background probe loop since integration tests drive reads explicitly and the probe would race with assertions. Tag naming uses the DL205_ prefix so filter 'DisplayName~DL205' surfaces device-specific failures at a glance.
Project references: xunit.v3 + Shouldly + Microsoft.NET.Test.Sdk + xunit.runner.visualstudio (matches the existing Driver.Modbus.Tests unit project), project ref to src/Driver.Modbus. Registered in ZB.MOM.WW.OtOpcUa.slnx under tests/. ModbusPal/README.md documents the dev loop (install ModbusPal jar, load profile, start simulator, dotnet test), explains MODBUS_SIM_ENDPOINT override for real-PLC benchwork, and flags DL205.xmpp as the first profile to add in a follow-up PR.
dotnet test run against the scaffold (no simulator running) skips cleanly: 0 failed, 0 passed, 1 skipped, with the SkipReason surfaced. dotnet build clean (0 warnings, 0 errors). Updated docs/v2/modbus-test-plan.md to mark the scaffold PR done and renumbered future PRs from 'PR 27+' to 'PR 31+' to stay in sync with the actual PR chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:02:39 -04:00
Joseph Doherty
ed88835d34 Phase 3 PR 28 — Admin UI cert-trust management page. New /certificates route (FleetAdmin-only) surfaces the OPC UA server's PKI store rejected + trusted certs and gives operators Trust / Delete / Revoke actions so rejected client certs can be promoted without touching disk. CertTrustService reads $PkiStoreRoot/{rejected,trusted}/certs/*.der files directly via X509CertificateLoader — no Opc.Ua dependency in the Admin project, which keeps the Admin host runnable on a machine that doesn't have the full Server install locally (only needs the shared PKI directory reachable; typical deployment has Admin + Server side-by-side on the same box and PkiStoreRoot defaults match so a plain-vanilla install needs no override). CertTrustOptions bound from the Admin's 'CertTrust:PkiStoreRoot' section, default %ProgramData%\OtOpcUa\pki (matches OpcUaServerOptions.PkiStoreRoot default). Trust action moves the .der from rejected/certs/ to trusted/certs/ via File.Move(overwrite:true) — idempotent, tolerates a concurrent operator doing the same move. Delete wipes the file. Revoke removes from trusted/certs/ (Opc.Ua re-reads the Directory store on each new client handshake, so no explicit reload signal is needed; operators retry the rejected connection after trusting). Thumbprint matching is case-insensitive because X509Certificate2.Thumbprint is upper-case hex but operators copy-paste from logs that sometimes lowercase it. Malformed files in the store are logged + skipped — a single bad .der can't take the whole management page offline. Missing store directories produce empty lists rather than exceptions so a pristine install (Server never run yet, no rejected/trusted dirs yet) doesn't crash the page.
Razor page layout: two tables (Rejected / Trusted) with Subject / Issuer / Thumbprint / Valid-window / Actions columns, status banner after each action with success or warning kind ('file missing' = another admin handled it), FleetAdmin-only via [Authorize(Roles=AdminRoles.FleetAdmin)]. Each action invokes LogActionAsync which Serilog-logs the authenticated admin user + thumbprint + action for an audit trail — DB-level ConfigAuditLog persistence is deferred because its schema is cluster-scoped and cert actions are cluster-agnostic; Serilog + CertTrustService's filesystem-op info logs give the forensic trail in the meantime. Sidebar link added to MainLayout between Reservations and the future Account page.
Tests — CertTrustServiceTests (9 new unit cases): ListRejected parses Subject + Thumbprint + store kind from a self-signed test cert written into rejected/certs/; rejected and trusted stores are kept separate; TrustRejected moves the file and the Rejected list is empty afterwards; TrustRejected with a thumbprint not in rejected returns false without touching trusted; DeleteRejected removes the file; UntrustCert removes from trusted only; thumbprint match is case-insensitive (operator UX); missing store directories produce empty lists instead of throwing DirectoryNotFoundException (pristine-install tolerance); a junk .der in the store is logged + skipped and the valid certs still surface (one bad file doesn't break the page). Full Admin.Tests Unit suite: 23 pass / 0 fail (14 prior + 9 new). Full Admin build clean — 0 errors, 0 warnings.
lmx-followups.md #3 marked DONE with a cross-reference to this PR and a note that flipping AutoAcceptUntrustedClientCertificates to false as the production default is a deployment-config follow-up, not a code gap — the Admin UI is now ready to be the trust gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:37:55 -04:00
Joseph Doherty
6b04a85f86 Phase 3 PR 26 — server-layer write authorization gating by role. Per the user's ACL-at-server-layer directive (saved as feedback_acl_at_server_layer.md in memory), write authorization is enforced in DriverNodeManager.OnWriteValue and never delegated to the driver or to driver-specific auth (the v1 Galaxy-provided security path is explicitly not part of v2 — drivers report SecurityClassification as discovery metadata only). New WriteAuthzPolicy static class in Server/Security/ maps SecurityClassification → required role per the table documented in docs/Configuration.md: FreeAccess = no role required (anonymous sessions can write), Operate + SecuredWrite = WriteOperate, Tune = WriteTune, VerifiedWrite + Configure = WriteConfigure, ViewOnly = deny regardless of roles. Role matching is case-insensitive and role requirements do NOT cascade — a session with WriteConfigure can write Configure attributes but needs WriteOperate separately to write Operate attributes; this is deliberate so escalation is an explicit LDAP group assignment, not a hierarchy the policy silently grants. DriverNodeManager gains a _securityByFullRef Dictionary populated during Variable() registration (parallel to the existing _variablesByFullRef) so OnWriteValue can look up the classification in O(1) on the hot path. OnWriteValue casts the session's context.UserIdentity to the new IRoleBearer interface (implemented by OtOpcUaServer.RoleBasedIdentity from PR 19) — empty Roles collection when the session is anonymous; the same WriteAuthzPolicy.IsAllowed check then either short-circuits true (FreeAccess), false (ViewOnly), or walks the roles list looking for the required one. On deny, OnWriteValue logs 'Write denied for {FullRef}: classification=X userRoles=[...]' at Information level (readable trail for operator complaints) and returns BadUserAccessDenied without touching IWritable.WriteAsync — drivers never see a request we'd have refused. IRoleBearer kept as a minimal server-side interface rather than reusing some abstraction from Core.Abstractions because the concept is OPC-UA-session-scoped and doesn't generalize (the driver side has no notion of a user session). Tests — WriteAuthzPolicyTests (17 new cases): FreeAccess allows write with empty role set + arbitrary roles; ViewOnly denies write even with every role; Operate requires WriteOperate; role match is case-insensitive; Operate denies empty role set + wrong role; SecuredWrite shares Operate's requirement; Tune requires WriteTune; Tune denies WriteOperate-only (asserts roles don't cascade — this is the test that catches a future regression where someone 'helpfully' adds a role-escalation table); Configure requires WriteConfigure; VerifiedWrite shares Configure's requirement; multi-role session allowed when any role matches; unrelated roles denied; RequiredRole theory covering all 5 classified-and-mapped rows + null for FreeAccess/ViewOnly special cases. lmx-followups.md follow-up #2 marked DONE with a back-reference to this PR and the memory note. Full Server.Tests Unit suite: 38 pass / 0 fail (17 new WriteAuthz + 14 SecurityConfiguration from PR 19 + 2 NodeBootstrap + 5 others). Server.Tests Integration (Category=Integration) 2 pass — existing PR 17 anonymous-endpoint smoke tests stay green since the read path doesn't hit OnWriteValue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:01:01 -04:00
Joseph Doherty
77d09bf64e Phase 3 PR 25 — modbus-test-plan.md: integration-test playbook with per-device quirk catalog. ModbusPal is the chosen simulator; AutomationDirect DL205 is the first target device class with 6 pending quirks to document and cover with named tests (word order for 32-bit values, register-zero access policy, coil addressing base, maximum registers per FC03, response framing under sustained load, exception code on protected-bit coil write). Each quirk placeholder has a proposed test name so the user's validation work translates directly into integration tests. Test conventions section codifies the named-per-quirk pattern, skip-when-unreachable guard, real ModbusTcpTransport usage, and inter-test isolation. Sets up the harness-and-catalog structure future device families (Allen-Bradley Micrologix, Siemens S7-1200 Modbus gateway, Schneider M340, whatever the user hits) will slot into — same per-device catalog shape, cross-device patterns section for recurring quirks that can get promoted into driver defaults. Next concrete PRs proposed: PR 26 for the integration test project scaffold + DL205 profile + fixture with skip-guard + one smoke test, PR 27+ for the individual confirmed quirks one-per-PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:45:21 -04:00
Joseph Doherty
860deb8e0d Phase 3 PR 20 — lmx-followups.md: track remaining Galaxy-bridge tasks after PR 19 (HistoryReadAtTime/Events Proxy wiring, write-gating by role, Admin UI cert trust, live-LDAP integration test, full-stack Galaxy smoke, multi-driver test, per-host dashboard). Documents what each item depends on, the shipped surface it builds on, and the minimal to-do so a future session can pick any one off in isolation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:43:15 -04:00
Joseph Doherty
caa9cb86f6 Phase 2 PR 4 — close the 4 open high/medium MXAccess findings from exit-gate-phase-2-final.md. High 1 (ReadAsync subscription-leak on cancel): the one-shot read now wraps subscribe→first-OnDataChange→unsubscribe in try/finally so the per-tag callback is always detached, and if the read installed the underlying MXAccess subscription itself (the prior _addressToHandle key was absent) it tears it down on the way out — no leaked probe item handles when the caller cancels or times out. High 2 (no reconnect loop): MxAccessClient gets a MxAccessClientOptions {AutoReconnect, MonitorInterval=5s, StaleThreshold=60s} + a background MonitorLoopAsync started at first ConnectAsync. The loop wakes every MonitorInterval, checks _lastObservedActivityUtc (bumped by every OnDataChange callback), and if stale probes the proxy with a no-op COM AddItem("$Heartbeat") on the StaPump; if the probe throws or returns false, the loop reconnects-with-replay — Unregister (best-effort), Register, snapshot _addressToHandle.Keys + clear, re-AddItem every previously-active subscription, ConnectionStateChanged events fire for the false→true transition, ReconnectCount bumps. Medium 3 (subscriptions don't push frames back to Proxy): IGalaxyBackend gains OnDataChange/OnAlarmEvent/OnHostStatusChanged events; new IFrameHandler.AttachConnection(FrameWriter) is called per-connection by PipeServer after Hello + the returned IDisposable disposes at connection close; GalaxyFrameHandler.ConnectionSink subscribes the events for the connection lifetime, fire-and-forget pushes them as MessageKind.OnDataChangeNotification / AlarmEvent / RuntimeStatusChange frames through the writer, swallows ObjectDisposedException for the dispose race, and unsubscribes in Dispose to prevent leaked invocation list refs across reconnects. MxAccessGalaxyBackend's existing SubscribeAsync (which previously discarded values via a (_, __) => {} callback) now wires OnTagValueChanged that fans out per-tag value changes to every subscription ID listening (one MXAccess subscription, multi-fan-out — _refToSubs reverse map). UnsubscribeAsync also reverse-walks the map to only call mx.UnsubscribeAsync when the LAST sub for a tag drops. Stub + DbBacked backends declare the events with #pragma warning disable CS0067 because they never raise them but must satisfy the interface (treat-warnings-as-errors would otherwise fail). Medium 4 (WriteValuesAsync doesn't await OnWriteComplete): MxAccessClient.WriteAsync rewritten to return Task<bool> via the v1-style TaskCompletionSource-keyed-by-item-handle pattern in _pendingWrites — adds the TCS before the Write call, awaits it with a configurable timeout (default 5s), removes the TCS in finally, returns true only when OnWriteComplete reported success. MxAccessGalaxyBackend.WriteValuesAsync now reports per-tag Bad_InternalError ("MXAccess runtime reported write failure") when the bool returns false, instead of false-positive Good. PipeServer's IFrameHandler interface adds the AttachConnection(FrameWriter):IDisposable method + a public NoopAttachment nested class (net48 doesn't support default interface methods so the empty-attach is exposed for stub implementations). StubFrameHandler returns IFrameHandler.NoopAttachment.Instance. RunOneConnectionAsync calls AttachConnection after HelloAck and usings the returned disposable so it disposes at the connection scope's finally. ConnectionStateChanged event added on MxAccessClient (caller-facing diagnostics for false→true reconnect transitions). docs/v2/implementation/pr-4-body.md is the Gitea web-UI paste-in for opening PR 4 once pushed; includes 2 new low-priority adversarial findings (probe item-handle leak; replay-loop silently swallows per-subscription failures) flagged as follow-ups not PR 4 blockers. Full solution 460 pass / 7 skip (E2E on admin shell) / 1 pre-existing Phase 0 baseline. No regressions vs PR 2's baseline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 01:12:09 -04:00
Joseph Doherty
a3d16a28f1 Phase 2 Stream D Option B — archive v1 surface + new Driver.Galaxy.E2E parity suite. Non-destructive intermediate state: the v1 OtOpcUa.Host + Historian.Aveva + Tests + IntegrationTests projects all still build (494 v1 unit + 6 v1 integration tests still pass when run explicitly), but solution-level dotnet test ZB.MOM.WW.OtOpcUa.slnx now skips them via IsTestProject=false on the test projects + archive-status PropertyGroup comments on the src projects. The destructive deletion is reserved for Phase 2 PR 3 with explicit operator review per CLAUDE.md "only use destructive operations when truly the best approach". tests/ZB.MOM.WW.OtOpcUa.Tests/ renamed via git mv to tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive/; csproj <AssemblyName> kept as the original ZB.MOM.WW.OtOpcUa.Tests so v1 OtOpcUa.Host's [InternalsVisibleTo("ZB.MOM.WW.OtOpcUa.Tests")] still matches and the project rebuilds clean. tests/ZB.MOM.WW.OtOpcUa.IntegrationTests gets <IsTestProject>false</IsTestProject>. src/ZB.MOM.WW.OtOpcUa.Host + src/ZB.MOM.WW.OtOpcUa.Historian.Aveva get PropertyGroup archive-status comments documenting they're functionally superseded but kept in-build because cascading dependencies (Historian.Aveva → Host; IntegrationTests → Host) make a single-PR deletion high blast-radius. New tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/ project (.NET 10) with ParityFixture that spawns OtOpcUa.Driver.Galaxy.Host.exe (net48 x86) as a Process.Start subprocess with OTOPCUA_GALAXY_BACKEND=db env vars, awaits 2s for the PipeServer to bind, then exposes a connected GalaxyProxyDriver; skips on non-Windows / Administrator shells (PipeAcl denies admins per decision #76) / ZB unreachable / Host EXE not built — each skip carries a SkipReason string the test method reads via Assert.Skip(SkipReason). RecordingAddressSpaceBuilder captures every Folder/Variable/AddProperty registration so parity tests can assert on the same shape v1 LmxNodeManager produced. HierarchyParityTests (3) — Discover returns gobjects with attributes; attribute full references match the tag.attribute Galaxy reference grammar; HistoryExtension flag flows through correctly. StabilityFindingsRegressionTests (4) — one test per 2026-04-13 stability finding from commits c76ab8f and 7310925: phantom probe subscription doesn't corrupt unrelated host status; HostStatusChangedEventArgs structurally carries a specific HostName + OldState + NewState (event signature mathematically prevents the v1 cross-host quality-clear bug); all GalaxyProxyDriver capability methods return Task or Task<T> (sync-over-async would deadlock OPC UA stack thread); AcknowledgeAsync completes before returning (no fire-and-forget background work that could race shutdown). Solution test count: 470 pass / 7 skip (E2E on admin shell) / 1 pre-existing Phase 0 baseline. Run archived suites explicitly: dotnet test tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive (494 pass) + dotnet test tests/ZB.MOM.WW.OtOpcUa.IntegrationTests (6 pass). docs/v2/V1_ARCHIVE_STATUS.md inventories every archived surface with run-it-explicitly instructions + a 10-step deletion plan for PR 3 + rollback procedure (git revert restores all four projects). docs/v2/implementation/exit-gate-phase-2-final.md supersedes the two partial-exit docs with the per-stream status table (A/B/C/D/E all addressed, D split across PR 2/3 per safety protocol), the test count breakdown, fresh adversarial review of PR 2 deltas (4 new findings: medium IsTestProject=false safety net loss, medium structural-vs-behavioral stability tests, low backend=db default, low Process.Start env inheritance), the 8 carried-forward findings from exit-gate-phase-2.md, the recommended PR order (1 → 2 → 3 → 4). docs/v2/implementation/pr-2-body.md is the Gitea web-UI paste-in for opening PR 2 once pushed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:56:21 -04:00
Joseph Doherty
50f81a156d Doc — PR 1 body for Gitea web UI paste-in. PR title + summary + test matrix + reviewer test plan + follow-up tracking. Source phase-1-configuration → target v2; URL https://gitea.dohertylan.com/dohertj2/lmxopcua/pulls/new/phase-1-configuration. No gh/tea CLI on this box, so the body is staged here for the operator to paste into the Gitea web UI rather than auto-created via API.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:46:23 -04:00
Joseph Doherty
7403b92b72 Phase 2 Stream D progress — non-destructive deliverables: appsettings → DriverConfig migration script, two-service Windows installer scripts, process-spawn cross-FX parity test, Stream D removal procedure doc with both Option A (rewrite 494 v1 tests) and Option B (archive + new v2 E2E suite) spelled out step-by-step. Cannot one-shot the actual legacy-Host deletion in any unattended session — explained in the procedure doc; the parity-defect debug cycle is intrinsically interactive (each iteration requires inspecting a v1↔v2 diff and deciding if it's a legitimate v2 improvement or a regression, then either widening the assertion or fixing the v2 code), and git rm -r src/ZB.MOM.WW.OtOpcUa.Host is destructive enough to need explicit operator authorization on a real PR review. scripts/migration/Migrate-AppSettings-To-DriverConfig.ps1 takes a v1 appsettings.json and emits the v2 DriverInstance.DriverConfig JSON blob (MxAccess/Database/Historian sections) ready to upsert into the central Configuration DB; null-leaf stripping; -DryRun mode; smoke-tested against the dev appsettings.json and produces the expected three-section ordered-dictionary output. scripts/install/Install-Services.ps1 registers the two v2 services with sc.exe — OtOpcUaGalaxyHost first (net48 x86 EXE with OTOPCUA_GALAXY_PIPE/OTOPCUA_ALLOWED_SID/OTOPCUA_GALAXY_SECRET/OTOPCUA_GALAXY_BACKEND/OTOPCUA_GALAXY_ZB_CONN/OTOPCUA_GALAXY_CLIENT_NAME env vars set via HKLM:\SYSTEM\CurrentControlSet\Services\OtOpcUaGalaxyHost\Environment registry), then OtOpcUa with depend=OtOpcUaGalaxyHost; resolves down-level account names to SID for the IPC ACL; generates a fresh 32-byte base64 shared secret per install if not supplied (kept out of registry — operators record offline for service rebinding scenarios); echoes start commands. scripts/install/Uninstall-Services.ps1 stops + removes both services. tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy.Tests/HostSubprocessParityTests.cs is the production-shape parity test — Proxy (.NET 10) spawns the actual OtOpcUa.Driver.Galaxy.Host.exe (net48 x86) as a subprocess via Process.Start with backend=db env vars, connects via real named pipe, calls Discover, asserts at least one Galaxy gobject comes back. Skipped when running as Administrator (PipeAcl denies admins, same guard as other IPC integration tests), when the Host EXE hasn't been built, or when the ZB SQL endpoint is unreachable. This is the cross-FX integration that the parity suite genuinely needs — the previous IPC tests all ran in-process; this one validates the production deployment topology where Proxy and Host are separate processes communicating only over the named pipe. docs/v2/implementation/stream-d-removal-procedure.md is the next-session playbook: Option A (rewrite 494 v1 tests via a ProxyMxAccessClientAdapter that implements v1's IMxAccessClient by forwarding to GalaxyProxyDriver — Vtq↔DataValueSnapshot, Quality↔StatusCode, OnTagValueChanged↔OnDataChange mapping; 3-5 days, full coverage), Option B (rename OtOpcUa.Tests → OtOpcUa.Tests.v1Archive with [Trait("Category", "v1Archive")] for opt-in CI runs; new OtOpcUa.Driver.Galaxy.E2E test project with 10-20 representative tests via the HostSubprocessParityTests pattern; 1-2 days, accreted coverage); deletion checklist with eight pre-conditions, ten ordered steps, and a rollback path (git revert restores the legacy Host alongside the v2 stack — both topologies remain installable until the downstream consumer cutover). Full solution 964 pass / 1 pre-existing Phase 0 baseline; the 494 v1 IntegrationTests + 6 v1 IntegrationTests-net48 still pass because legacy OtOpcUa.Host stays untouched until an interactive session executes the procedure doc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:38:44 -04:00
Joseph Doherty
a7126ba953 Phase 2 — port MXAccess COM client to Galaxy.Host + MxAccessGalaxyBackend (3rd IGalaxyBackend) + live MXAccess smoke + Phase 2 exit-gate doc + adversarial review. The full Galaxy data-plane now flows through the v2 IPC topology end-to-end against live ArchestrA.MxAccess.dll, on this dev box, with 30/30 Host tests + 9/9 Proxy tests + 963/963 solution tests passing alongside the unchanged 494 v1 IntegrationTests baseline. Backend/MxAccess/Vtq is a focused port of v1's Vtq value-timestamp-quality DTO. Backend/MxAccess/IMxProxy abstracts LMXProxyServer (port of v1's IMxProxy with the same Register/Unregister/AddItem/RemoveItem/AdviseSupervisory/UnAdviseSupervisory/Write surface + OnDataChange + OnWriteComplete events); MxProxyAdapter is the concrete COM-backed implementation that does Marshal.ReleaseComObject-loop on Unregister, must be constructed on an STA thread. Backend/MxAccess/MxAccessClient is the focused port of v1's MxAccessClient partials — Connect/Disconnect/Read/Write/Subscribe/Unsubscribe through the new Sta/StaPump (the real Win32 GetMessage pump from the previous commit), ConcurrentDictionary handle tracking, OnDataChange event marshalling to per-tag callbacks, ReadAsync implemented as the canonical subscribe → first-OnDataChange → unsubscribe one-shot pattern. Galaxy.Host csproj flipped to x86 PlatformTarget + Prefer32Bit=true with the ArchestrA.MxAccess HintPath ..\..\lib\ArchestrA.MxAccess.dll reference (lib/ already contains the production DLL). Backend/MxAccessGalaxyBackend is the third IGalaxyBackend implementation (alongside StubGalaxyBackend and DbBackedGalaxyBackend): combines GalaxyRepository (Discover) with MxAccessClient (Read/Write/Subscribe), MessagePack-deserializes inbound write values, MessagePack-serializes outbound read values into ValueBytes, decodes ArrayDimension/SecurityClassification/category_id with the same v1 mapping. Program.cs selects between stub|db|mxaccess via OTOPCUA_GALAXY_BACKEND env var (default = mxaccess); OTOPCUA_GALAXY_ZB_CONN overrides the ZB connection string; OTOPCUA_GALAXY_CLIENT_NAME sets the Wonderware client identity; the StaPump and MxAccessClient lifecycles are tied to the server.RunAsync try/finally so a clean Ctrl+C tears down the COM proxy via Marshal.ReleaseComObject before the pump's WM_QUIT. Live MXAccess smoke tests (MxAccessLiveSmokeTests, net48 x86) — skipped when ZB unreachable or aaBootstrap not running, otherwise verify (1) MxAccessClient.ConnectAsync returns a positive LMXProxyServer handle on the StaPump, (2) MxAccessGalaxyBackend.OpenSession + Discover returns at least one gobject with attributes, (3) MxAccessGalaxyBackend.ReadValues against the first discovered attribute returns a response with the correct TagReference shape (value + quality vary by what's running, so we don't assert specific values). All 3 pass on this dev box. EndToEndIpcTests + IpcHandshakeIntegrationTests moved from Galaxy.Proxy.Tests (net10) to Galaxy.Host.Tests (net48 x86) — the previous test placement silently dropped them at xUnit discovery because Host became net48 x86 and net10 process can't load it. Rewritten to use Shared's FrameReader/FrameWriter directly instead of going through Proxy's GalaxyIpcClient (functionally equivalent — same wire protocol, framing primitives + dispatcher are the production code path verbatim). 7 IPC tests now run cleanly: Hello+heartbeat round-trip, wrong-secret rejection, OpenSession session-id assignment, Discover error-response surfacing, WriteValues per-tag bad status, Subscribe id assignment, Recycle grace window. Phase 2 exit-gate doc (docs/v2/implementation/exit-gate-phase-2.md) supersedes the partial-exit doc with the as-built state — Streams A/B/C complete; D/E gated only on the legacy-Host removal + parity-test rewrite cycle that fundamentally requires multi-day debug iteration; full adversarial-review section ranking 8 findings (2 high, 3 medium, 3 low) all explicitly deferred to Stream D/E or v2.1 with rationale; Stream-D removal checklist gives the next-session entry point with two policy options for the 494 v1 tests (rewrite-to-use-Proxy vs archive-and-write-smaller-v2-parity-suite). Cannot one-shot Stream D.1 in any single session because deleting OtOpcUa.Host requires the v1 IntegrationTests cycle to be retargeted first; that's the structural blocker, not "needs more code" — and the plan itself budgets 3-4 weeks for it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:23:24 -04:00
Joseph Doherty
32eeeb9e04 Phase 2 Streams A+B+C feature-complete — real Win32 pump, all 9 IDriver capabilities, end-to-end IPC dispatch. Streams D+E remain (Galaxy MXAccess code lift + parity-debug cycle, plan-budgeted 3-4 weeks). The 494 v1 IntegrationTests still pass — legacy OtOpcUa.Host untouched. StaPump replaces the BlockingCollection placeholder with a real Win32 message pump lifted from v1 StaComThread per CLAUDE.md "Reference Implementation": dedicated STA Thread with SetApartmentState(STA), GetMessage/PostThreadMessage/PeekMessage/TranslateMessage/DispatchMessage/PostQuitMessage P/Invoke, WM_APP=0x8000 for work-item dispatch, WM_APP+1 for graceful-drain → PostQuitMessage, peek-pm-noremove on entry to force the system to create the thread message queue before signalling Started, IsResponsiveAsync probe still no-op-round-trips through PostThreadMessage so the wedge detection works against the real pump. Concurrent ConcurrentQueue<WorkItem> drains on every WM_APP; fault path on dispose drains-and-faults all pending work-item TCSes with InvalidOperationException("STA pump has exited"). All three StaPumpTests pass against the real pump (apartment state STA, healthy probe true, wedged probe false). GalaxyProxyDriver now implements every Phase 2 Stream C capability — IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, IAlarmSource, IHistoryProvider, IRediscoverable, IHostConnectivityProbe — each forwarding through the matching IPC contract. ReadAsync preserves request order even when the Host returns out-of-order values; WriteAsync MessagePack-serializes the value into ValueBytes; SubscribeAsync wraps SubscriptionId in a GalaxySubscriptionHandle record; UnsubscribeAsync uses the new SendOneWayAsync helper on GalaxyIpcClient (fire-and-forget but still gated through the call-semaphore so it doesn't interleave with CallAsync); AlarmSubscribe is one-way and the Host pushes events back via OnAlarmEvent; ReadProcessedAsync short-circuits to NotSupportedException (Galaxy historian only does raw); IRediscoverable's OnRediscoveryNeeded fires when the Host pushes a deploy-watermark notification; IHostConnectivityProbe.GetHostStatuses() snapshots and OnHostStatusChanged fires on Running↔Stopped/Faulted transitions, with IpcHostConnectivityStatus aliased to disambiguate from the Core.Abstractions namespace's same-named type. Internal RaiseDataChange/RaiseAlarmEvent/RaiseRediscoveryNeeded/OnHostConnectivityUpdate methods are the entry points the IPC client will invoke when push frames arrive. Host side: new Backend/IGalaxyBackend interface defines the seam between IPC dispatch and the live MXAccess code (so the dispatcher is unit-testable against an in-memory mock without needing live Galaxy); Backend/StubGalaxyBackend returns success for OpenSession/CloseSession/Subscribe/Unsubscribe/AlarmSubscribe/AlarmAck/Recycle and a recognizable "stub: MXAccess code lift pending (Phase 2 Task B.1)"-tagged error for Discover/ReadValues/WriteValues/HistoryRead — keeps the IPC end-to-end testable today and gives the parity team a clear seam to slot the real implementation into; Ipc/GalaxyFrameHandler is the new real dispatcher (replaces StubFrameHandler in Program.cs) — switch on MessageKind, deserialize the matching contract, await backend method, write the response (one-way for Unsubscribe/AlarmSubscribe/AlarmAck/CloseSession), heartbeat handled inline so liveness still works if the backend is sick, exceptions caught and surfaced as ErrorResponse with code "handler-exception" so the Proxy raises GalaxyIpcException instead of disconnecting. End-to-end IPC integration test (EndToEndIpcTests) drives every operation through the full stack — Initialize → Read → Write → Subscribe → Unsubscribe → SubscribeAlarms → AlarmAck → ReadRaw → ReadProcessed (short-circuit) — proving the wire protocol, dispatcher, capability forwarding, and one-way semantics agree end-to-end. Skipped on Windows administrator shells per the same PipeAcl-denies-Administrators reasoning the IpcHandshakeIntegrationTests use. Full solution 952 pass / 1 pre-existing Phase 0 baseline. Phase 2 evidence doc updated: status header now reads "Streams A+B+C complete... Streams D+E remain — gated only on the iterative Galaxy code lift + parity-debug cycle"; new Update 2026-04-17 (later) callout enumerates the upgrade with explicit "what's left for the Phase 2 exit gate" — replace StubGalaxyBackend with a MxAccessClient-backed implementation calling on the StaPump, then run the v1 IntegrationTests against the v2 topology and iterate on parity defects until green, then delete legacy OtOpcUa.Host.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 23:02:00 -04:00
Joseph Doherty
a1e9ed40fb Doc — record that this dev box (DESKTOP-6JL3KKO) hosts the full AVEVA stack required for the LmxOpcUa Phase 2 breakout, removing the "needs live MXAccess runtime" environmental blocker that the partial-exit evidence cited as gating Streams D + E. Inventory verified via Get-Service: 27 ArchestrA / Wonderware / AVEVA services running including aaBootstrap, aaGR (Galaxy Repository), aaLogger, aaUserValidator, aaPim, ArchestrADataStore, AsbServiceManager, AutoBuild_Service; the full Historian set (aahClientAccessPoint, aahGateway, aahInSight, aahSearchIndexer, aahSupervisor, InSQLStorage, InSQLConfiguration, InSQLEventSystem, InSQLIndexing, InSQLIOServer, InSQLManualStorage, InSQLSystemDriver, HistorianSearch-x64); slssvc (Wonderware SuiteLink); MXAccess COM DLL at C:\Program Files (x86)\ArchestrA\Framework\bin\ArchestrA.MXAccess.dll plus the matching .tlb files; OI-Gateway install at C:\Program Files (x86)\Wonderware\OI-Server\OI-Gateway\ — which means the Phase 1 Task E.10 AppServer-via-OI-Gateway smoke test (decision #142) is *also* runnable on the same box, not blocked on a separate AVEVA test machine as the original deferral assumed. dev-environment.md inventory row for "Dev Galaxy" now lists every service and file path; status flips to "Fully available — Phase 2 lift unblocked"; the GLAuth row also fills out v2.4.0 actual install details (direct-bind cn={user},dc=lmxopcua,dc=local; users readonly/writeop/writetune/writeconfig/alarmack/admin/serviceaccount; running under NSSM service GLAuth; current GroupToRole mapping ReadOnly→ConfigViewer / WriteOperate→ConfigEditor / AlarmAck→FleetAdmin) and notes the v2-rebrand to dc=otopcua,dc=local is a future cosmetic change. phase-2-partial-exit-evidence.md status header gains "runtime now in place"; an Update 2026-04-17 callout enumerates the same service inventory and concludes "no environmental blocker remains"; the next-session checklist's first step changes from "stand up dev Galaxy" to "verify the local AVEVA stack is still green (Get-Service aaGR, aaBootstrap, slssvc → Running) and the Galaxy ZB repository is reachable" with a new step 9 calling out that the AppServer-via-OI-Gateway smoke test should now be folded in opportunistically. plan.md §"4. Galaxy/MXAccess as Out-of-Process Driver" gains a "Dev environment for the LmxOpcUa breakout" paragraph documenting which physical machine has the runtime so the planning doc no longer reads as if AVEVA capability were a future logistical concern. No source / test changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 22:42:15 -04:00
Joseph Doherty
01fd90c178 Phase 1 Streams B–E scaffold + Phase 2 Streams A–C scaffold — 8 new projects with ~70 new tests, all green alongside the 494 v1 IntegrationTests baseline (parity preserved: no v1 tests broken; legacy OtOpcUa.Host untouched). Phase 1 finish: Configuration project (16 entities + 10 enums + DbContext + DesignTimeDbContextFactory + InitialSchema/StoredProcedures/AuthorizationGrants migrations — 8 procs including sp_PublishGeneration with MERGE on ExternalIdReservation per decision #124, sp_RollbackToGeneration cloning rows into a new published generation, sp_ValidateDraft with cross-cluster-namespace + EquipmentUuid-immutability + ZTag/SAPID reservation pre-flight, sp_ComputeGenerationDiff with CHECKSUM-based row signature — plus OtOpcUaNode/OtOpcUaAdmin SQL roles with EXECUTE grants scoped to per-principal-class proc sets and DENY UPDATE/DELETE/INSERT/SELECT on dbo schema); managed DraftValidator covering UNS segment regex, path length, EquipmentUuid immutability across generations, same-cluster namespace binding (decision #122), reservation pre-flight, EquipmentId derivation (decision #125), driver↔namespace compatibility — returning every failing rule in one pass; LiteDB local cache with round-trip + ring pruning + corruption-fast-fail; GenerationApplier with per-entity Added/Removed/Modified diff and dependency-ordered callbacks (namespace → driver → device → equipment → poll-group → tag, Removed before Added); Core project with GenericDriverNodeManager (scaffold for the Phase 2 Galaxy port) and DriverHost lifecycle registry; Server project using Microsoft.Extensions.Hosting BackgroundService replacing TopShelf, with NodeBootstrap that falls back to LiteDB cache when the central DB is unreachable (decision #79); Admin project scaffolded as Blazor Server with Bootstrap 5 sidebar layout, cookie auth, three admin roles (ConfigViewer/ConfigEditor/FleetAdmin), Cluster + Generation services fronting the stored procs. Phase 2 scaffold: Driver.Galaxy.Shared (netstandard2.0) with full MessagePack IPC contract surface — Hello version negotiation, Open/CloseSession, Heartbeat, DiscoverHierarchy + GalaxyObjectInfo/GalaxyAttributeInfo, Read/WriteValues, Subscribe/Unsubscribe/OnDataChange, AlarmSubscribe/Event/Ack, HistoryRead, HostConnectivityStatus, Recycle — plus length-prefixed framing (decision #28) with a 16 MiB cap and thread-safe FrameWriter/FrameReader; Driver.Galaxy.Host (net48) implementing the Tier C cross-cutting protections from driver-stability.md — strict PipeAcl (allow configured server SID only, explicit deny on LocalSystem + Administrators), PipeServer with caller-SID verification via pipe.RunAsClient + WindowsIdentity.GetCurrent and per-process shared-secret Hello, Galaxy-specific MemoryWatchdog (warn at max(1.5×baseline, +200 MB), soft-recycle at max(2×baseline, +200 MB), hard ceiling 1.5 GB, slope ≥5 MB/min over 30-min rolling window), RecyclePolicy (1 soft recycle per hour cap + 03:00 local daily scheduled), PostMortemMmf (1000-entry ring buffer in %ProgramData%\OtOpcUa\driver-postmortem\galaxy.mmf, survives hard crash, readable cross-process), MxAccessHandle : SafeHandle (ReleaseHandle loops Marshal.ReleaseComObject until refcount=0 then calls optional unregister callback), StaPump with responsiveness probe (BlockingCollection dispatcher for Phase 1 — real Win32 GetMessage/DispatchMessage pump slots in with the same semantics when the Galaxy code lift happens), IsExternalInit shim for init setters on .NET 4.8; Driver.Galaxy.Proxy (net10) implementing IDriver + ITagDiscovery forwarding over the IPC channel with MX data-type and security-classification mapping, plus Supervisor pieces — Backoff (5s → 15s → 60s capped, reset-on-stable-run), CircuitBreaker (3 crashes per 5 min opens; 1h → 4h → manual cooldown escalation; sticky alert doesn't auto-clear), HeartbeatMonitor (2s cadence, 3 consecutive misses = host dead per driver-stability.md). Infrastructure: docker SQL Server remapped to host port 14330 to coexist with the native MSSQL14 Galaxy ZB DB instance on 1433; NuGetAuditSuppress applied per-project for two System.Security.Cryptography.Xml advisories that only reach via EF Core Design with PrivateAssets=all (fix ships in 11.0.0-preview); .slnx gains 14 project registrations. Deferred with explicit TODOs in docs/v2/implementation/phase-2-partial-exit-evidence.md: Phase 1 Stream E Admin UI pages (Generations listing + draft-diff-publish, Equipment CRUD with OPC 40010 fields, UNS Areas/Lines tabs, ACLs + permission simulator, Generic JSON config editor, SignalR real-time, Release-Reservation + Merge-Equipment workflows, LDAP login page, AppServer smoke test per decision #142), Phase 2 Stream D (Galaxy MXAccess code lift out of legacy OtOpcUa.Host, dual-service installer, appsettings → DriverConfig migration script, legacy Host deletion — blocked by parity), Phase 2 Stream E (v1 IntegrationTests against v2 topology, Client.CLI walkthrough diff, four 2026-04-13 stability findings regression tests, adversarial review — requires live MXAccess runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:35:25 -04:00
Joseph Doherty
fc0ce36308 Add Installed Inventory section to dev-environment.md tracking every v2 dev service, toolchain, credential, port, data location, and container volume stood up on this machine. Records what is actually running (not just planned) so future setup work and troubleshooting has a single source of truth. Four subsections: Host (machine identity, VM platform, CPU, OS features); Toolchain (.NET 10 SDK 10.0.201 + runtimes 10.0.5, WSL2 default v2 with docker-desktop distro Running, Docker Desktop 29.3.1 / engine 29.3.1, dotnet-ef CLI 10.0.6 — each row records install method and date); Services (SQL Server 2022 container otopcua-mssql at localhost:1433 with sa/OtOpcUaDev_2026! credentials and Docker named volume otopcua-mssql-data mounted at /var/opt/mssql, dev Galaxy, GLAuth at C:\publish\glauth\ on ports 3893/3894, plus rows for not-yet-standing services like OPC Foundation reference server / FOCAS stub / Modbus simulator / ab_server / Snap7 / TwinCAT XAR VM with target ports to stand up later); Connection strings for appsettings.Development.json (copy-paste-ready, flagged never-commit); Container management quick reference (start/stop/logs/shell/query/nuclear-reset); Credential rotation note.
Per decision #137 (dev env credentials documented openly in dev-environment.md; production uses Integrated Security / gMSA per decision #46 and never any value from this table). Section lives at the top of the doc immediately after Two Environment Tiers, so it's discoverable as the single source of truth for "what's actually running here right now".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 16:57:09 -04:00
Joseph Doherty
bf6741ba7f Doc — flesh out dev-environment.md inner-loop bootstrap with the explicit Windows install steps that surfaced when actually trying to stand up SQL Server on the local box: prereq winget commands per tool (.NET 10 SDK / .NET Framework 4.8 SDK + targeting pack / Git / PowerShell 7.4+); WSL2 install (UAC-elevated) as a separate sub-step before Docker Desktop; Docker Desktop install (UAC-elevated) followed by sign-out/sign-in for docker-users group membership; explicit post-install Docker Desktop config checklist (WSL 2 based engine = checked, Windows containers = NOT checked, WSL Integration enabled for Ubuntu) per decision #134; named volume otopcua-mssql-data:/var/opt/mssql on the SQL Server container so DB files survive container restart and docker rm; sqlcmd verification command using the new mssql-tools18 path that the 2022 image ships with; EF Core CLI install for use starting in Phase 1 Stream B; bumped step count from 8 → 10. Also adds a Troubleshooting subsection covering the seven most common Windows install snags (WSL distro not auto-installed needs -d Ubuntu; Docker PATH not refreshed needs new shell or sign-in; docker-users group membership needs sign-out/in; WSL 2 kernel update needs manual install on legacy systems; SA password complexity rules; Linux vs Windows containers mode mismatch; Hyper-V coexistence with Docker requires WSL 2 backend not Hyper-V backend per decision #134). Step 1 acceptance criteria gain "docker ps shows otopcua-mssql Up" and explicit note that steps 4a/4b need admin elevation (no silent admin-free path exists on Windows).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:54:52 -04:00
Joseph Doherty
980ea5190c Phase 1 Stream A — Core.Abstractions project + 11 capability interfaces + DriverTypeRegistry + interface-independence tests
New project src/ZB.MOM.WW.OtOpcUa.Core.Abstractions (.NET 10, BCL-only dependencies, GenerateDocumentationFile=true, TreatWarningsAsErrors=true) defining the contract surface every driver implements. Per docs/v2/plan.md decisions #4 (composable capability interfaces), #52 (streaming IAddressSpaceBuilder), #53 (capability discovery via `is` checks no flag enum), #54 (optional IRediscoverable sub-interface), #59 (Core.Abstractions internal-only for now design as if public).

Eleven capability interfaces:
- IDriver — required lifecycle / health / config-apply / memory-footprint accounting (per driver-stability.md Tier A/B allocation tracking)
- ITagDiscovery — discovers tags streaming to IAddressSpaceBuilder
- IReadable — on-demand reads idempotent for Polly retry
- IWritable — writes NOT auto-retried by default per decisions #44 + #45
- ISubscribable — data-change subscriptions covering both native (Galaxy MXAccess advisory, OPC UA monitored items, TwinCAT ADS) and driver-internal polled (Modbus, AB CIP, S7, FOCAS) mechanisms; OnDataChange callback regardless of source
- IAlarmSource — alarm events + acknowledge + AlarmSeverity enum mirroring acl-design.md NodePermissions alarm-severity values
- IHistoryProvider — HistoryReadRaw + HistoryReadProcessed with continuation points
- IRediscoverable — opt-in change-detection signal; static drivers don't implement
- IHostConnectivityProbe — generalized from Galaxy's GalaxyRuntimeProbeManager per plan §5a
- IDriverConfigEditor — Admin UI plug-point for per-driver custom config editors deferred to each driver's phase per decision #27
- IAddressSpaceBuilder — streaming builder API for driver-driven address-space construction

Plus DTOs: DriverDataType, SecurityClassification (mirroring v1 Galaxy model), DriverAttributeInfo (replaces Galaxy-specific GalaxyAttributeInfo per plan §5a), DriverHealth + DriverState, DataValueSnapshot (universal OPC UA quality + timestamp carrier per decision #13), HostConnectivityStatus + HostState + HostStatusChangedEventArgs, RediscoveryEventArgs, DataChangeEventArgs, AlarmEventArgs + AlarmAcknowledgeRequest + AlarmSeverity, WriteRequest + WriteResult, HistoryReadResult + HistoryAggregateType, ISubscriptionHandle + IAlarmSubscriptionHandle + IVariableHandle.

DriverTypeRegistry singleton with Register / Get / TryGet / All; thread-safe via Interlocked.Exchange snapshot replacement on registration; case-insensitive lookups; rejects duplicate registrations; rejects empty type names. DriverTypeMetadata record carries TypeName + AllowedNamespaceKinds (NamespaceKindCompatibility flags enum per decision #111) + per-config-tier JSON Schemas the validator checks at draft-publish time (decision #91).

Tests project tests/ZB.MOM.WW.OtOpcUa.Core.Abstractions.Tests (xUnit v3 1.1.0 matching existing test projects). 24 tests covering: 1) interface independence reflection check (no references outside BCL/System; all public types in root namespace; every capability interface is public); 2) DriverTypeRegistry round-trip, case-insensitive lookups, KeyNotFoundException on unknown, null on TryGet of unknown, InvalidOperationException on duplicate registration (case-insensitive too), All() enumeration, NamespaceKindCompatibility bitmask combinations, ArgumentException on empty type names.

Build: 0 errors, 4 warnings (only pre-existing transitive package vulnerability + analyzer hints). Full test suite: 845 passing / 1 failing — strict improvement over Phase 0 baseline (821/1) by the 24 new Core.Abstractions tests; no regressions in any other test project.

Phase 1 entry-gate record (docs/v2/implementation/entry-gate-phase-1.md) documents the deviation: only Stream A executed in this continuation since Streams B-E need SQL Server / GLAuth / Galaxy infrastructure standup per dev-environment.md Step 1, which is currently TODO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:15:55 -04:00
Joseph Doherty
3b2defd94f Phase 0 — mechanical rename ZB.MOM.WW.LmxOpcUa.* → ZB.MOM.WW.OtOpcUa.*
Renames all 11 projects (5 src + 6 tests), the .slnx solution file, all source-file namespaces, all axaml namespace references, and all v1 documentation references in CLAUDE.md and docs/*.md (excluding docs/v2/ which is already in OtOpcUa form). Also updates the TopShelf service registration name from "LmxOpcUa" to "OtOpcUa" per Phase 0 Task 0.6.

Preserves runtime identifiers per Phase 0 Out-of-Scope rules to avoid breaking v1/v2 client trust during coexistence: OPC UA `ApplicationUri` defaults (`urn:{GalaxyName}:LmxOpcUa`), server `EndpointPath` (`/LmxOpcUa`), `ServerName` default (feeds cert subject CN), `MxAccessConfiguration.ClientName` default (defensive — stays "LmxOpcUa" for MxAccess audit-trail consistency), client OPC UA identifiers (`ApplicationName = "LmxOpcUaClient"`, `ApplicationUri = "urn:localhost:LmxOpcUaClient"`, cert directory `%LocalAppData%\LmxOpcUaClient\pki\`), and the `LmxOpcUaServer` class name (class rename out of Phase 0 scope per Task 0.5 sed pattern; happens in Phase 1 alongside `LmxNodeManager → GenericDriverNodeManager` Core extraction). 23 LmxOpcUa references retained, all enumerated and justified in `docs/v2/implementation/exit-gate-phase-0.md`.

Build clean: 0 errors, 30 warnings (lower than baseline 167). Tests at strict improvement over baseline: 821 passing / 1 failing vs baseline 820 / 2 (one flaky pre-existing failure passed this run; the other still fails — both pre-existing and unrelated to the rename). `Client.UI.Tests`, `Historian.Aveva.Tests`, `Client.Shared.Tests`, `IntegrationTests` all match baseline exactly. Exit gate compliance results recorded in `docs/v2/implementation/exit-gate-phase-0.md` with all 7 checks PASS or DEFERRED-to-PR-review (#7 service install verification needs Windows service permissions on the reviewer's box).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 13:57:47 -04:00
Joseph Doherty
5b8d708c58 Phase 0 entry gate + branch-naming convention fix
Record Phase 0 entry baseline: 820 passing, 2 pre-existing failures (Client.CLI.Tests.SubscribeCommandTests.Execute_PrintsSubscriptionMessage and Tests.MxAccess.MxAccessClientMonitorTests.Monitor_ProbeDataChange_PreventsStaleReconnect), 0 build errors, 167 build warnings. The two failures exist on v2 as of commit 1189dc8 and are unrelated to the rename. Phase 0 exit gate adapts the requirement to "failure count = baseline (2); pass count ≥ baseline (820)".

Branch-naming convention updated in implementation/overview.md and phase-0 doc: cannot use `v2/phase-N-slug` form because git treats `/` as path separator and `v2` already exists as a branch, blocking creation of any `v2/...` branch. Convention is now `phase-N-slug` (no v2/ prefix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 13:46:26 -04:00
Joseph Doherty
1189dc87fd Close corrections-doc E2 (Aveva System Platform IO upstream-OPC-UA pattern verification) with GREEN-YELLOW verdict (decision #141) — AVEVA's OI Gateway communication driver is the documented path for AppServer to consume from arbitrary upstream OPC UA servers; multiple AVEVA partners (Software Toolbox, InSource) have published end-to-end integrations against four different non-AVEVA upstream servers (TOP Server, OPC Router, OmniServer, Cogent DataHub). No re-architecting of OtOpcUa required. Path: OPC UA node → OI Gateway → SuiteLink → $DDESuiteLinkDIObject → AppServer attribute. Recommended AppServer floor: System Platform 2023 R2 Patch 01.
OtOpcUa-side requirements all met or trivially met by v2: Basic256Sha256 + SignAndEncrypt + username token (transport security covers this), reject-and-trust cert workflow, endpoint URL must NOT include /discovery suffix (forum-documented failure mode), hostname-stable certs (decision #86 already enforces this since clients pin trust to ApplicationUri), OI Gateway service must NOT run under SYSTEM (deployment-guide concern). Two integrator-burden risks tracked: validation/GxP paperwork (no AVEVA blueprint exists for non-AVEVA upstream servers in Part 11 deployments — engage QA/regulatory in Year 1) and unpublished scale benchmarks (in-house benchmark required in Year 2 before cutover scheduling).

Phase 1 acceptance gains Task E.10 (decision #142): end-to-end AppServer-via-OI-Gateway smoke test against a Phase 1 OtOpcUa instance, catching AppServer-specific quirks (cert exchange, endpoint URL handling, service account, security mode combo) well before the Year 3 tier-3 cutover schedule. Non-blocking for Phase 1 exit if it surfaces only documentation-level fixes; blocking if it surfaces architectural incompatibility.

New file `docs/v2/aveva-system-platform-io-research.md` captures the full research with all source citations (AVEVA docs, Communications Drivers Pack readmes, Software Toolbox / InSource partner walkthroughs, Inductive Automation forum failure-mode reports). Plan.md decision log gains #141 and #142; Reference Documents section links the new doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 13:27:13 -04:00
Joseph Doherty
0307a498ed Resolve enterprise shortname = zb (decision #140, closes corrections-doc D4) — matches the existing ZB.MOM.WW.* namespace prefix used throughout the codebase, short by design since this segment appears in every equipment path (zb/warsaw-west/bldg-3/line-2/cnc-mill-05/RunState), operators already say "ZB" colloquially. Updated plan.md UNS browse-path example, plan.md Namespace schema sketch NamespaceUri example, plan.md ServerCluster schema sketch Enterprise comment, admin-ui.md cluster-create workflow form (Enterprise field default-prefilled zb), config-db-schema.md ServerCluster.Enterprise column comment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 13:12:41 -04:00
Joseph Doherty
d8fa3a0ddd Add OPC 40010 Machinery identity columns to the Equipment table (Manufacturer, Model, SerialNumber, HardwareRevision, SoftwareRevision, YearOfConstruction, AssetLocation, ManufacturerUri, DeviceManualUri — all nullable so equipment can be added before identity is fully captured) and surface them via a dedicated Identification panel in the Admin UI Equipment editor. These pair with the new _base equipment-class template seeded in 3yearplan/schemas/classes/_base.json that every other class extends — Manufacturer + Model are declared isRequired: true in the base, the rest optional. Drivers that can read these dynamically (FANUC cnc_sysinfo() returning SoftwareRevision, Beckhoff TwinCAT.SystemInfo, etc.) override the static value at runtime; static value is the operator-set fallback. Exposed on the OPC UA equipment node under the OPC 40010-standard Identification sub-folder per the category → folder mapping in schemas/docs/format-decisions.md D10. Decisions #138 (_base class with extends inheritance, references OPC 40010 + Part 9 + ISO 22400 + handoff canonical state vocabulary) and #139 (Equipment table identity columns) capture the rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 12:54:35 -04:00
Joseph Doherty
4903a19ec9 Add data-path ACL design (acl-design.md, closes corrections B1) + dev-environment inventory and setup plan (dev-environment.md), and remove consumer cutover from OtOpcUa v2 scope.
ACL design defines NodePermissions bitmask flags covering Browse / Read / Subscribe / HistoryRead / WriteOperate / WriteTune / WriteConfigure / AlarmRead / AlarmAcknowledge / AlarmConfirm / AlarmShelve / MethodCall plus common bundles (ReadOnly / Operator / Engineer / Admin); 6-level scope hierarchy (Cluster / Namespace / UnsArea / UnsLine / Equipment / Tag) with default-deny + additive grants and Browse-implication on ancestors; per-LDAP-group grants in a new generation-versioned NodeAcl table edited via the same draft → diff → publish → rollback boundary as every other content table; per-session permission-trie evaluator with O(depth × group-count) cost cached for the lifetime of the session and rebuilt on generation-apply or LDAP group cache expiry; cluster-create workflow seeds a default ACL set matching the v1 LmxOpcUa LDAP-role-to-permission map for v1 → v2 consumer migration parity; Admin UI ACL tab with two views (by LDAP group, by scope), bulk-grant flow, and permission simulator that lets operators preview "as user X" effective permissions across the cluster's UNS tree before publishing; explicit Deny deferred to v2.1 since verbose grants suffice at v2.0 fleet sizes; only denied OPC UA operations are audit-logged (not allowed ones — would dwarf the audit log). Schema doc gains the NodeAcl table with cross-cluster invariant enforcement and same-generation FK validation; admin-ui.md gains the ACLs tab; phase-1 doc gains Task E.9 wiring this through Stream E plus a NodeAcl entry in Task B.1's DbContext list.

Dev-environment doc inventories every external resource the v2 build needs across two tiers per decision #99 — inner-loop (in-process simulators on developer machines: SQL Server local or container, GLAuth at C:\publish\glauth\, local dev Galaxy) and integration (one dedicated Windows host with Docker Desktop on WSL2 backend so TwinCAT XAR VM can run in Hyper-V alongside containerized oitc/modbus-server, plus WSL2-hosted Snap7 and ab_server, plus OPC Foundation reference server, plus FOCAS TestStub and FaultShim) — with concrete container images, ports, default dev credentials (clearly marked dev-only since production uses Integrated Security / gMSA per decision #46), bootstrap order for both tiers, network topology diagram, test data seed locations, and operational risks (TwinCAT trial expiry automation, Docker pricing, integration host SPOF mitigation, per-developer GLAuth config sync, Aveva license scoping that keeps Galaxy tests on developer machines and off the shared host).

Removes consumer cutover (ScadaBridge / Ignition / System Platform IO) from OtOpcUa v2 scope per decision #136 — owned by a separate integration / operations team, tracked in 3-year-plan handoff §"Rollout Posture" and corrections §C5; OtOpcUa team's scope ends at Phase 5. Updates implementation/overview.md phase index to drop the "6+" row and add an explicit "OUT of v2 scope" callout; updates phase-1 and phase-2 docs to reframe cutover as integration-team-owned rather than future-phase numbered.

Decisions #129–137 added: ACL model (#129), NodeAcl generation-versioned (#130), v1-compatibility seed (#131), denied-only audit logging (#132), two-tier dev environment (#133), Docker WSL2 backend for TwinCAT VM coexistence (#134), TwinCAT VM centrally managed / Galaxy on dev machines only (#135), cutover out of v2 scope (#136), dev credentials documented openly (#137).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 11:58:33 -04:00
Joseph Doherty
2a6c9828e4 Confirm the v2 driver list as fixed (decision #128) and remove the Equipment Protocol Survey from the v2 prerequisites — the seven committed drivers (Modbus TCP including DL205, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client) plus Galaxy/MXAccess are confirmed by direct knowledge of the equipment estate (TwinCAT and AB Legacy specifically called out by the OtOpcUa team based on known Beckhoff installations and SLC/MicroLogix legacy equipment); the survey may still inform long-tail driver scoping and per-site capacity planning but adding/removing drivers from the v2 implementation list is now out of scope. Phase-1 implementation doc loses the survey row from its Out-of-Scope table.
Add Phase 2 detailed implementation plan (docs/v2/implementation/phase-2-galaxy-out-of-process.md) covering the largest refactor phase — moving Galaxy from the legacy in-process OtOpcUa.Host project into the Tier C out-of-process topology specified in driver-stability.md. Five work streams: A. Driver.Galaxy.Shared (.NET Standard 2.0 IPC contracts using MessagePack with hello-message version negotiation), B. Driver.Galaxy.Host (.NET 4.8 x86 separate Windows service that owns MxAccessBridge / GalaxyRepository / alarm tracking / GalaxyRuntimeProbeManager / Wonderware Historian SDK / STA thread + Win32 message pump with health probe / MxAccessHandle SafeHandle for COM lifetime / subscription registry with cross-host quality scoping / named-pipe IPC server with mandatory ACL + caller SID verification + per-process shared secret / memory watchdog with Galaxy-specific 1.5x baseline + 200MB floor + 1.5GB ceiling / recycle policy with 15s grace + WM_QUIT escalation to hard-exit / post-mortem MMF writer / Driver.Galaxy.FaultShim test-only assembly), C. Driver.Galaxy.Proxy (.NET 10 in-process driver implementing every capability interface, heartbeat sender on dedicated channel with 2s/3-miss tolerance, supervisor with respawn-with-backoff and crash-loop circuit breaker with escalating cooldown 1h/4h/24h, address space build via IAddressSpaceBuilder producing byte-equivalent v1 output), D. Retire legacy OtOpcUa.Host (delete from solution, two-service Windows installer, migrate appsettings.json Galaxy sections to central DB DriverConfig blob), E. Parity validation (v1 IntegrationTests pass count = baseline failures = 0, scripted Client.CLI walkthrough output diff vs v1 only differs in timestamps/latency, four named regression tests for the 2026-04-13 stability findings). Compliance script verifies all eight Tier C cross-cutting protections have named passing tests. Decision #128 captures the survey-removal; cross-references added to plan.md Reference Documents and overview.md phase index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 11:35:04 -04:00
Joseph Doherty
592fa79e3c Add Phase 0 + Phase 1 detailed implementation plans under docs/v2/implementation/ with a phase-gate model so the work can be verified for compliance to the v2 design as it lands. Three-gate structure per phase (entry / mid / exit) with explicit compliance-check categories: schema compliance (live DB introspected against config-db-schema.md DDL via xUnit), decision compliance (every decision number cited in the phase doc must have at least one code/test citation in the codebase, verified via git grep), visual compliance (Admin UI screenshots reviewed side-by-side against ScadaLink CentralUI's equivalent screens), behavioral compliance (per-phase end-to-end smoke test that always passes at exit, never "known broken fix later"), stability compliance (cross-cutting protections from driver-stability.md wired up and regression-tested for Tier C drivers), and documentation compliance (any deviation from v2 design docs reflected back as decision-log updates with explicit "supersedes" notes). Exit gate requires two-reviewer signoff and an exit-gate-{phase}.md record; silent deviation is the failure mode the gates exist to make impossible to ship. Phase 0 doc covers the mechanical LmxOpcUa → OtOpcUa rename with 9 tasks, 7 compliance checks, and a completion checklist that gates on baseline test count parity. Phase 1 doc covers the largest greenfield phase — 5 work streams (Core.Abstractions, Configuration project with EF Core schema + stored procs + LiteDB cache + generation-diff applier, Core with GenericDriverNodeManager rename + IAddressSpaceBuilder + driver isolation, Server with Microsoft.Extensions.Hosting replacing TopShelf + credential-bound bootstrap, Admin Blazor Server app mirroring ScadaLink CentralUI verbatim with LDAP cookie auth + draft/diff/publish workflow + UNS structure management + equipment CRUD + release-reservation and merge-equipment operator flows) — with task-level acceptance criteria, a 14-step end-to-end smoke test, and decision citation requirements for #1-125. New decisions #126-127 capture the gate model and per-phase doc structure. Cross-references added to plan.md Reference Documents section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 11:25:09 -04:00
Joseph Doherty
a59ad2e0c6 Harden v2 design against the four findings from the 2026-04-17 Codex adversarial review of the db schema and admin UI: (1) DriverInstance.NamespaceId now enforces a same-cluster invariant in three layers (sp_ValidateDraft cross-table check using the new UX_Namespace_Generation_LogicalId_Cluster composite index, server-side namespace-selection API scoping that prevents bypass via crafted requests, and audit-log entries on cross-cluster attempts) so a draft for cluster A can no longer bind to cluster B's namespace and leak its URI into A's endpoint; (2) the Namespace table moves from cluster-level to generation-versioned with append-only logical-ID identity and locked NamespaceUri/Kind across generations so admins can no longer disable a namespace that a published driver depends on outside the publish/diff/rollback flow, the cluster-create workflow opens an initial draft containing the default namespaces instead of writing namespace rows directly, and the Admin UI Namespaces tab becomes hybrid (read-only over published, click-to-edit opens draft) like the UNS Structure tab; (3) ZTag/SAPID fleet-wide uniqueness moves from per-generation indexes (which silently allow rollback or re-enable to reintroduce duplicates) into a new ExternalIdReservation table that sits outside generation versioning, with sp_PublishGeneration reserving atomically via MERGE under transaction lock so a different EquipmentUuid attempting the same active value rolls the whole publish back, an FleetAdmin-only sp_ReleaseExternalIdReservation as the only path to free a value for reuse with audit trail, and a corresponding Release-reservation operator workflow in the Admin UI; (4) Equipment.EquipmentId is now system-generated as 'EQ-' + first 12 hex chars of EquipmentUuid, never operator-supplied or editable, removed from the Equipment CSV import schema entirely (rows match by EquipmentUuid for updates or create new equipment with auto-generated identifiers when no UUID is supplied), with a new Merge-or-Rebind-equipment operator workflow handling the rare case where two UUIDs need to be reconciled — closing the corruption path where typos and bulk-import renames were minting duplicate identities and breaking downstream UUID-keyed lineage. New decisions #122-125 with explicit "supersedes" notes for the earlier #107 (cluster-level namespace) and #116 (operator-set EquipmentId) frames they revise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 11:08:58 -04:00
Joseph Doherty
a1e79cdb06 Draft v2 multi-driver planning docs (docs/v2/) so Phase 0–5 work has a complete reference: rename to OtOpcUa, migrate to .NET 10 x64 (Galaxy stays .NET 4.8 x86 out-of-process), add seven new drivers behind composable capability interfaces (Modbus TCP / DL205, AB CIP, AB Legacy, S7, TwinCAT, FOCAS, OPC UA Client), introduce a central MSSQL config DB with cluster-scoped immutable generations and per-node credential binding, deploy as two-node site clusters with non-transparent redundancy and minimal per-node overrides, classify drivers by stability tier (A pure-managed / B wrapped-native / C out-of-process Windows service) with Tier C deep dives for both Galaxy and FOCAS, define per-driver test data sources (libplctag ab_server, Snap7, NModbus in-proc, TwinCAT XAR VM, FOCAS TCP stub plus native FaultShim) plus a 6-axis cross-driver test matrix, and ship a Blazor Server admin UI mirroring ScadaLink CentralUI's Bootstrap 5 / LDAP cookie auth / dark-sidebar look-and-feel — 106 numbered decisions across six docs (plan.md, driver-specs.md, driver-stability.md, test-data-sources.md, config-db-schema.md, admin-ui.md), DRAFT only and intentionally not yet wired to code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 09:17:49 -04:00