docs(code-reviews): re-review batch 2 at 39d737e — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring
17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.
This commit is contained in:
@@ -5,10 +5,10 @@
|
|||||||
| Module | `src/ScadaLink.ConfigurationDatabase` |
|
| Module | `src/ScadaLink.ConfigurationDatabase` |
|
||||||
| Design doc | `docs/requirements/Component-ConfigurationDatabase.md` |
|
| Design doc | `docs/requirements/Component-ConfigurationDatabase.md` |
|
||||||
| Status | Reviewed |
|
| Status | Reviewed |
|
||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-17 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `39d737e` |
|
||||||
| Open findings | 0 |
|
| Open findings | 3 |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -37,6 +37,28 @@ repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
|
|||||||
`ExternalSystemRepository`, `InboundApiRepository`, `NotificationRepository`,
|
`ExternalSystemRepository`, `InboundApiRepository`, `NotificationRepository`,
|
||||||
`SiteRepository`, `InstanceLocator`) have little or no direct coverage.
|
`SiteRepository`, `InstanceLocator`) have little or no direct coverage.
|
||||||
|
|
||||||
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||||
|
|
||||||
|
Re-reviewed the module at commit `39d737e`. All eleven findings from the initial
|
||||||
|
review (`9c60592`) remain `Resolved` — the secret-column encryption
|
||||||
|
(`EncryptedStringConverter` + `ApplySecretColumnEncryption`), the fail-fast no-arg
|
||||||
|
DI overload, the `AsSplitQuery` conversions, the audit cycle-safe serializer, and the
|
||||||
|
added repository test coverage all verified present and consistent with their
|
||||||
|
resolutions. Three new findings were recorded. The most material is that the
|
||||||
|
encryption work done for CD-004 left one bearer credential out of scope:
|
||||||
|
`ApiKey.KeyValue` — the inbound-API authentication secret — is still persisted in
|
||||||
|
plaintext (`ConfigurationDatabase-012`); it cannot use the same non-deterministic
|
||||||
|
Data Protection converter because authentication looks the key up *by value*, so it
|
||||||
|
needs a hash-based scheme instead. The second is a resilience gap in the encryption
|
||||||
|
plumbing itself: `ApplySecretColumnEncryption` silently substitutes a throwaway
|
||||||
|
`EphemeralDataProtectionProvider` whenever no provider is supplied, so any context
|
||||||
|
constructed via the single-argument constructor on a write path would encrypt
|
||||||
|
secrets with a key discarded at process exit, yielding permanently undecryptable
|
||||||
|
ciphertext with no error (`ConfigurationDatabase-013`). The third is a minor
|
||||||
|
inconsistency — a redundant cast on one of the three `HasConversion` calls
|
||||||
|
(`ConfigurationDatabase-014`). The module is otherwise healthy and the prior fixes
|
||||||
|
hold up well.
|
||||||
|
|
||||||
## Checklist coverage
|
## Checklist coverage
|
||||||
|
|
||||||
| # | Category | Examined | Notes |
|
| # | Category | Examined | Notes |
|
||||||
@@ -44,11 +66,11 @@ repositories (`TemplateEngineRepository`, `DeploymentManagerRepository`,
|
|||||||
| 1 | Correctness & logic bugs | ✓ | `GetTemplateWithChildrenAsync` discards loaded children (CD-001); `GetApprovedKeysForMethodAsync` CSV parsing is brittle (CD-008). |
|
| 1 | Correctness & logic bugs | ✓ | `GetTemplateWithChildrenAsync` discards loaded children (CD-001); `GetApprovedKeysForMethodAsync` CSV parsing is brittle (CD-008). |
|
||||||
| 2 | Akka.NET conventions | ✓ | No actors in this module; data-access layer only. No issues found. |
|
| 2 | Akka.NET conventions | ✓ | No actors in this module; data-access layer only. No issues found. |
|
||||||
| 3 | Concurrency & thread safety | ✓ | DbContext correctly scoped; optimistic concurrency on `DeploymentRecord` correct. Repositories hold no shared mutable state. No issues found. |
|
| 3 | Concurrency & thread safety | ✓ | DbContext correctly scoped; optimistic concurrency on `DeploymentRecord` correct. Repositories hold no shared mutable state. No issues found. |
|
||||||
| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003); audit JSON serialization failure handling (CD-007). |
|
| 4 | Error handling & resilience | ✓ | `WaitForDatabaseReadyAsync` is sound. No-arg DI overload fails late and silently (CD-003, resolved); audit JSON serialization failure handling (CD-007, resolved). Re-review: ephemeral Data Protection fallback can silently produce undecryptable ciphertext (CD-013). |
|
||||||
| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002); SMTP/DB-connection/auth secrets stored unencrypted (CD-004). |
|
| 5 | Security | ✓ | Hardcoded `sa` credential literal (CD-002, resolved); SMTP/DB-connection/auth secrets unencrypted (CD-004, resolved). Re-review: `ApiKey.KeyValue` bearer credential still stored in plaintext (CD-012). |
|
||||||
| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009). No N+1 in audited paths. |
|
| 6 | Performance & resource management | ✓ | `GetAllTemplatesAsync` / `GetTemplateTreeAsync` eager-load multiple collections without `AsSplitQuery` (CD-009, resolved). No N+1 in audited paths. Re-review: no new issues. |
|
||||||
| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005); seed data uses `HasData` consistent with design. |
|
| 7 | Design-document adherence | ✓ | Audit `Id` type mismatch vs design doc (CD-005, resolved); seed data uses `HasData` consistent with design. Re-review: no new issues. |
|
||||||
| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006); inconsistent null-guard on injected context (CD-011). |
|
| 8 | Code organization & conventions | ✓ | Mostly clean. `Grpc*` address columns unbounded (CD-006, resolved); inconsistent null-guard on injected context (CD-011, resolved). Re-review: redundant/inconsistent cast on one `HasConversion` call (CD-014). |
|
||||||
| 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). |
|
| 9 | Testing coverage | ✓ | Several repositories and `InstanceLocator` lack direct tests (CD-010). |
|
||||||
| 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. |
|
| 10 | Documentation & comments | ✓ | `DeploymentManagerRepository` "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above. |
|
||||||
|
|
||||||
@@ -570,3 +592,128 @@ every data-access type behaves uniformly and a hand-constructed instance fails w
|
|||||||
informative exception at construction rather than a later `NullReferenceException`.
|
informative exception at construction rather than a later `NullReferenceException`.
|
||||||
Regression: `Constructor_NullContext_Throws` tests were added for all four affected types
|
Regression: `Constructor_NullContext_Throws` tests were added for all four affected types
|
||||||
(`InboundApiRepositoryTests.cs`, `RepositoryCoverageTests.cs`).
|
(`InboundApiRepositoryTests.cs`, `RepositoryCoverageTests.cs`).
|
||||||
|
|
||||||
|
### ConfigurationDatabase-012 — Inbound-API `ApiKey.KeyValue` bearer credential stored in plaintext
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Security |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.ConfigurationDatabase/Configurations/InboundApiConfiguration.cs:17-19` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`ApiKey.KeyValue` is the bearer credential presented in the `X-API-Key` header to
|
||||||
|
authenticate Inbound API requests (HighLevelReqs §7.2–7.3). It is mapped as an
|
||||||
|
ordinary `nvarchar(500)` column with a unique index and persisted verbatim. Anyone
|
||||||
|
with read access to the configuration database — or to any `AuditLogEntry.AfterStateJson`
|
||||||
|
into which an `ApiKey` entity is serialized — obtains live API credentials in cleartext.
|
||||||
|
|
||||||
|
`ConfigurationDatabase-004` introduced encryption-at-rest for the other secret-bearing
|
||||||
|
columns (SMTP credentials, external-system auth config, database connection strings)
|
||||||
|
but explicitly scoped `ApiKey.KeyValue` out. The omission is genuine: the
|
||||||
|
`EncryptedStringConverter` built for CD-004 is backed by ASP.NET Data Protection, which
|
||||||
|
is **non-deterministic** — the same plaintext encrypts to different ciphertext each
|
||||||
|
time — so it cannot be applied here, because `GetApprovedKeysForMethodAsync` and the
|
||||||
|
authentication path resolve a key by its value (`GetApiKeyByValueAsync` does
|
||||||
|
`FirstOrDefaultAsync(k => k.KeyValue == keyValue)`). A non-deterministic converter would
|
||||||
|
break that equality lookup. The result is that the one credential most exposed to
|
||||||
|
external callers is the one credential left unprotected.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Store a salted cryptographic hash of the key value instead of the plaintext (or
|
||||||
|
ciphertext): hash on create, and authenticate by hashing the presented key and
|
||||||
|
comparing. This keeps the equality lookup working (the hash is deterministic) while
|
||||||
|
ensuring the database never holds a usable credential. The plaintext key would then
|
||||||
|
only ever be shown once, at creation time, to the Admin who created it. This requires
|
||||||
|
a coordinated change with the Inbound API / Security components and the `ApiKey`
|
||||||
|
entity in Commons; record the chosen scheme in
|
||||||
|
`docs/requirements/Component-ConfigurationDatabase.md` and the Inbound API design doc.
|
||||||
|
At minimum, ensure `ApiKey` entities are never passed to `IAuditService` without the
|
||||||
|
key value redacted.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### ConfigurationDatabase-013 — Secret-column encryption silently falls back to an ephemeral (throwaway) key
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Error handling & resilience |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs:107-124` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`ApplySecretColumnEncryption` resolves the Data Protection provider as
|
||||||
|
`_dataProtectionProvider ?? new EphemeralDataProtectionProvider()`. The `??` fallback
|
||||||
|
is reached whenever the context is constructed via the single-argument
|
||||||
|
`ScadaLinkDbContext(DbContextOptions)` constructor — i.e. whenever no provider was
|
||||||
|
injected. An `EphemeralDataProtectionProvider` generates a key ring that lives only in
|
||||||
|
process memory and is discarded at process exit.
|
||||||
|
|
||||||
|
For design-time `dotnet ef` tooling this is harmless (the XML remark correctly notes
|
||||||
|
it only emits schema). The risk is on a *runtime write path*. The runtime currently
|
||||||
|
gets the provider-bearing context only because `AddConfigurationDatabase` adds an
|
||||||
|
`AddScoped` factory registration that overrides EF's activator-based registration.
|
||||||
|
That override is the single thing standing between correct behaviour and silent data
|
||||||
|
corruption: any future change that resolves a `ScadaLinkDbContext` through a path the
|
||||||
|
override does not cover — an `AddPooledDbContextFactory`/`IDbContextFactory<ScadaLinkDbContext>`
|
||||||
|
registration, a second `AddDbContext` call, a hand-constructed context in server code —
|
||||||
|
would construct the context with the single-arg constructor, encrypt secret columns
|
||||||
|
with a throwaway key, and persist ciphertext that becomes **permanently undecryptable
|
||||||
|
the moment the process restarts**. There is no exception, no warning; the failure only
|
||||||
|
surfaces later as `CryptographicException` on read (mis-attributed by
|
||||||
|
`EncryptedStringConverter` to "the stored value was not written by this system").
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Do not silently substitute an ephemeral provider for write-capable contexts. Either:
|
||||||
|
(a) require the provider unconditionally and have design-time tooling pass an explicit
|
||||||
|
ephemeral provider so the intent is visible at the call site; or (b) keep the
|
||||||
|
single-arg constructor but mark contexts built without a real provider as
|
||||||
|
schema-only — e.g. record a flag and have the encrypting converter throw a clear
|
||||||
|
`InvalidOperationException` ("secret columns cannot be written without a configured
|
||||||
|
Data Protection key ring") on the first `Protect`, instead of producing throwaway
|
||||||
|
ciphertext. Also harden the DI wiring so a `ScadaLinkDbContext` cannot be resolved
|
||||||
|
through the EF-activator registration at all (e.g. register only the factory, or use
|
||||||
|
`AddDbContextFactory` with the explicit constructor).
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### ConfigurationDatabase-014 — Redundant, inconsistent cast on one `HasConversion` call
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Code organization & conventions |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.ConfigurationDatabase/ScadaLinkDbContext.cs:121-123` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`ApplySecretColumnEncryption` calls `HasConversion(converter)` three times. The first
|
||||||
|
two (`SmtpConfiguration.Credentials`, `ExternalSystemDefinition.AuthConfiguration`)
|
||||||
|
pass `converter` directly; the third (`DatabaseConnectionDefinition.ConnectionString`)
|
||||||
|
casts it to `(Microsoft.EntityFrameworkCore.Storage.ValueConversion.ValueConverter)`.
|
||||||
|
`EncryptedStringConverter` already derives from `ValueConverter<string?, string?>`
|
||||||
|
(itself a `ValueConverter`), and the first two call sites compile fine without the
|
||||||
|
cast, so the cast is redundant. The inconsistency makes the code look as though the
|
||||||
|
third call needs special handling when it does not, and the fully-qualified type name
|
||||||
|
inline adds noise.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Remove the cast so all three calls read identically as `HasConversion(converter)`.
|
||||||
|
If a `ValueConverter`-typed reference is genuinely wanted, give it a local variable of
|
||||||
|
that type once and use it for all three.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|||||||
@@ -5,10 +5,10 @@
|
|||||||
| Module | `src/ScadaLink.DataConnectionLayer` |
|
| Module | `src/ScadaLink.DataConnectionLayer` |
|
||||||
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
|
| Design doc | `docs/requirements/Component-DataConnectionLayer.md` |
|
||||||
| Status | Reviewed |
|
| Status | Reviewed |
|
||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-17 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `39d737e` |
|
||||||
| Open findings | 0 |
|
| Open findings | 4 |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -30,20 +30,40 @@ the design doc's failover state machine and the implemented unstable-disconnect
|
|||||||
heuristic. Test coverage is adequate for the happy paths and failover but absent for
|
heuristic. Test coverage is adequate for the happy paths and failover but absent for
|
||||||
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
|
tag-resolution retry, disconnect/re-subscribe, and concurrency around `HandleSubscribe`.
|
||||||
|
|
||||||
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||||
|
|
||||||
|
All 13 findings from the 2026-05-16 review remain `Resolved` and the fixes were
|
||||||
|
verified in place against the current source (`PipeTo(Self)` subscribe pattern,
|
||||||
|
`Resume` supervision, `ConcurrentDictionary` callback maps, atomic disconnect guards,
|
||||||
|
bounded write timeout, etc.). The re-review walked all 10 checklist categories again
|
||||||
|
and found **4 new findings**: one **High** — the DCL-012 security warning is never
|
||||||
|
seen in production because `RealOpcUaClientFactory.Create()` constructs
|
||||||
|
`RealOpcUaClient` with no logger, so the warning sinks into `NullLogger`; one
|
||||||
|
**Medium** — initial-connect failures in the `Connecting` state never count toward
|
||||||
|
failover, so a connection whose primary endpoint is unreachable at startup retries the
|
||||||
|
primary forever and never tries the configured backup; one **Medium** —
|
||||||
|
`HandleSubscribeCompleted` always replies `SubscribeTagsResponse(success: true)` even
|
||||||
|
when a connection-level subscribe failure is driving the actor into `Reconnecting`,
|
||||||
|
telling the Instance Actor the subscribe succeeded when it did not; and one **Low** —
|
||||||
|
`WriteBatchAsync` does not catch the `InvalidOperationException` from `EnsureConnected`,
|
||||||
|
so a mid-batch disconnect aborts the whole write batch (the same class of defect
|
||||||
|
DCL-007 fixed for `ReadBatchAsync`). New findings are numbered from
|
||||||
|
`DataConnectionLayer-014`.
|
||||||
|
|
||||||
## Checklist coverage
|
## Checklist coverage
|
||||||
|
|
||||||
| # | Category | Examined | Notes |
|
| # | Category | Examined | Notes |
|
||||||
|---|----------|----------|-------|
|
|---|----------|----------|-------|
|
||||||
| 1 | Correctness & logic bugs | x | `_resolvedTags` double-counting and stale counters after failover; `ReadBatchAsync` aborts mid-batch. |
|
| 1 | Correctness & logic bugs | x | 2026-05-16 findings resolved. Re-review: finding 016 — `SubscribeTagsResponse` reports success on a connection-level subscribe failure. |
|
||||||
| 2 | Akka.NET conventions | x | `Task.Run` mutating actor state (critical); `Restart` supervision loses state; closures capturing `_subscriptionsByInstance`. |
|
| 2 | Akka.NET conventions | x | 2026-05-16 findings resolved (`PipeTo(Self)` subscribe, `Resume` supervision). Re-review: no new issues. |
|
||||||
| 3 | Concurrency & thread safety | x | Actor state mutated off the actor thread; `RealOpcUaClient` callback dictionary unsynchronized. |
|
| 3 | Concurrency & thread safety | x | 2026-05-16 findings resolved (`ConcurrentDictionary`, atomic disconnect guards). Re-review: no new issues. |
|
||||||
| 4 | Error handling & resilience | x | Subscription failures not surfaced; unbounded write with no timeout; reconnect after subscribe-time failure not handled. |
|
| 4 | Error handling & resilience | x | Re-review: finding 015 — initial-connect failures never trigger failover; finding 017 — `WriteBatchAsync` aborts on mid-batch disconnect. |
|
||||||
| 5 | Security | x | `AutoAcceptUntrustedCerts` defaults to `true`; OPC UA password handling acceptable. See finding 012. |
|
| 5 | Security | x | Re-review: finding 014 — the DCL-012 auto-accept-cert warning is never logged in production (`RealOpcUaClient` built without a logger). |
|
||||||
| 6 | Performance & resource management | x | `HandleUnsubscribe` O(n^2) over instances; initial-read loop serial per tag. |
|
| 6 | Performance & resource management | x | 2026-05-16 finding 008 resolved (reverse index). Re-review: no new issues. |
|
||||||
| 7 | Design-document adherence | x | Failover heuristic (unstable-disconnect count) differs from documented state machine; `WriteTimeout` documented but unused. |
|
| 7 | Design-document adherence | x | 2026-05-16 findings 005/009 resolved. Re-review: no new issues (finding 015 logged under resilience). |
|
||||||
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
|
| 8 | Code organization & conventions | x | No issues found — POCOs in Commons, options class owned by component, factory pattern consistent. |
|
||||||
| 9 | Testing coverage | x | No tests for tag-resolution retry, disconnect/re-subscribe, bad-quality push, or `HandleSubscribe` concurrency. |
|
| 9 | Testing coverage | x | DCL001–013 regression tests present. Re-review: gaps remain for finding 014/015/016 scenarios (no test for production logger wiring, startup failover, or subscribe-response-on-failure). |
|
||||||
| 10 | Documentation & comments | x | XML comment on `RaiseDisconnected` claims thread safety it does not have; design doc round-robin description stale. |
|
| 10 | Documentation & comments | x | 2026-05-16 finding 013 resolved. Re-review: no new issues. |
|
||||||
|
|
||||||
## Findings
|
## Findings
|
||||||
|
|
||||||
@@ -661,3 +681,179 @@ fanning 32 barrier-synchronised threads that raise the client's `ConnectionLost`
|
|||||||
simultaneously, and asserts `Disconnected` fires exactly once per round; against a
|
simultaneously, and asserts `Disconnected` fires exactly once per round; against a
|
||||||
non-atomic check-then-set it double-fires (verified by temporarily reverting the
|
non-atomic check-then-set it double-fires (verified by temporarily reverting the
|
||||||
guard), and it passes against the atomic fix.
|
guard), and it passes against the atomic fix.
|
||||||
|
|
||||||
|
### DataConnectionLayer-014 — DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | High |
|
||||||
|
| Category | Security |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:325`, `src/ScadaLink.DataConnectionLayer/Adapters/RealOpcUaClient.cs:35-39,79-83` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
Finding DataConnectionLayer-012 was resolved in part by adding a prominent
|
||||||
|
`ILogger` warning in `RealOpcUaClient.ConnectAsync` whenever the auto-accept
|
||||||
|
certificate validator is installed (`RealOpcUaClient.cs:79-83`). The
|
||||||
|
`ILogger<RealOpcUaClient>` constructor parameter was made optional, defaulting to
|
||||||
|
`NullLogger<RealOpcUaClient>.Instance` (`RealOpcUaClient.cs:35-39`).
|
||||||
|
|
||||||
|
However, the only production code path that constructs a `RealOpcUaClient` is
|
||||||
|
`RealOpcUaClientFactory.Create()` (`RealOpcUaClient.cs:325`), which calls
|
||||||
|
`new RealOpcUaClient(_globalOptions)` and passes **no logger**. The factory itself
|
||||||
|
holds only an `OpcUaGlobalOptions` and has no `ILoggerFactory`/`ILogger` available.
|
||||||
|
As a result the `_logger` field is always `NullLogger` for every real OPC UA
|
||||||
|
connection, and the man-in-the-middle warning the DCL-012 fix added is silently
|
||||||
|
discarded. An operator who deploys a connection with `AutoAcceptUntrustedCerts`
|
||||||
|
enabled — accepting any server certificate on an industrial control link — gets no
|
||||||
|
visible signal anywhere in the logs. The in-scope half of DCL-012's resolution is
|
||||||
|
therefore not actually effective in production; only the unit test
|
||||||
|
(`DCL012_OpcUaConnectionOptions_AutoAcceptUntrustedCerts_DefaultsToFalse`, which only
|
||||||
|
checks the default value) passes.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Thread a real logger through to `RealOpcUaClient`. `DataConnectionFactory` already
|
||||||
|
holds an `ILoggerFactory` and constructs `RealOpcUaClientFactory(globalOptions)` —
|
||||||
|
give `RealOpcUaClientFactory` an `ILoggerFactory` (or an `ILogger<RealOpcUaClient>`)
|
||||||
|
constructor parameter and pass `_loggerFactory.CreateLogger<RealOpcUaClient>()` into
|
||||||
|
each `new RealOpcUaClient(...)`. Add a test that asserts the warning is emitted on a
|
||||||
|
real connect with auto-accept enabled (e.g. via a captured `ILogger`), not just that
|
||||||
|
the default is `false`.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### DataConnectionLayer-015 — Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Error handling & resilience |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:404-417`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:419-493` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
Failover between the primary and backup endpoints is implemented in two places, both
|
||||||
|
reachable only after a connection has already been `Connected` at least once:
|
||||||
|
`HandleReconnectResult` (Reconnecting state) counts `_consecutiveFailures` and switches
|
||||||
|
endpoint, and `BecomeReconnecting` counts `_consecutiveUnstableDisconnects`.
|
||||||
|
|
||||||
|
`HandleConnectResult` — the handler for the *initial* connection attempt in the
|
||||||
|
`Connecting` state (`DataConnectionActor.cs:404-417`) — does neither. On failure it
|
||||||
|
only logs and re-arms the reconnect timer with `AttemptConnect`; it never increments
|
||||||
|
`_consecutiveFailures`, never consults `_backupConfig`, and never switches endpoint.
|
||||||
|
|
||||||
|
Consequence: if the primary endpoint is unreachable when the connection actor first
|
||||||
|
starts — which is the common case after a fresh artifact deployment, a site restart,
|
||||||
|
or a primary that is simply down at that moment — the actor retries the *primary*
|
||||||
|
endpoint indefinitely at `ReconnectInterval` and **never** attempts the configured
|
||||||
|
backup. The design doc's endpoint-redundancy promise ("automatic failover when the
|
||||||
|
active endpoint becomes unreachable") is silently not honoured for the
|
||||||
|
never-connected-yet case, and an operator sees a connection stuck `Connecting` forever
|
||||||
|
despite a healthy backup being configured.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Make `HandleConnectResult` participate in the failover counter the same way
|
||||||
|
`HandleReconnectResult` does: increment `_consecutiveFailures` on failure and, when
|
||||||
|
`_backupConfig != null && _consecutiveFailures >= _failoverRetryCount`, perform the
|
||||||
|
endpoint switch (dispose adapter, create the other adapter, bump `_adapterGeneration`,
|
||||||
|
log the failover event) before re-arming the timer. Alternatively, fold the initial
|
||||||
|
connect into the same reconnect path so there is a single failover decision point. Add
|
||||||
|
a regression test for "primary down at startup, backup configured → fails over to
|
||||||
|
backup".
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### DataConnectionLayer-016 — `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Correctness & logic bugs |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:606,666-672`, `src/ScadaLink.DataConnectionLayer/Actors/DataConnectionActor.cs:232-240` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`HandleSubscribeCompleted` computes `connectionLevelFailure` (line 606) and returns it
|
||||||
|
so the `Connected`-state handler can drive the actor into `Reconnecting`
|
||||||
|
(`DataConnectionActor.cs:232-240`). But before returning, it unconditionally replies
|
||||||
|
to the caller with `new SubscribeTagsResponse(..., true, null, ...)` (lines 666-667) —
|
||||||
|
`Success: true`, `Error: null` — regardless of whether any tag failed at connection
|
||||||
|
level.
|
||||||
|
|
||||||
|
So when a subscribe arrives while the adapter is silently down, the Instance Actor is
|
||||||
|
told the subscribe **succeeded**, while the connection actor simultaneously transitions
|
||||||
|
to `Reconnecting`. The tags were never actually subscribed at the adapter (the catch
|
||||||
|
block recorded `Success: false`); they are recovered later by `ReSubscribeAll` only if
|
||||||
|
and when reconnection succeeds. The caller has no way to distinguish "subscribed and
|
||||||
|
healthy" from "accepted, but the connection is currently down" — a misleading
|
||||||
|
success signal on a request that did not do what the response claims.
|
||||||
|
|
||||||
|
(Genuine tag-resolution failures are arguably also reported as overall `true`, but
|
||||||
|
that is defensible: those tags are tracked in `_unresolvedTags` and the design models
|
||||||
|
unresolved tags as a runtime quality concern, with a `Bad`-quality `TagValueUpdate`
|
||||||
|
already pushed. The connection-level case is the clear defect because the actor itself
|
||||||
|
treats it as a failure worth a state transition.)
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
When `connectionLevelFailure` is true, reply with
|
||||||
|
`SubscribeTagsResponse(..., success: false, error: "connection unavailable — will
|
||||||
|
re-subscribe on reconnect", ...)` (or an equivalent), so the caller's response matches
|
||||||
|
the actor's own assessment. Optionally carry per-tag outcomes in the response so the
|
||||||
|
Instance Actor can reflect partial success. Add a test asserting the response is not
|
||||||
|
`Success: true` when a connection-level subscribe failure drives `Reconnecting`.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### DataConnectionLayer-017 — `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Error handling & resilience |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:229-237`, `src/ScadaLink.DataConnectionLayer/Adapters/OpcUaDataConnection.cs:218-227` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`WriteBatchAsync` loops calling `WriteAsync` per tag (`OpcUaDataConnection.cs:229-237`).
|
||||||
|
`WriteAsync` returns a `WriteResult` for OPC-UA-level write rejections (good — a bad
|
||||||
|
status does not abort the batch), but it first calls `EnsureConnected()`
|
||||||
|
(`OpcUaDataConnection.cs:220`), which throws `InvalidOperationException` when the
|
||||||
|
client is disconnected. `WriteBatchAsync` does not catch that exception, so if the
|
||||||
|
connection drops partway through a batch the whole `WriteBatchAsync` throws and the
|
||||||
|
caller gets no result map — losing the per-tag outcomes for the tags that already
|
||||||
|
wrote. This is the same class of defect that DataConnectionLayer-007 fixed for
|
||||||
|
`ReadBatchAsync` (which now records a failed `ReadResult` per failing tag and only
|
||||||
|
propagates `OperationCanceledException`). `WriteBatchAsync` feeds
|
||||||
|
`WriteBatchAndWaitAsync` (line 246), so a disconnect during a flag-and-wait write
|
||||||
|
sequence surfaces as an unhandled exception rather than a clean `false`/per-tag result.
|
||||||
|
|
||||||
|
Severity is Low because device writes are real-time control operations with no
|
||||||
|
store-and-forward, the batch write paths are not on the primary `HandleWrite` hot path
|
||||||
|
(`HandleWrite` calls single-tag `WriteAsync`), and a disconnect mid-batch is itself an
|
||||||
|
error condition — but the inconsistent error shape (exception vs. per-tag result) is a
|
||||||
|
maintainability and correctness wart.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Mirror the DCL-007 fix: wrap the per-tag `WriteAsync` call in `WriteBatchAsync` in a
|
||||||
|
try/catch that records a failed `WriteResult(false, ex.Message)` for the failing tag
|
||||||
|
and continues, while still propagating `OperationCanceledException` to abort a
|
||||||
|
cancelled batch as a whole. This gives callers (including `WriteBatchAndWaitAsync`) a
|
||||||
|
complete, consistent result map.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|||||||
@@ -5,10 +5,10 @@
|
|||||||
| Module | `src/ScadaLink.DeploymentManager` |
|
| Module | `src/ScadaLink.DeploymentManager` |
|
||||||
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
|
| Design doc | `docs/requirements/Component-DeploymentManager.md` |
|
||||||
| Status | Reviewed |
|
| Status | Reviewed |
|
||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-17 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `39d737e` |
|
||||||
| Open findings | 0 |
|
| Open findings | 3 |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -30,20 +30,43 @@ detail. Configuration is not bound to `appsettings.json`, leaving one option
|
|||||||
entirely dead. Test coverage stops at the communication boundary and never
|
entirely dead. Test coverage stops at the communication boundary and never
|
||||||
exercises a successful deployment or the lifecycle success paths.
|
exercises a successful deployment or the lifecycle success paths.
|
||||||
|
|
||||||
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||||
|
|
||||||
|
Re-reviewed at commit `39d737e` after the batch of fixes for
|
||||||
|
DeploymentManager-001..014. All fourteen prior findings remain `Resolved` and
|
||||||
|
verified against source — the broadened catch, non-cancellable cleanup writes,
|
||||||
|
ref-counted `OperationLockManager`, query-before-redeploy reconciliation,
|
||||||
|
structured diff, options binding, and the expanded TestKit-actor test suite are
|
||||||
|
all present and correct. The module is in markedly better shape than the
|
||||||
|
first review: error paths are now defensively handled and test coverage is
|
||||||
|
broad (successful deploy/lifecycle, lock serialization, reconciliation
|
||||||
|
matrix, artifact per-site matrix).
|
||||||
|
|
||||||
|
This re-review found **3 new findings**, all clustered on the
|
||||||
|
DeploymentManager-006 reconciliation path added since the last review. The
|
||||||
|
reconciliation shortcut (`TryReconcileWithSiteAsync`) marks a stale prior
|
||||||
|
record `Success` when the site already has the target revision, but it does
|
||||||
|
**not** perform the side effects the normal success path does — it never
|
||||||
|
updates the instance `State`, never refreshes the `DeployedConfigSnapshot`,
|
||||||
|
and never corrects the prior record's own `RevisionHash` (DeploymentManager-015,
|
||||||
|
DeploymentManager-016). The `GetDeploymentStatusAsync` XML doc is now stale —
|
||||||
|
it still describes the query-before-redeploy behaviour that actually moved into
|
||||||
|
`TryReconcileWithSiteAsync` (DeploymentManager-017).
|
||||||
|
|
||||||
## Checklist coverage
|
## Checklist coverage
|
||||||
|
|
||||||
| # | Category | Examined | Notes |
|
| # | Category | Examined | Notes |
|
||||||
|---|----------|----------|-------|
|
|---|----------|----------|-------|
|
||||||
| 1 | Correctness & logic bugs | ✓ | Stuck `InProgress` record on unexpected exception; cancelled-token failure write. |
|
| 1 | Correctness & logic bugs | ✓ | Re-review 2026-05-17: reconciliation skips instance-state/snapshot updates (DeploymentManager-015) and keeps a stale `RevisionHash` (DeploymentManager-016). Prior: stuck `InProgress` / cancelled-token write (resolved). |
|
||||||
| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
|
| 2 | Akka.NET conventions | ✓ | Module is a plain service layer; it calls `CommunicationService` which wraps Ask. No actors here. No issues. |
|
||||||
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` is sound but leaks semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. |
|
| 3 | Concurrency & thread safety | ✓ | `OperationLockManager` ref-counts and reclaims semaphores; `DeployToAllSitesAsync` correctly builds commands sequentially before parallel send. No issues at re-review. |
|
||||||
| 4 | Error handling & resilience | ✓ | Several gaps — see DeploymentManager-001/002/003/004. |
|
| 4 | Error handling & resilience | ✓ | Prior gaps DeploymentManager-001/002/003/004 resolved and verified. No new issues. |
|
||||||
| 5 | Security | ✓ | SMTP credentials are serialized and broadcast to sites — see DeploymentManager-013. No injection vectors; no authz here (enforced upstream). |
|
| 5 | Security | ✓ | SMTP credential handling documented as an accepted design decision (DeploymentManager-013). No injection vectors; no authz here (enforced upstream). No new issues. |
|
||||||
| 6 | Performance & resource management | ✓ | Semaphore leak (DeploymentManager-005); artifact rebuild does N+1 method queries per external system. |
|
| 6 | Performance & resource management | ✓ | Semaphore leak resolved (DeploymentManager-005). No new issues. |
|
||||||
| 7 | Design-document adherence | ✓ | Missing query-before-redeploy (DeploymentManager-006); Diff View not implemented (DeploymentManager-007). |
|
| 7 | Design-document adherence | ✓ | Query-before-redeploy and Diff View implemented (DeploymentManager-006/007). Re-review: reconciliation path breaks the deployed-snapshot/instance-state invariants — see DeploymentManager-015. |
|
||||||
| 8 | Code organization & conventions | ✓ | Options class not bound to configuration — DeploymentManager-008. POCO/repo placement correct. |
|
| 8 | Code organization & conventions | ✓ | Options binding resolved (DeploymentManager-008). POCO/repo placement correct. No new issues. |
|
||||||
| 9 | Testing coverage | ✓ | No successful-deploy test, no lifecycle success test — DeploymentManager-011; dead `CreateCommand` helper — DeploymentManager-014. |
|
| 9 | Testing coverage | ✓ | Broad coverage added (success, lifecycle, lock serialization, reconciliation, artifact matrix). Re-review: reconciled-success path's missing side effects (DeploymentManager-015) are untested. |
|
||||||
| 10 | Documentation & comments | ✓ | Misleading timeout comment — DeploymentManager-009; stale option XML doc — DeploymentManager-012. |
|
| 10 | Documentation & comments | ✓ | Prior comment findings resolved. Re-review: `GetDeploymentStatusAsync` XML doc is now stale — DeploymentManager-017. |
|
||||||
|
|
||||||
## Findings
|
## Findings
|
||||||
|
|
||||||
@@ -710,3 +733,126 @@ the communication boundary. New tests:
|
|||||||
`DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId` (also
|
`DeployToAllSitesAsync_AllPerSiteCommandsShareTheSummaryDeploymentId` (also
|
||||||
covers DeploymentManager-010), `DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix`
|
covers DeploymentManager-010), `DeployToAllSitesAsync_PartialFailure_ReportsPerSiteMatrix`
|
||||||
(per-site success/failure matrix), `RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits`.
|
(per-site success/failure matrix), `RetryForSiteAsync_SiteSucceeds_ReturnsSuccessAndAudits`.
|
||||||
|
|
||||||
|
### DeploymentManager-015 — Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | High |
|
||||||
|
| Category | Correctness & logic bugs |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:631-655` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`TryReconcileWithSiteAsync` (the DeploymentManager-006 query-before-redeploy
|
||||||
|
path) handles the case where a prior `InProgress`/timeout-`Failed` record exists
|
||||||
|
and the site reports it already has the target revision hash. In that case it
|
||||||
|
marks the prior `DeploymentRecord` `Success`, audit-logs `DeployReconciled`, and
|
||||||
|
returns it — the caller then returns `Result.Success` and **never enters the
|
||||||
|
normal deploy body**.
|
||||||
|
|
||||||
|
The normal success path (`DeployInstanceAsync.cs:215-223`) does three things on
|
||||||
|
a successful site response: writes the deployment record terminal status, sets
|
||||||
|
`instance.State = InstanceState.Enabled` + `UpdateInstanceAsync`, and calls
|
||||||
|
`StoreDeployedSnapshotAsync`. The reconciliation shortcut performs only the
|
||||||
|
first. Consequently, after a reconciled deployment:
|
||||||
|
|
||||||
|
- The instance `State` is left at whatever it was (e.g. `NotDeployed` for a
|
||||||
|
first-time deploy that timed out, or `Disabled`) even though the site is
|
||||||
|
actually running the configuration — the central state machine and the site
|
||||||
|
diverge, and a subsequent `DisableInstanceAsync`/`EnableInstanceAsync` will be
|
||||||
|
rejected or allowed incorrectly by `StateTransitionValidator`.
|
||||||
|
- No `DeployedConfigSnapshot` is created or refreshed. A first-time deploy that
|
||||||
|
is resolved purely by reconciliation leaves `GetDeploymentComparisonAsync`
|
||||||
|
permanently returning `"No deployed snapshot found for this instance."`, and a
|
||||||
|
redeploy reconciliation leaves the stored snapshot showing the *old* config
|
||||||
|
even though the deployment record claims `Success` for the new revision.
|
||||||
|
|
||||||
|
The design ("Deployed vs. Template-Derived State", WP-4/WP-8) requires the
|
||||||
|
deployed snapshot and instance state to reflect the last successful deployment;
|
||||||
|
the reconciliation path silently breaks both invariants.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
In the reconciled-success branch of `TryReconcileWithSiteAsync`, perform the
|
||||||
|
same post-success side effects as the normal path: set `instance.State =
|
||||||
|
InstanceState.Enabled` (+ `UpdateInstanceAsync`) and call
|
||||||
|
`StoreDeployedSnapshotAsync` with the target deployment ID / revision hash /
|
||||||
|
config JSON. Factor the shared post-success logic into one helper so the normal
|
||||||
|
and reconciliation paths cannot drift. Add a regression test asserting that a
|
||||||
|
reconciled deployment leaves the instance `Enabled` and a snapshot stored.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### DeploymentManager-016 — Reconciled prior record keeps its stale `RevisionHash`
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Correctness & logic bugs |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:639-651` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
When `TryReconcileWithSiteAsync` reconciles a prior record, it mutates
|
||||||
|
`prior.Status`, `prior.ErrorMessage`, and `prior.CompletedAt`, but **not**
|
||||||
|
`prior.RevisionHash`. The reconciliation condition only compares the *site's*
|
||||||
|
`AppliedRevisionHash` against the *freshly-flattened* `targetRevisionHash` — it
|
||||||
|
does not require `prior.RevisionHash` to equal either of them.
|
||||||
|
|
||||||
|
The prior record can legitimately carry a different revision hash than the
|
||||||
|
current target: e.g. a deploy timed out at revision `R1`, the template was then
|
||||||
|
edited so the current flatten yields `R2`, and meanwhile the site actually
|
||||||
|
applied `R2` through some other path (or `R1` and `R2` are equal-by-content but
|
||||||
|
the prior record predates a hash recompute). After reconciliation the record's
|
||||||
|
`Status` is `Success` but its `RevisionHash` still says `R1`, so staleness
|
||||||
|
checks and any UI that reads `DeploymentRecord.RevisionHash` will report the
|
||||||
|
instance as deployed at the wrong revision. The audit `DeployReconciled` entry
|
||||||
|
records `RevisionHash = targetRevisionHash`, contradicting the persisted record.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
In the reconciled-success branch, also set `prior.RevisionHash =
|
||||||
|
targetRevisionHash` so the persisted record, the audit entry, and the site's
|
||||||
|
actual applied revision all agree. Alternatively, only reconcile when
|
||||||
|
`prior.RevisionHash == targetRevisionHash` and otherwise fall through to a
|
||||||
|
normal deploy.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### DeploymentManager-017 — `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Documentation & comments |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:562-570` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
The XML summary on `GetDeploymentStatusAsync` reads: *"WP-2: After
|
||||||
|
failover/timeout, query site for current deployment state before
|
||||||
|
re-deploying."* The method body does no such thing — it is a one-line
|
||||||
|
pass-through to `_repository.GetDeploymentByDeploymentIdAsync`, a pure local DB
|
||||||
|
read. The query-the-site-before-redeploy behaviour the comment describes was
|
||||||
|
implemented separately in `TryReconcileWithSiteAsync` (DeploymentManager-006).
|
||||||
|
The stale comment is a leftover of the original design intent and misleads a
|
||||||
|
reader into thinking this method contacts the site.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Reword the summary to describe what the method actually does — "returns the
|
||||||
|
current persisted `DeploymentRecord` for the given deployment ID from the
|
||||||
|
configuration database" — and, if useful, cross-reference
|
||||||
|
`TryReconcileWithSiteAsync` as the place the site-query reconciliation lives.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|||||||
@@ -5,10 +5,10 @@
|
|||||||
| Module | `src/ScadaLink.ExternalSystemGateway` |
|
| Module | `src/ScadaLink.ExternalSystemGateway` |
|
||||||
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
|
| Design doc | `docs/requirements/Component-ExternalSystemGateway.md` |
|
||||||
| Status | Reviewed |
|
| Status | Reviewed |
|
||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-17 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `39d737e` |
|
||||||
| Open findings | 0 |
|
| Open findings | 3 |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -30,20 +30,41 @@ Test coverage is thin — `CachedCall` transient/buffering paths and `DatabaseGa
|
|||||||
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
|
are entirely untested. Themes: incomplete wiring against the S&F engine, and design-doc
|
||||||
requirements (timeout, retry settings) that are declared but not implemented.
|
requirements (timeout, retry settings) that are declared but not implemented.
|
||||||
|
|
||||||
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||||
|
|
||||||
|
All fourteen prior findings remain `Resolved`; the resolutions for findings 001–014
|
||||||
|
were spot-checked against the current source and hold. The re-review walked the full
|
||||||
|
10-category checklist again and surfaced **three new findings**. The most serious
|
||||||
|
(`ExternalSystemGateway-015`, High) is a regression *introduced by* the
|
||||||
|
`ExternalSystemGateway-004` resolution: `CachedCall`/`CachedWrite` now pass a
|
||||||
|
per-system/per-connection `MaxRetries` of `0` through verbatim, but
|
||||||
|
`StoreAndForwardService.RetryMessageAsync` interprets a stored `MaxRetries` of `0` as
|
||||||
|
**retry forever**, not "never retry" — so the very `0` the ESG-004 fix claims to
|
||||||
|
"honour as never retry" actually produces an unbounded retry loop, and two ESG tests
|
||||||
|
assert the broken behaviour. `ExternalSystemGateway-016` (Medium) is that the
|
||||||
|
`ExternalSystemGateway-013` resolution used `ConfigureHttpClientDefaults`, which is a
|
||||||
|
**process-global** registration — it forces a `SocketsHttpHandler` (capped at the ESG
|
||||||
|
option) onto every `HttpClient` in the host, including the Notification Service's
|
||||||
|
OAuth2 token client, not just the gateway's per-system clients. `ExternalSystemGateway-017`
|
||||||
|
(Low) is a trailing-`?` URL nit when a GET method's parameters are all null. Theme:
|
||||||
|
both substantive findings are second-order defects in earlier fixes — the earlier
|
||||||
|
resolutions did not verify the downstream contract of the S&F engine they integrate
|
||||||
|
with.
|
||||||
|
|
||||||
## Checklist coverage
|
## Checklist coverage
|
||||||
|
|
||||||
| # | Category | Examined | Notes |
|
| # | Category | Examined | Notes |
|
||||||
|---|----------|----------|-------|
|
|---|----------|----------|-------|
|
||||||
| 1 | Correctness & logic bugs | ☑ | URL building edge cases, dropped S&F result, classification gaps — findings 003, 006, 009. |
|
| 1 | Correctness & logic bugs | ☑ | Prior: URL edge cases, dropped S&F result, classification — 003, 006, 009. Re-review: `MaxRetries == 0` retry-forever vs never-retry contradiction (015); trailing-`?` URL nit (017). |
|
||||||
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found in this module. |
|
| 2 | Akka.NET conventions | ☑ | No actors in this module; `AddExternalSystemGatewayActors` is a no-op. Blocking-I/O isolation is delegated to Site Runtime. No issues found. |
|
||||||
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; `ExternalCallResult.Response` lazy-parse is not thread-safe but instances are single-use. No findings raised. |
|
| 3 | Concurrency & thread safety | ☑ | Services are stateless and DI-scoped; the S&F delivery handlers resolve in a fresh DI scope on the sweep thread. No findings raised. |
|
||||||
| 4 | Error handling & resilience | ☑ | S&F handler never registered, double-dispatch, timeout not applied, cancellation conflation — findings 001, 002, 003, 008. |
|
| 4 | Error handling & resilience | ☑ | Prior: handler registration, double-dispatch, timeout, cancellation — 001, 002, 003, 008. Re-review: the unbounded-retry consequence of finding 015 is also a resilience defect (recorded under category 1). |
|
||||||
| 5 | Security | ☑ | Auth secrets logged-safe, but error bodies echoed verbatim — finding 007. |
|
| 5 | Security | ☑ | Error bodies now truncated (007). No new issues — auth secrets not logged, body capped. |
|
||||||
| 6 | Performance & resource management | ☑ | `HttpRequestMessage`/`HttpResponseMessage` and failed `SqlConnection` not disposed; full repository scan per call — findings 005, 010, 011. |
|
| 6 | Performance & resource management | ☑ | Prior: disposal and repository-scan findings 005, 010, 011 — all resolved and verified. No new issues. |
|
||||||
| 7 | Design-document adherence | ☑ | Timeout, retry settings, audit logging gaps — findings 002, 004, 012. |
|
| 7 | Design-document adherence | ☑ | Prior: timeout, retry settings, logging — 002, 004, 012. Re-review: finding 015 is also a design-adherence gap (S&F retry contract); recorded under category 1. |
|
||||||
| 8 | Code organization & conventions | ☑ | Options class correctly owned by module; `MaxConcurrentConnectionsPerSystem` unused — finding 013. |
|
| 8 | Code organization & conventions | ☑ | Prior: `MaxConcurrentConnectionsPerSystem` wiring — 013. Re-review: that wiring uses process-global `ConfigureHttpClientDefaults`, leaking the ESG cap onto every host `HttpClient` — finding 016. |
|
||||||
| 9 | Testing coverage | ☑ | CachedCall buffering and DatabaseGateway untested — finding 014. |
|
| 9 | Testing coverage | ☑ | Coverage is broad after finding 014. Re-review note: the `ZeroMaxRetries...` tests assert the persisted column, not the sweep outcome, and so lock in the finding-015 defect. |
|
||||||
| 10 | Documentation & comments | ☑ | XML docs reference WP numbers; permanent-failure logging requirement unverified — folded into finding 012. |
|
| 10 | Documentation & comments | ☑ | Inline comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101` assert a "never retry" semantic that the code does not deliver — see finding 015. |
|
||||||
|
|
||||||
## Findings
|
## Findings
|
||||||
|
|
||||||
@@ -760,3 +781,144 @@ headers and body so URL/auth/body construction is now verified, not just status
|
|||||||
These are new-coverage tests against already-correct behaviour, so they pass on the
|
These are new-coverage tests against already-correct behaviour, so they pass on the
|
||||||
current source; the `BuildUrl` and `ApplyAuth` paths they exercise are now protected
|
current source; the `BuildUrl` and `ApplyAuth` paths they exercise are now protected
|
||||||
against regression.
|
against regression.
|
||||||
|
|
||||||
|
### ExternalSystemGateway-015 — `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | High |
|
||||||
|
| Category | Correctness & logic bugs |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:120-127`, `src/ScadaLink.ExternalSystemGateway/DatabaseGateway.cs:102-108` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
The `ExternalSystemGateway-004` fix removed the `MaxRetries > 0 ? ... : null` guard so
|
||||||
|
that `CachedCallAsync` and `CachedWriteAsync` now pass the definition's `MaxRetries`
|
||||||
|
to `StoreAndForwardService.EnqueueAsync` verbatim. The stated rationale (and the inline
|
||||||
|
comments at `ExternalSystemClient.cs:118-119` / `DatabaseGateway.cs:99-101`, plus the
|
||||||
|
tests `CachedCall_TransientFailure_ZeroMaxRetriesIsHonouredNotTreatedAsUnset` and
|
||||||
|
`CachedWrite_ZeroMaxRetriesIsHonouredNotTreatedAsUnset`) is that a configured
|
||||||
|
`MaxRetries` of `0` means **"never retry"**.
|
||||||
|
|
||||||
|
That is the opposite of what the Store-and-Forward engine actually does with the
|
||||||
|
value. `EnqueueAsync` stores the passed `maxRetries` directly into
|
||||||
|
`StoreAndForwardMessage.MaxRetries` (`StoreAndForwardService.cs:139`), whose own XML
|
||||||
|
doc states **"Maximum retry-sweep attempts before parking (0 = no limit)"**
|
||||||
|
(`StoreAndForwardMessage.cs:30`). The retry sweep enforces it as
|
||||||
|
`if (message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries)`
|
||||||
|
(`StoreAndForwardService.cs:285`) — when `MaxRetries == 0` that guard is false on
|
||||||
|
every sweep, so the message is **never parked and is retried forever**.
|
||||||
|
|
||||||
|
Consequences for a `CachedCall`/`CachedWrite` against a system or connection
|
||||||
|
configured with `MaxRetries = 0`:
|
||||||
|
|
||||||
|
1. A transiently-failing message that the operator intended never to retry is instead
|
||||||
|
retried on every sweep indefinitely, accumulating in the buffer and repeatedly
|
||||||
|
re-dispatching the request — the exact unbounded-retry / duplicate-delivery hazard
|
||||||
|
the idempotency note in the design doc warns about.
|
||||||
|
2. The two ESG regression tests cited above assert `max_retries == 0` is *stored* and
|
||||||
|
describe that as "honoured" — they verify the persisted column, never the resulting
|
||||||
|
sweep behaviour, so they lock in the broken semantics.
|
||||||
|
3. Because the SiteRuntime repository still always supplies `MaxRetries == 0` (the
|
||||||
|
open companion gap noted in `ExternalSystemGateway-004`), **every** cached call and
|
||||||
|
cached write at every site currently buffers as retry-forever. Before the ESG-004
|
||||||
|
fix the `> 0` guard sent `null`, so the S&F `DefaultMaxRetries` (a bounded value)
|
||||||
|
applied — i.e. the ESG-004 fix turned a bounded retry into an unbounded one for the
|
||||||
|
common path.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Reconcile the ESG and S&F interpretations of `MaxRetries == 0` — they must agree.
|
||||||
|
Either: (a) treat the entity's `MaxRetries == 0` as "unset" and pass `null` so the
|
||||||
|
bounded S&F default applies (reverting to the pre-ESG-004 behaviour, and accepting
|
||||||
|
that "never retry" then needs a different representation such as a nullable field or a
|
||||||
|
`MaxRetries == 1` convention); or (b) if `0` genuinely must mean "never retry", add an
|
||||||
|
explicit no-retry path — e.g. do not enqueue at all on transient failure when
|
||||||
|
`MaxRetries == 0`, or introduce a distinct sentinel — and fix the
|
||||||
|
`StoreAndForwardMessage.MaxRetries` doc and `RetryMessageAsync` guard so `0` no longer
|
||||||
|
means "no limit". Update the two `ZeroMaxRetries...` tests to assert the *sweep*
|
||||||
|
outcome (parked / not retried), not just the stored column value.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### ExternalSystemGateway-016 — `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Code organization & conventions |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:21-29` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
The `ExternalSystemGateway-013` fix wires `MaxConcurrentConnectionsPerSystem` by
|
||||||
|
calling `services.ConfigureHttpClientDefaults(builder => builder.ConfigurePrimaryHttpMessageHandler(...))`.
|
||||||
|
The inline comment claims this "applies to the dynamically-named clients created by
|
||||||
|
`ExternalSystemClient`" — but `ConfigureHttpClientDefaults` is **process-global**: it
|
||||||
|
adds the configuration to *every* `HttpClient`/`IHttpClientFactory` client created
|
||||||
|
anywhere in the host, regardless of name.
|
||||||
|
|
||||||
|
The Host registers the External System Gateway alongside other components that also
|
||||||
|
use `IHttpClientFactory` — notably `ScadaLink.NotificationService` (`OAuth2TokenService`
|
||||||
|
and its `ServiceCollectionExtensions` call `AddHttpClient`). With the ESG registration
|
||||||
|
present, the OAuth2 token client (and any future `HttpClient` consumer in the host)
|
||||||
|
has its **primary handler replaced** by a `SocketsHttpHandler` whose
|
||||||
|
`MaxConnectionsPerServer` is the ESG's `MaxConcurrentConnectionsPerSystem`. That:
|
||||||
|
|
||||||
|
1. Silently caps unrelated clients at a value owned by, and named for, a different
|
||||||
|
component — an operator tuning the ESG option unknowingly throttles Microsoft 365
|
||||||
|
OAuth2 token acquisition.
|
||||||
|
2. Overrides/discards any primary-handler configuration those other components add for
|
||||||
|
their own clients (e.g. a custom handler, proxy, or certificate settings).
|
||||||
|
|
||||||
|
This is a leaky, surprising side effect for what the option claims to be a per-ESG
|
||||||
|
setting; `ConfigureHttpClientDefaults` should not be used to express a single
|
||||||
|
component's policy.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Scope the handler configuration to the gateway's own clients only. The ESG already
|
||||||
|
creates per-system clients with a deterministic name pattern
|
||||||
|
(`ExternalSystem_{system.Name}`); register a typed/named `HttpClient` (or a small
|
||||||
|
factory abstraction) for that pattern and call `ConfigurePrimaryHttpMessageHandler`
|
||||||
|
on that registration instead of on the global defaults. If a name-pattern registration
|
||||||
|
is impractical, document the global effect explicitly and rename the option, but the
|
||||||
|
preferred fix is to stop using `ConfigureHttpClientDefaults`.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### ExternalSystemGateway-017 — `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Correctness & logic bugs |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:324-333` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
In `BuildUrl`, the GET/DELETE query-string branch is entered when
|
||||||
|
`parameters != null && parameters.Count > 0`, but the projection then filters out
|
||||||
|
null-valued entries (`parameters.Where(p => p.Value != null)`). When a GET/DELETE
|
||||||
|
method is invoked with a non-empty parameter dictionary whose values are *all* null,
|
||||||
|
`queryString` is the empty string and the code still executes `url += "?" + queryString`,
|
||||||
|
producing a URL ending in a bare `?` (e.g. `https://host/api/resource?`). Most servers
|
||||||
|
tolerate a trailing `?`, but it is an unintended artifact, can defeat response caching
|
||||||
|
on some endpoints, and makes captured request URLs harder to read in logs.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Only append `"?" + queryString` when `queryString` is non-empty (compute the joined
|
||||||
|
string first and check it), so a method whose effective parameter set is empty
|
||||||
|
produces a clean URL identical to the no-parameters case.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|||||||
@@ -5,10 +5,10 @@
|
|||||||
| Module | `src/ScadaLink.HealthMonitoring` |
|
| Module | `src/ScadaLink.HealthMonitoring` |
|
||||||
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
|
| Design doc | `docs/requirements/Component-HealthMonitoring.md` |
|
||||||
| Status | Reviewed |
|
| Status | Reviewed |
|
||||||
| Last reviewed | 2026-05-16 |
|
| Last reviewed | 2026-05-17 |
|
||||||
| Reviewer | claude-agent |
|
| Reviewer | claude-agent |
|
||||||
| Commit reviewed | `9c60592` |
|
| Commit reviewed | `39d737e` |
|
||||||
| Open findings | 0 |
|
| Open findings | 4 |
|
||||||
|
|
||||||
## Summary
|
## Summary
|
||||||
|
|
||||||
@@ -32,20 +32,38 @@ heartbeat path, and most collector setters. None of the findings are crash-class
|
|||||||
but the concurrency issues are Medium/High and the missing S&F metric is a real
|
but the concurrency issues are Medium/High and the missing S&F metric is a real
|
||||||
design-adherence gap.
|
design-adherence gap.
|
||||||
|
|
||||||
|
#### Re-review 2026-05-17 (commit `39d737e`)
|
||||||
|
|
||||||
|
All twelve prior findings (HealthMonitoring-001..012) are confirmed `Resolved` —
|
||||||
|
`SiteHealthState` is now an immutable `sealed record` mutated only via atomic
|
||||||
|
compare-and-swap, the store-and-forward buffer-depth metric is populated, the
|
||||||
|
central-site offline grace and the unknown-site heartbeat registration are in
|
||||||
|
place, and the test suite has grown to cover the report loop, heartbeat path, and
|
||||||
|
collector setters. This re-review found **4 new findings, all Low/Medium, none
|
||||||
|
crash-class**. They are residual polish items rather than behaviour regressions:
|
||||||
|
an inaccurate offline-check-interval comment (HealthMonitoring-013), unvalidated
|
||||||
|
`HealthMonitoringOptions` intervals that crash the hosted service on
|
||||||
|
misconfiguration (HealthMonitoring-014), a heartbeat-only registered site left
|
||||||
|
with a year-0001 `LastReportReceivedAt` that the UI's staleness display must
|
||||||
|
special-case (HealthMonitoring-015), and `CollectReport` reading
|
||||||
|
`DateTimeOffset.UtcNow` directly instead of the module's now-standard injected
|
||||||
|
`TimeProvider` (HealthMonitoring-016). The module remains small, readable, and
|
||||||
|
broadly faithful to the design intent.
|
||||||
|
|
||||||
## Checklist coverage
|
## Checklist coverage
|
||||||
|
|
||||||
| # | Category | Examined | Notes |
|
| # | Category | Examined | Notes |
|
||||||
|---|----------|----------|-------|
|
|---|----------|----------|-------|
|
||||||
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). |
|
| 1 | Correctness & logic bugs | x | `MarkHeartbeat` drops heartbeats for unregistered sites (HealthMonitoring-007); central self-report has no heartbeat grace (HealthMonitoring-005). Re-review: heartbeat-registered site left with year-0001 `LastReportReceivedAt` (HealthMonitoring-015). |
|
||||||
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
|
| 2 | Akka.NET conventions | x | Module itself contains no actors (transport abstracted via `IHealthReportTransport`); `AddHealthMonitoringActors` is a dead placeholder (HealthMonitoring-011). Actor-side wiring lives in Communication and is out of scope. |
|
||||||
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
|
| 3 | Concurrency & thread safety | x | Unguarded mutable `SiteHealthState` (HealthMonitoring-002); mutation inside `AddOrUpdate` delegate (HealthMonitoring-003); `GetAllSiteStates` leaks live mutable references (HealthMonitoring-008). Collector counters correctly use `Interlocked`. |
|
||||||
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010); top-level loop error handling is sound. |
|
| 4 | Error handling & resilience | x | `HealthReportSender` silently swallows inner failures with bare `catch {}` (HealthMonitoring-010, resolved); top-level loop error handling is sound. Re-review: `HealthMonitoringOptions` intervals unvalidated — a zero/negative value crashes the hosted service at `PeriodicTimer` construction (HealthMonitoring-014). |
|
||||||
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
|
| 5 | Security | x | No issues found. Module handles only numeric/string operational metrics, no secrets, no external input parsing, no auth surface. |
|
||||||
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
|
| 6 | Performance & resource management | x | `PeriodicTimer` instances correctly disposed via `using`. Dictionary snapshots per report are acceptable at the documented scale. No issues found. |
|
||||||
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
|
| 7 | Design-document adherence | x | Store-and-forward buffer depth metric unimplemented (HealthMonitoring-001); sequence seeding deviates from doc's "starting at 1" wording (HealthMonitoring-006). |
|
||||||
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011). |
|
| 8 | Code organization & conventions | x | Options class correctly owned by the component; POCO/messages in Commons. Dead placeholder method noted (HealthMonitoring-011, resolved). Re-review: `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of the module's now-standard injected `TimeProvider` (HealthMonitoring-016). |
|
||||||
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
|
| 9 | Testing coverage | x | No tests for `CentralHealthReportLoop`, `MarkHeartbeat`, offline-via-heartbeat, replica idempotency, or most collector setters (HealthMonitoring-009). |
|
||||||
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012). |
|
| 10 | Documentation & comments | x | Heartbeat interval is described inconsistently (~2s vs ~5s) across XML docs (HealthMonitoring-004, resolved); `LatestReport = null!` misrepresents the contract (HealthMonitoring-012, resolved). Re-review: offline-check-interval comment claims "(shorter)" timeout but code only uses `OfflineTimeout` (HealthMonitoring-013). |
|
||||||
|
|
||||||
## Findings
|
## Findings
|
||||||
|
|
||||||
@@ -560,3 +578,148 @@ has not yet sent a report"). A codebase-wide search confirms no `null!` suppress
|
|||||||
remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change
|
remains anywhere in `src/ScadaLink.HealthMonitoring`. This is exactly the change
|
||||||
HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so
|
HealthMonitoring-002 made when converting `SiteHealthState` to an immutable record, so
|
||||||
the contract is now honest and no further code change was required.
|
the contract is now honest and no further code change was required.
|
||||||
|
|
||||||
|
### HealthMonitoring-013 — Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout`
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Documentation & comments |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:194-196` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`ExecuteAsync` derives the `PeriodicTimer` cadence with the comment "Check at half
|
||||||
|
the (shorter) offline timeout interval for timely detection", but the code only
|
||||||
|
reads `_options.OfflineTimeout`:
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
var checkInterval = TimeSpan.FromMilliseconds(_options.OfflineTimeout.TotalMilliseconds / 2);
|
||||||
|
```
|
||||||
|
|
||||||
|
`CentralOfflineTimeout` (HealthMonitoring-005's fix) is never considered. With the
|
||||||
|
default options (`OfflineTimeout` 60s, `CentralOfflineTimeout` 3m) `OfflineTimeout`
|
||||||
|
genuinely is the shorter of the two, so the parenthetical happens to hold. But the
|
||||||
|
comment states an invariant the code does not enforce: if an operator configures
|
||||||
|
`CentralOfflineTimeout` *smaller* than `OfflineTimeout`, the check cadence stays
|
||||||
|
tied to `OfflineTimeout`, and central offline detection is delayed by up to a full
|
||||||
|
`OfflineTimeout / 2` beyond the intended `CentralOfflineTimeout` window. The comment
|
||||||
|
misleads a reader into believing the cadence already adapts to whichever timeout is
|
||||||
|
shorter.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Either compute `checkInterval` from `Math.Min(OfflineTimeout, CentralOfflineTimeout)`
|
||||||
|
so the code matches the comment, or drop the "(shorter)" wording and state plainly
|
||||||
|
that the cadence is derived from `OfflineTimeout` only (acceptable while the default
|
||||||
|
`CentralOfflineTimeout` is the larger value).
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### HealthMonitoring-014 — `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Error handling & resilience |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.HealthMonitoring/HealthMonitoringOptions.cs:3-20`, `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:196`, `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:67`, `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:63` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`HealthMonitoringOptions` is bound from the `ScadaLink:HealthMonitoring` config
|
||||||
|
section (`SiteServiceRegistration.BindSharedOptions`) with no validation —
|
||||||
|
no `IValidateOptions<HealthMonitoringOptions>`, no `ValidateDataAnnotations`, no
|
||||||
|
`ValidateOnStart`. `ReportInterval`, `OfflineTimeout`, and `CentralOfflineTimeout`
|
||||||
|
are all fed straight into `new PeriodicTimer(...)` (and `OfflineTimeout` into a
|
||||||
|
division for the check interval). `PeriodicTimer`'s constructor throws
|
||||||
|
`ArgumentOutOfRangeException` for a zero or negative period. A misconfigured
|
||||||
|
`appsettings.json` (e.g. `"ReportInterval": "00:00:00"`, an empty/garbled value
|
||||||
|
that binds to `TimeSpan.Zero`, or a negative span) therefore crashes the
|
||||||
|
`HealthReportSender` / `CentralHealthReportLoop` / `CentralHealthAggregator`
|
||||||
|
hosted service at startup with an opaque exception that does not name the
|
||||||
|
offending config key, rather than failing fast with a clear validation message.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Add an options validator (DataAnnotations `[Range]`-style on the spans, or an
|
||||||
|
`IValidateOptions<HealthMonitoringOptions>`) that rejects non-positive
|
||||||
|
`ReportInterval`/`OfflineTimeout`/`CentralOfflineTimeout` and ideally requires
|
||||||
|
`CentralOfflineTimeout >= OfflineTimeout`, and call `.ValidateOnStart()` so a bad
|
||||||
|
configuration fails fast with a message naming the section and key.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### HealthMonitoring-015 — Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt`
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Medium |
|
||||||
|
| Category | Correctness & logic bugs |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthAggregator.cs:122-130`, `src/ScadaLink.HealthMonitoring/SiteHealthState.cs:27` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
When `MarkHeartbeat` registers a previously-unknown site (HealthMonitoring-007's
|
||||||
|
fix), it sets `LastReportReceivedAt = default` — i.e. `DateTimeOffset.MinValue`
|
||||||
|
(`0001-01-01`). The XML doc on `SiteHealthState.LastReportReceivedAt` states the
|
||||||
|
field is "Used by the UI to surface report staleness during failover." A
|
||||||
|
heartbeat-only site therefore has `LatestReport == null` **and**
|
||||||
|
`LastReportReceivedAt == DateTimeOffset.MinValue`. Any UI code that computes
|
||||||
|
"last report N ago" as `now - LastReportReceivedAt` without first checking
|
||||||
|
`LatestReport != null` will render a nonsensical staleness of roughly two
|
||||||
|
thousand years for a site that is, in fact, freshly reachable. The two
|
||||||
|
"no report yet" signals (`LatestReport == null`, `LastReportReceivedAt == default`)
|
||||||
|
are independent and both must be special-cased; the sentinel value is an easy trap.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Make `LastReportReceivedAt` nullable (`DateTimeOffset?`) so "no report received
|
||||||
|
yet" is an explicit, unmissable state rather than a magic sentinel — consistent
|
||||||
|
with how `LatestReport` was already made nullable for the same case — and have UI
|
||||||
|
consumers render staleness only when it has a value. Alternatively, document the
|
||||||
|
`default` sentinel prominently on the field and audit every UI reader, but the
|
||||||
|
nullable option is safer and matches the existing `LatestReport` treatment.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|
||||||
|
### HealthMonitoring-016 — `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider`
|
||||||
|
|
||||||
|
| | |
|
||||||
|
|--|--|
|
||||||
|
| Severity | Low |
|
||||||
|
| Category | Code organization & conventions |
|
||||||
|
| Status | Open |
|
||||||
|
| Location | `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:151` |
|
||||||
|
|
||||||
|
**Description**
|
||||||
|
|
||||||
|
`CollectReport` stamps each report with `ReportTimestamp: DateTimeOffset.UtcNow`,
|
||||||
|
read directly from the system clock. Every other time-dependent class in the
|
||||||
|
module — `CentralHealthAggregator`, `HealthReportSender`, `CentralHealthReportLoop`
|
||||||
|
— was deliberately refactored (HealthMonitoring-006) to take an injectable
|
||||||
|
`TimeProvider` so the behaviour is deterministically testable and the clock
|
||||||
|
dependency is explicit. `SiteHealthCollector` is the lone holdout: the report
|
||||||
|
timestamp cannot be controlled in a unit test, which is why
|
||||||
|
`SiteHealthCollectorTests.CollectReport_IncludesUtcTimestamp` can only assert the
|
||||||
|
timestamp falls in a `before`/`after` wall-clock window rather than equalling a
|
||||||
|
known instant. This is a minor consistency/testability gap, not a behaviour bug.
|
||||||
|
|
||||||
|
**Recommendation**
|
||||||
|
|
||||||
|
Add an optional `TimeProvider` constructor parameter to `SiteHealthCollector`
|
||||||
|
(defaulting to `TimeProvider.System`, mirroring the other classes) and derive
|
||||||
|
`ReportTimestamp` from `GetUtcNow()`, so the report timestamp is deterministically
|
||||||
|
testable and the module is consistent.
|
||||||
|
|
||||||
|
**Resolution**
|
||||||
|
|
||||||
|
_Unresolved._
|
||||||
|
|||||||
@@ -40,10 +40,10 @@ module file and counted in **Total**.
|
|||||||
| Severity | Open findings |
|
| Severity | Open findings |
|
||||||
|----------|---------------|
|
|----------|---------------|
|
||||||
| Critical | 0 |
|
| Critical | 0 |
|
||||||
| High | 2 |
|
| High | 5 |
|
||||||
| Medium | 5 |
|
| Medium | 12 |
|
||||||
| Low | 10 |
|
| Low | 17 |
|
||||||
| **Total** | **17** |
|
| **Total** | **34** |
|
||||||
|
|
||||||
## Module Status
|
## Module Status
|
||||||
|
|
||||||
@@ -54,11 +54,11 @@ module file and counted in **Total**.
|
|||||||
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/1 | 2 | 10 |
|
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/1 | 2 | 10 |
|
||||||
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/2 | 2 | 14 |
|
| [Commons](Commons/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/2 | 2 | 14 |
|
||||||
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 15 |
|
| [Communication](Communication/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/2 | 4 | 15 |
|
||||||
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-16 | `9c60592` | 0/0/2/1 | 3 | 14 |
|
||||||
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-16 | `9c60592` | 0/1/2/1 | 4 | 17 |
|
||||||
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
|
||||||
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 14 |
|
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-16 | `9c60592` | 0/1/1/1 | 3 | 17 |
|
||||||
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 12 |
|
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-16 | `9c60592` | 0/0/1/3 | 4 | 16 |
|
||||||
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
| [Host](Host/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 11 |
|
||||||
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
| [InboundAPI](InboundAPI/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
||||||
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
| [ManagementService](ManagementService/findings.md) | 2026-05-16 | `9c60592` | 0/0/0/0 | 0 | 13 |
|
||||||
@@ -80,14 +80,17 @@ description, location, recommendation — lives in the module's `findings.md`.
|
|||||||
|
|
||||||
_None open._
|
_None open._
|
||||||
|
|
||||||
### High (2)
|
### High (5)
|
||||||
|
|
||||||
| ID | Module | Title |
|
| ID | Module | Title |
|
||||||
|----|--------|-------|
|
|----|--------|-------|
|
||||||
| CentralUI-020 | [CentralUI](CentralUI/findings.md) | Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot |
|
| CentralUI-020 | [CentralUI](CentralUI/findings.md) | Idle-session redirect never fires: `SessionExpiry` polls a frozen auth-state snapshot |
|
||||||
| Communication-012 | [Communication](Communication/findings.md) | gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover |
|
| Communication-012 | [Communication](Communication/findings.md) | gRPC client factory ignores the endpoint on a cache hit, breaking NodeA→NodeB stream failover |
|
||||||
|
| DataConnectionLayer-014 | [DataConnectionLayer](DataConnectionLayer/findings.md) | DCL-012 security warning is never logged in production: `RealOpcUaClient` is created without a logger |
|
||||||
|
| DeploymentManager-015 | [DeploymentManager](DeploymentManager/findings.md) | Site-query reconciliation marks a deployment `Success` but skips instance-state and snapshot updates |
|
||||||
|
| ExternalSystemGateway-015 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `MaxRetries == 0` is buffered as "retry forever", contradicting the ExternalSystemGateway-004 "never retry" claim |
|
||||||
|
|
||||||
### Medium (5)
|
### Medium (12)
|
||||||
|
|
||||||
| ID | Module | Title |
|
| ID | Module | Title |
|
||||||
|----|--------|-------|
|
|----|--------|-------|
|
||||||
@@ -96,8 +99,15 @@ _None open._
|
|||||||
| CentralUI-022 | [CentralUI](CentralUI/findings.md) | `Deployments` push handler fires `InvokeAsync` with no disposal guard |
|
| CentralUI-022 | [CentralUI](CentralUI/findings.md) | `Deployments` push handler fires `InvokeAsync` with no disposal guard |
|
||||||
| ClusterInfrastructure-009 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder |
|
| ClusterInfrastructure-009 | [ClusterInfrastructure](ClusterInfrastructure/findings.md) | `DownIfAlone` is an inert configuration knob — never consumed by the HOCON builder |
|
||||||
| Communication-013 | [Communication](Communication/findings.md) | Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller |
|
| Communication-013 | [Communication](Communication/findings.md) | Site gRPC address changes are never applied; `RemoveSiteAsync` has no production caller |
|
||||||
|
| ConfigurationDatabase-012 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Inbound-API `ApiKey.KeyValue` bearer credential stored in plaintext |
|
||||||
|
| ConfigurationDatabase-013 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Secret-column encryption silently falls back to an ephemeral (throwaway) key |
|
||||||
|
| DataConnectionLayer-015 | [DataConnectionLayer](DataConnectionLayer/findings.md) | Initial-connect failures never trigger failover; an unreachable primary at startup never tries the backup |
|
||||||
|
| DataConnectionLayer-016 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `HandleSubscribeCompleted` reports `SubscribeTagsResponse` success even on a connection-level subscribe failure |
|
||||||
|
| DeploymentManager-016 | [DeploymentManager](DeploymentManager/findings.md) | Reconciled prior record keeps its stale `RevisionHash` |
|
||||||
|
| ExternalSystemGateway-016 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ConfigureHttpClientDefaults` applies the ESG connection cap to every `HttpClient` in the host process |
|
||||||
|
| HealthMonitoring-015 | [HealthMonitoring](HealthMonitoring/findings.md) | Heartbeat-registered site is left with a year-0001 `LastReportReceivedAt` |
|
||||||
|
|
||||||
### Low (10)
|
### Low (17)
|
||||||
|
|
||||||
| ID | Module | Title |
|
| ID | Module | Title |
|
||||||
|----|--------|-------|
|
|----|--------|-------|
|
||||||
@@ -111,3 +121,10 @@ _None open._
|
|||||||
| Commons-014 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy` |
|
| Commons-014 | [Commons](Commons/findings.md) | `OpcUaEndpointConfigSerializer.Deserialize` can mislabel a corrupt typed row as `Legacy` |
|
||||||
| Communication-014 | [Communication](Communication/findings.md) | Untrusted gRPC `correlation_id` flows directly into an Akka actor name |
|
| Communication-014 | [Communication](Communication/findings.md) | Untrusted gRPC `correlation_id` flows directly into an Akka actor name |
|
||||||
| Communication-015 | [Communication](Communication/findings.md) | No test exercises the real gRPC client factory across a node flip |
|
| Communication-015 | [Communication](Communication/findings.md) | No test exercises the real gRPC client factory across a node flip |
|
||||||
|
| ConfigurationDatabase-014 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Redundant, inconsistent cast on one `HasConversion` call |
|
||||||
|
| DataConnectionLayer-017 | [DataConnectionLayer](DataConnectionLayer/findings.md) | `WriteBatchAsync` aborts the whole batch on a mid-batch disconnect |
|
||||||
|
| DeploymentManager-017 | [DeploymentManager](DeploymentManager/findings.md) | `GetDeploymentStatusAsync` XML doc describes behaviour it does not implement |
|
||||||
|
| ExternalSystemGateway-017 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `BuildUrl` appends a bare trailing `?` when a GET method's parameters are all null |
|
||||||
|
| HealthMonitoring-013 | [HealthMonitoring](HealthMonitoring/findings.md) | Offline-check interval comment claims "shorter timeout" but only ever uses `OfflineTimeout` |
|
||||||
|
| HealthMonitoring-014 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthMonitoringOptions` intervals are unvalidated; a zero/negative value crashes the hosted service |
|
||||||
|
| HealthMonitoring-016 | [HealthMonitoring](HealthMonitoring/findings.md) | `SiteHealthCollector.CollectReport` reads `DateTimeOffset.UtcNow` directly instead of an injected `TimeProvider` |
|
||||||
|
|||||||
Reference in New Issue
Block a user