Files
Joseph Doherty c899cb162c refactor: scrub residual ScadaLink refs → ScadaBridge (env vars, config keys, assembly name, SQL login)
Renames the 13 SCADALINK_* runtime env vars → SCADABRIDGE_*, the ScadaLink__
.NET config keys → ScadaBridge__, the stale ScadaLink.Host.exe assembly name
→ ZB.MOM.WW.ScadaBridge.Host.exe, the scadalink_app SQL login → scadabridge_app,
and residual identifiers/comments/docs. Migration records (prior rename
tooling/design, DB-rename helper, this scrub script) carved out.

Adds tools/scrub-scadalink-refs.sh.
2026-05-31 21:50:38 -04:00

82 KiB
Raw Permalink Blame History

Code Review — ConfigurationDatabase

Field Value
Module src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase
Design doc docs/requirements/Component-ConfigurationDatabase.md
Status Reviewed
Last reviewed 2026-05-28
Reviewer claude-agent
Commit reviewed 1eb6e97
Open findings 0

Summary

The ConfigurationDatabase module is a focused, conventional EF Core data-access layer: a single ScadaBridgeDbContext, Fluent API entity configurations, eight repository implementations of Commons-defined interfaces, an IAuditService implementation, an IInstanceLocator, environment-aware migration handling, and design-time tooling support. Overall structure adheres well to the design doc and the CLAUDE.md "Code Organization" decisions — POCO entities and interfaces live in Commons, EF mappings and implementations live here, Fluent API only, and optimistic concurrency is correctly applied to DeploymentRecord via rowversion. The module is generally healthy.

The main themes across findings are: (1) a genuine logic bug in GetTemplateWithChildrenAsync, which loads child templates and then discards them, so the method does not deliver what its name implies; (2) secret-bearing columns (SMTP credentials, external-system auth config, database connection strings) persisted in plaintext with no encryption-at-rest; (3) a hardcoded SQL sa connection string with a password literal embedded in DesignTimeDbContextFactory; (4) the no-arg AddConfigurationDatabase() overload, which silently registers nothing, making a misconfigured central node fail late and opaquely; and (5) audit-trail robustness gaps — AuditService can throw on serializing entities with navigation cycles, rolling back the whole business operation, and the design doc's claim that audit Id is Long/GUID disagrees with the int entity. Test coverage is good for the repositories that have tests (Security, CentralUI, audit, concurrency, seed data, data protection) but several repositories (TemplateEngineRepository, DeploymentManagerRepository, ExternalSystemRepository, InboundApiRepository, NotificationRepository, SiteRepository, InstanceLocator) have little or no direct coverage.

Re-review 2026-05-17 (commit 39d737e)

Re-reviewed the module at commit 39d737e. All eleven findings from the initial review (9c60592) remain Resolved — the secret-column encryption (EncryptedStringConverter + ApplySecretColumnEncryption), the fail-fast no-arg DI overload, the AsSplitQuery conversions, the audit cycle-safe serializer, and the added repository test coverage all verified present and consistent with their resolutions. Three new findings were recorded. The most material is that the encryption work done for CD-004 left one bearer credential out of scope: ApiKey.KeyValue — the inbound-API authentication secret — is still persisted in plaintext (ConfigurationDatabase-012); it cannot use the same non-deterministic Data Protection converter because authentication looks the key up by value, so it needs a hash-based scheme instead. The second is a resilience gap in the encryption plumbing itself: ApplySecretColumnEncryption silently substitutes a throwaway EphemeralDataProtectionProvider whenever no provider is supplied, so any context constructed via the single-argument constructor on a write path would encrypt secrets with a key discarded at process exit, yielding permanently undecryptable ciphertext with no error (ConfigurationDatabase-013). The third is a minor inconsistency — a redundant cast on one of the three HasConversion calls (ConfigurationDatabase-014). The module is otherwise healthy and the prior fixes hold up well.

Re-review 2026-05-28 (commit 1eb6e97)

Re-reviewed the module at commit 1eb6e97. All fourteen prior findings remain Resolved; their fixes still hold (encryption converter, fail-fast guard, peppered API-key hash, ephemeral-fallback hardening, etc.). The module has grown since the last review — new code includes Audit Log (#23) raw-SQL paths in AuditLogRepository (partition-switch purge, recursive execution-tree CTE, KPI snapshot, partition-boundary discovery), the AuditLogPartitionMaintenance SPLIT-RANGE roll-forward implementation, the AuditCorrelationContext scoped service that stamps BundleImportId, the SiteCallAuditRepository monotonic-rank upsert, and the NotificationOutboxRepository per-site KPI surface — and most of the new findings are concentrated in those raw-SQL paths and in latent gaps left behind by the CD-012 hash migration.

Ten new findings were recorded. The most material is ConfigurationDatabase-015: a check-then-act race in NotificationOutboxRepository.InsertIfNotExistsAsync with no duplicate-key catch — unlike the sibling Audit Log / Site Call ingest paths, a concurrent ack-after-persist on the same NotificationId will surface as an unhandled DbUpdateException and break the at-least-once site→central handoff. ConfigurationDatabase-016 flags that InboundApiRepository.GetApiKeyByValueAsync hashes the candidate with ApiKeyHasher.Default (unpeppered) while the production create-path uses the configured peppered hasher — any future caller (or test that exercises the method) will silently fail to find a real key; the production ApiKeyValidator happens not to call it, but the method is a publicly exposed IInboundApiRepository member and a latent bug. ConfigurationDatabase-017 records that the DeleteDeploymentRecordAsync stub-attach delete bypasses the documented optimistic-concurrency rule on DeploymentRecord.RowVersion — the SQLite tests pass because the test fixture re-maps RowVersion as a nullable concurrency token, but in production this is likely to throw DbUpdateConcurrencyException. ConfigurationDatabase-018 records the DateTime-typed *Utc columns on AuditEvent and SiteCall re-emerge as Kind=Unspecified on read; the sibling Commons module flagged the same pattern as Commons-019, and AuditLogPartitionMaintenance.GetMaxBoundaryAsync already defends against it with an explicit SpecifyKind(Utc) — but GetPartitionBoundariesOlderThanAsync does not (ConfigurationDatabase-020). ConfigurationDatabase-019 is the SPLIT-RANGE loop in AuditLogPartitionMaintenance.EnsureLookaheadAsync swallowing every SqlException as a Warning and continuing — a genuine failure (permissions, deadlock, transient) leaves a missing boundary and the next iteration cheerfully splits the following month, creating a hole. ConfigurationDatabase-021 is a low-severity hardening concern around SwitchOutPartitionAsync's raw-SQL interpolation of monthBoundaryStr / stagingTableName (currently safe by construction, but truncates fractional seconds). ConfigurationDatabase-022 is the stale "WP-24 stub" XML comment on DeploymentManagerRepository. ConfigurationDatabase-023 is a design-doc-adherence drift on IX_AuditLog_CorrelationId (design says IX_AuditLog_Correlation). ConfigurationDatabase-024 is missing test coverage for the SPLIT-RANGE failure-continuation behaviour and for the production-shape stub-attach delete with a real rowversion.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs GetTemplateWithChildrenAsync discards loaded children (CD-001); GetApprovedKeysForMethodAsync CSV parsing is brittle (CD-008).
2 Akka.NET conventions No actors in this module; data-access layer only. No issues found.
3 Concurrency & thread safety DbContext correctly scoped; optimistic concurrency on DeploymentRecord correct. Repositories hold no shared mutable state. No issues found.
4 Error handling & resilience WaitForDatabaseReadyAsync is sound. No-arg DI overload fails late and silently (CD-003, resolved); audit JSON serialization failure handling (CD-007, resolved). Re-review: ephemeral Data Protection fallback can silently produce undecryptable ciphertext (CD-013).
5 Security Hardcoded sa credential literal (CD-002, resolved); SMTP/DB-connection/auth secrets unencrypted (CD-004, resolved). Re-review: ApiKey.KeyValue bearer credential still stored in plaintext (CD-012).
6 Performance & resource management GetAllTemplatesAsync / GetTemplateTreeAsync eager-load multiple collections without AsSplitQuery (CD-009, resolved). No N+1 in audited paths. Re-review: no new issues.
7 Design-document adherence Audit Id type mismatch vs design doc (CD-005, resolved); seed data uses HasData consistent with design. Re-review: no new issues.
8 Code organization & conventions Mostly clean. Grpc* address columns unbounded (CD-006, resolved); inconsistent null-guard on injected context (CD-011, resolved). Re-review: redundant/inconsistent cast on one HasConversion call (CD-014).
9 Testing coverage Several repositories and InstanceLocator lack direct tests (CD-010).
10 Documentation & comments DeploymentManagerRepository "WP-24 stub" XML comment is stale; noted in module context but not raised as a standalone finding. No issues found beyond items above.

Re-review (2026-05-28, 1eb6e97):

# Category Examined Notes
1 Correctness & logic bugs GetPartitionBoundariesOlderThanAsync returns DateTimeKind.Unspecified (CD-020). GetApiKeyByValueAsync hashes with the unpeppered default (CD-016).
2 Akka.NET conventions No actors in this module. No issues found.
3 Concurrency & thread safety NotificationOutboxRepository.InsertIfNotExistsAsync check-then-act has no duplicate-key catch (CD-015). Stub-attach delete bypasses documented optimistic concurrency on DeploymentRecord.RowVersion (CD-017).
4 Error handling & resilience AuditLogPartitionMaintenance.EnsureLookaheadAsync swallows non-idempotent SPLIT failures and continues (CD-019).
5 Security SwitchOutPartitionAsync interpolates a DateTime string and a GUID-suffixed identifier into raw SQL — safe by construction but pattern is risky (CD-021).
6 Performance & resource management No new issues found.
7 Design-document adherence Index name drift: design says IX_AuditLog_Correlation, code uses IX_AuditLog_CorrelationId (CD-023).
8 Code organization & conventions DateTime *Utc columns on AuditEvent / SiteCall carry no DateTimeKind enforcement (CD-018).
9 Testing coverage No tests for SPLIT failure continuation and no production-shape rowversion stub-attach test (CD-024).
10 Documentation & comments Stale "WP-24 stub" XML comment on DeploymentManagerRepository (CD-022).

Findings

ConfigurationDatabase-001 — GetTemplateWithChildrenAsync loads child templates then discards them

Severity High
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/TemplateEngineRepository.cs:30-41

Description

GetTemplateWithChildrenAsync queries for all templates whose ParentTemplateId equals the requested id, assigns the result to the local variable children, and then returns template — the children list is never used, attached to the returned object, or otherwise exposed. The method is therefore behaviourally identical to GetTemplateByIdAsync but issues an extra database round-trip. Any caller relying on the method name to obtain a template with its derived/child templates populated will silently receive a template with no children, leading to incorrect template-resolution or UI behaviour with no error.

Recommendation

Either populate the children onto the returned aggregate (e.g. project into a result type that carries the children, or load them into a navigation collection that is actually returned), or remove the dead query and the misleading method if children are not in fact needed. If the navigation does not exist on the Template entity, add an explicit result tuple/DTO so the loaded data reaches the caller.

Resolution

Resolved 2026-05-16 (commit <pending>). Root cause confirmed against source: the method ran a Where(t => t.ParentTemplateId == id) query, assigned the result to a local children variable, and never used it — a misleading no-op that also issued an extra database round-trip per call.

Triage of the three callers (FlatteningPipeline.BuildTemplateChainAsync, ManagementActor.HandleGetTemplate, ManagementActor.HandleValidateTemplate) showed none consume derived/sub-templates; they all need the template's member collections (Attributes/Alarms/Scripts/Compositions), which GetTemplateByIdAsync already eager-loads. The Template entity has no child-templates navigation collection, and adding one (plus changing the interface signature) would require editing ZB.MOM.WW.ScadaBridge.Commons, which is outside this module's scope.

Fix applied the recommendation's secondary option: removed the dead query so the method no longer misleads or wastes a round-trip, and added an XML doc comment clarifying that "children" means the template's member collections. The method now honestly delegates to GetTemplateByIdAsync. Regression tests added in TemplateEngineRepositoryTests.cs: GetTemplateWithChildrenAsync_ReturnsTemplateWithAllMemberCollectionsPopulated, GetTemplateWithChildrenAsync_PreservesParentTemplateId_ForInheritanceChainWalk, and GetTemplateWithChildrenAsync_ReturnsNull_WhenTemplateDoesNotExist — pinning the template-aggregate contract the callers depend on.

ConfigurationDatabase-002 — Hardcoded sa connection string with embedded password literal

Severity Medium
Category Security
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/DesignTimeDbContextFactory.cs:21-22

Description

DesignTimeDbContextFactory falls back to a literal connection string "Server=localhost,1433;Database=ScadaBridge_Config;User Id=sa;Password=YourPassword;TrustServerCertificate=True" when no configured connection string is found. Embedding a credential literal (even a placeholder) in source code is a poor pattern: it is committed to version control, encourages copy-paste of sa/TrustServerCertificate=True into real environments, and the fallback can mask a genuine misconfiguration during dotnet ef operations by silently pointing tooling at an unintended database.

Recommendation

Remove the hardcoded fallback. If no connection string is resolved from configuration or environment, throw a clear InvalidOperationException instructing the developer to set ScadaBridge:Database:ConfigurationDb (or an environment variable). At minimum, read the design-time connection string from an environment variable rather than a literal, and never use sa.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against source: the factory fell back to a literal User Id=sa;Password=YourPassword;... connection string when no configured value was found. Removed the hardcoded fallback entirely. The factory now resolves the connection string from the Host's appsettings files or, when those are not present, from the SCADABRIDGE_DESIGNTIME_CONNECTIONSTRING environment variable, and throws a clear InvalidOperationException (naming both the config key and the env var) when neither yields a value. Also hardened SetBasePath to be applied only when the ZB.MOM.WW.ScadaBridge.Host directory exists, so the factory degrades cleanly instead of throwing DirectoryNotFoundException when run from a context without a sibling Host folder. Regression tests added in DesignTimeDbContextFactoryTests.cs: CreateDbContext_NoConnectionStringConfigured_ThrowsClearException, CreateDbContext_ConnectionStringFromEnvironmentVariable_IsUsed, and DesignTimeDbContextFactory_SourceContainsNoHardcodedSaCredential.

ConfigurationDatabase-003 — No-arg AddConfigurationDatabase() silently registers nothing

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/ServiceCollectionExtensions.cs:44-49

Description

The parameterless AddConfigurationDatabase() overload is a deliberate no-op "retained for backward compatibility during migration." If a central node is wired up with this overload by mistake, no ScadaBridgeDbContext, repositories, IAuditService, or IInstanceLocator are registered. The failure does not surface at startup; it surfaces much later as opaque DI resolution exceptions the first time any consumer requests a repository — far from the actual misconfiguration. The XML comment also refers to "Phase 0 stubs," which is stale relative to the current state of the module.

Recommendation

Either delete the no-op overload now that the connection-string overload exists, or mark it [Obsolete] with an error-level message so misuse is a compile-time failure. If a true "site node" no-op is genuinely required, give it an explicit, self-documenting name (e.g. AddConfigurationDatabaseNoOp()), and remove the stale "Phase 0" wording.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against source: the parameterless AddConfigurationDatabase() overload returned services unchanged, registering no DbContext, repositories, IAuditService, or IInstanceLocator. Applied the recommendation's first option: the overload is now marked [Obsolete(..., error: true)] so any source reference is a compile-time failure, and its body throws InvalidOperationException with an actionable message as defence-in-depth (covering reflection-based invocation or suppressed warnings). The stale "Phase 0 stubs / backward compatibility" XML comment was replaced with one explaining the obsoletion. The pre-existing ServiceRegistrationTests.AddConfigurationDatabase_NoArgs_DoesNotThrow test in UnitTest1.cs, which encoded the old buggy no-op contract, was updated to AddConfigurationDatabase_NoArgs_FailsFast to assert the corrected behaviour. New regression tests added in ServiceCollectionExtensionsTests.cs: AddConfigurationDatabase_NoArgOverload_FailsFastWithClearMessage and AddConfigurationDatabase_NoArgOverload_IsMarkedObsoleteAsError.

ConfigurationDatabase-004 — Secret-bearing columns stored in plaintext with no protection

Severity Medium
Category Security
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Configurations/NotificationConfiguration.cs:56-57, src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Configurations/ExternalSystemConfiguration.cs:25-26,75-77

Description

SmtpConfiguration.Credentials, ExternalSystemDefinition.AuthConfiguration, and DatabaseConnectionDefinition.ConnectionString all hold authentication secrets (SMTP OAuth2 client secrets / passwords, external-system API keys or Basic Auth credentials, and database passwords respectively). They are mapped as ordinary string columns and persisted verbatim. Anyone with read access to the configuration database — including audit-log JSON if these entities are serialized into AfterStateJson — obtains the plaintext secrets. The design doc does not call out encryption-at-rest for these fields, so the design is also silent on a real risk.

Recommendation

Apply encryption to these fields, e.g. an EF Core value converter backed by ASP.NET Data Protection (the module already configures IDataProtectionKeyContext), or rely on SQL Server Always Encrypted / column encryption. Separately, ensure IAuditService callers never pass these secret-bearing entities (or that the serializer redacts the fields) so secrets do not leak into AuditLogEntry.AfterStateJson. Update the design doc to state the chosen at-rest protection.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against source: SmtpConfiguration.Credentials, ExternalSystemDefinition.AuthConfiguration, and DatabaseConnectionDefinition.ConnectionString were mapped as ordinary nvarchar(4000) columns and persisted verbatim.

Implemented the recommendation's first option — an in-module EF Core value converter backed by ASP.NET Data Protection, which the module already uses (IDataProtectionKeyContext, AddDataProtection().PersistKeysToDbContext). Added EncryptedStringConverter (purpose-scoped IDataProtector; Protect on write, Unprotect on read; null-safe; surfaces a clear message on a CryptographicException). ScadaBridgeDbContext gained an (options, IDataProtectionProvider) constructor and applies the converter to the three secret columns in OnModelCreating; the DI registration in ServiceCollectionExtensions now constructs the context with the registered provider. The secret columns were widened to HasMaxLength(8000) (EF maps this to nvarchar(max) on SQL Server) so ciphertext expansion cannot truncate the value; migration 20260517010521_EncryptSecretColumns carries the column-type change. Regression tests added in SecretEncryptionTests.cs verify the raw column value is never the plaintext secret and that EF transparently decrypts on read, for all three columns plus a null round-trip.

The encryption scheme itself is fully in-module; the only remaining cross-cutting item is a documentation gap — the design doc does not yet state encryption-at-rest for these fields. That doc update is outside this module's editable scope (constraint: edit only src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase, the tests, and this file) and is surfaced here for a follow-up to docs/requirements/Component-ConfigurationDatabase.md. The audit secret-leak concern is mitigated separately by CD-007's serializer hardening; whether callers should additionally redact secret-bearing entities before passing them to IAuditService is a caller-side concern in other modules and is also surfaced for follow-up. The code fix in this module is complete.

ConfigurationDatabase-005 — Audit Id type disagrees with the design doc

Severity Low
Category Design-document adherence
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Configurations/AuditConfiguration.cs:11 (entity src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Audit/AuditLogEntry.cs)

Description

The design doc's Audit Entry Schema table specifies Id as Long / GUID, and notes the audit table is append-only and retained indefinitely. The actual AuditLogEntry entity uses an int identity key. For a never-purged, append-only table that accumulates one row per save operation across the system lifetime, a 32-bit identity risks overflow over a long deployment horizon, and the code drifts from the documented schema.

Recommendation

Change AuditLogEntry.Id to long (and the corresponding migration column to bigint) to match the design doc and remove the overflow risk, or — if int is intentional — update the design doc's schema table to say int and justify it. Resolve the discrepancy in one direction.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against source: the AuditLogEntry entity declares int Id, while the design doc's Audit Entry Schema table said Long / GUID. The entity lives in ZB.MOM.WW.ScadaBridge.Commons (src/ZB.MOM.WW.ScadaBridge.Commons/Entities/Audit/AuditLogEntry.cs), which is outside this module's editable scope, so the discrepancy was resolved by aligning the design doc to the code — the recommendation's second option. The schema table now records Id as int (identity) with an explicit justification: a 32-bit identity matches the key type of every other entity in the schema (uniform repository/query code), and at a sustained 100 rows/second the int range is not exhausted for roughly 680 years, so the indefinite-retention policy poses no realistic overflow risk; if a future deployment ever approaches the limit the column can be widened to bigint via a migration without a schema redesign. No regression test is meaningful for a documentation alignment; the existing AuditConfiguration (HasKey(a => a.Id)) and the audit repository tests already exercise the int key end to end.

ConfigurationDatabase-006 — Site.GrpcNodeAAddress / GrpcNodeBAddress columns are unbounded

Severity Low
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Configurations/SiteConfiguration.cs:24-25

Description

SiteConfiguration explicitly sets HasMaxLength(500) for NodeAAddress and NodeBAddress, but the entity also has GrpcNodeAAddress and GrpcNodeBAddress (added per the gRPC streaming design decision) which are not configured at all. With no length set, EF Core maps them to nvarchar(max). This is inconsistent with the sibling address columns, wastes the opportunity to constrain input, and nvarchar(max) columns cannot be indexed and have different storage/performance characteristics.

Recommendation

Add builder.Property(s => s.GrpcNodeAAddress).HasMaxLength(500); and the same for GrpcNodeBAddress, matching the existing NodeAAddress/NodeBAddress mapping, and generate a migration to alter the column types.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against source: SiteConfiguration configured NodeAAddress/NodeBAddress with HasMaxLength(500) but left GrpcNodeAAddress/GrpcNodeBAddress unconfigured, so EF mapped them to nvarchar(max) — inconsistent with the sibling columns and non-indexable. Applied the recommendation: added builder.Property(s => s.GrpcNodeAAddress).HasMaxLength(500) and the same for GrpcNodeBAddress. Generated migration 20260517020720_BoundGrpcNodeAddressLength altering both columns from nvarchar(max) to nvarchar(500) (the model snapshot was updated to match). Regression tests added in SchemaConfigurationTests.cs: GrpcNodeAddressColumns_AreLengthBoundedTo500 (theory over both columns, asserting the EF model metadata reports MaxLength == 500) and GrpcNodeAddressColumns_MatchSiblingNodeAddressBounds (asserting the gRPC columns share the bound of the NodeAAddress/NodeBAddress siblings).

ConfigurationDatabase-007 — AuditService does not handle JSON-serialization failure of arbitrary afterState

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Services/AuditService.cs:28-30

Description

LogAsync serializes the caller-supplied afterState object with JsonSerializer.Serialize(afterState) using default options. EF entity POCOs commonly have navigation properties; serializing an entity that has loaded navigations (e.g. a Template with Attributes/Scripts, or any entity with a cycle) will throw JsonException for a reference cycle or produce a very large payload. Because audit writes are designed to commit in the same transaction as the change, a serialization exception thrown here will roll back the entire business operation — a template update fails because its audit entry could not be serialized. This couples audit robustness to the shape of every entity passed in.

Recommendation

Configure JsonSerializerOptions with ReferenceHandler.IgnoreCycles (or Preserve) and a sensible MaxDepth, and consider serializing a projected DTO/snapshot rather than the live tracked entity. Decide explicitly whether an audit serialization failure should fail the operation or be logged and degraded gracefully, and document that decision against the design doc's transactional-guarantee section.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against source: LogAsync called JsonSerializer.Serialize(afterState) with default options, so any afterState graph containing a reference cycle threw JsonException — and because the audit entry commits in the same transaction as the change it records, that exception rolled back the entire business operation.

Fix applied per the recommendation: AuditService now serializes via a static JsonSerializerOptions configured with ReferenceHandler.IgnoreCycles and MaxDepth = 32. The serialization is additionally wrapped in a SerializeAfterState helper that catches a residual JsonException/NotSupportedException and substitutes a small diagnostic placeholder JSON (AuditSerializationError + StateType) — an explicit decision that an audit-serialization failure must degrade gracefully and never roll back the audited operation. The audit entry is always recorded; the design doc's transactional-guarantee section ("if the change succeeds, the audit entry is always recorded") is thereby honoured even for pathological state objects. Regression test added in AuditServiceTests.cs: LogAsync_AfterStateWithReferenceCycle_DoesNotThrow_AndDoesNotRollBackOperation.

ConfigurationDatabase-008 — GetApprovedKeysForMethodAsync CSV parsing silently drops malformed ids

Severity Low
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/InboundApiRepository.cs:46-58

Description

ApiMethod.ApprovedApiKeyIds is stored as a comma-separated string of integer ids. GetApprovedKeysForMethodAsync splits it, maps each token with int.TryParse(...) ? id : -1, then filters with id > 0. Any token that fails to parse, or a legitimately negative/zero id, is silently discarded. If ApprovedApiKeyIds becomes corrupt (e.g. a stray name instead of an id), the method quietly returns fewer approved keys than expected, which for an API-key authorization path means a method may unexpectedly reject a key that should be approved. Storing a relational many-to-many as a CSV string in a column is itself fragile (no FK integrity, no cascade on key delete).

Recommendation

Short term: log a warning when a token fails to parse instead of silently dropping it, so corruption is observable. Longer term: replace the CSV column with a proper join table (ApiMethodApprovedKey) with foreign keys to ApiMethod and ApiKey, which gives referential integrity and correct cascade behaviour when an API key is deleted.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against source: GetApprovedKeysForMethodAsync mapped each CSV token with int.TryParse(...) ? id : -1 then filtered id > 0, so any unparseable (or non-positive) token was discarded with no signal — a corrupt ApprovedApiKeyIds value silently approves fewer keys than intended, an authorization-relevant outcome.

Applied the recommendation's short-term fix: the parse loop was rewritten to log a warning for every token that fails to parse to a positive integer, naming the method id and the offending token, so corruption is observable in logs. Valid ids still resolve normally. InboundApiRepository gained an optional ILogger<InboundApiRepository> constructor parameter (defaulting to NullLogger, matching the MigrationHelper pattern) and the project now references Microsoft.Extensions.Logging.Abstractions. The longer-term join-table redesign would change the ApiMethod entity / schema and the IInboundApiRepository contract (Commons, out of this module's scope) and is left as a future schema-design item. Regression tests added in InboundApiRepositoryTests.cs: GetApprovedKeysForMethod_WithMalformedCsvToken_LogsWarningAndDropsToken, GetApprovedKeysForMethod_WithValidCsv_ReturnsAllKeys, and GetApprovedKeysForMethod_WithNullOrEmptyCsv_ReturnsEmptyWithoutWarning (using a capturing ILogger to assert the warning is emitted only on malformed input).

ConfigurationDatabase-009 — Multi-collection eager loads issue cartesian-product queries

Severity Low
Category Performance & resource management
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/TemplateEngineRepository.cs:43-51,53-61, src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/CentralUiRepository.cs:45-55

Description

GetAllTemplatesAsync, GetTemplatesComposingAsync, and GetTemplateTreeAsync each Include three-to-four sibling collections (Attributes, Alarms, Scripts, Compositions) in a single query. EF Core's default single-query strategy produces a cartesian-product join across those collections, so a template with N attributes, M alarms, and K scripts yields N×M×K rows that EF must then de-duplicate. For templates with many members this materially inflates the result set and query time. GetInstanceByIdAsync/GetAllInstancesAsync have the same shape with three collections.

Recommendation

Add .AsSplitQuery() to these multi-collection-include queries (or set UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery) globally in AddConfigurationDatabase) so each collection is loaded with a separate query and the cartesian explosion is avoided.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against source: GetAllTemplatesAsync and GetTemplatesComposingAsync (TemplateEngineRepository) and GetTemplateTreeAsync (CentralUiRepository) each Include three-to-four sibling collections in a single query, producing a cartesian-product join. The same shape was also present in GetTemplateByIdAsync, GetInstanceByIdAsync, GetAllInstancesAsync, GetInstancesBySiteIdAsync, and GetInstanceByUniqueNameAsync.

Applied the recommendation's per-query option: .AsSplitQuery() was added to every multi-collection-include query in TemplateEngineRepository (eight call sites) and to GetTemplateTreeAsync in CentralUiRepository, so each collection loads with its own query and the cartesian explosion is avoided. Per-query AsSplitQuery() was preferred over a global UseQuerySplittingBehavior so single-collection queries elsewhere keep the cheaper single-query plan. Split queries change query shape only, not results; regression tests added in SchemaConfigurationTests.cs pin that behaviour: GetAllTemplatesAsync_WithMultipleMembersPerCollection_LoadsAllWithoutDuplication (a template with 3 attributes, 2 alarms, 4 scripts must return exactly those counts — not a 24-row cartesian product) and GetTemplateByIdAsync_WithMultipleMembers_LoadsAllCollections.

ConfigurationDatabase-010 — Several repositories and InstanceLocator lack direct test coverage

Severity Low
Category Testing coverage
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/TemplateEngineRepository.cs, Repositories/DeploymentManagerRepository.cs, Repositories/ExternalSystemRepository.cs, Repositories/InboundApiRepository.cs, Repositories/NotificationRepository.cs, Repositories/SiteRepository.cs, Services/InstanceLocator.cs

Description

The test project covers SecurityRepository, CentralUiRepository, AuditService, optimistic concurrency, seed data, and Data Protection persistence. There are no direct tests for TemplateEngineRepository (the largest repository, and the one with the CD-001 bug, which a test would have caught), DeploymentManagerRepository (including its Local-then-stub delete fallback and the DeleteInstanceAsync restrict-FK-cleanup logic), ExternalSystemRepository, InboundApiRepository (notably GetApprovedKeysForMethodAsync CSV parsing — CD-008), NotificationRepository, SiteRepository (including its stub-attach delete path), or InstanceLocator.

Recommendation

Add repository-level tests using the existing SqliteTestHelper pattern, covering at minimum: CRUD round-trips, the stub-attach delete fallbacks in DeploymentManagerRepository/SiteRepository, DeleteInstanceAsync's explicit deployment-record cleanup, GetApprovedKeysForMethodAsync with valid/malformed CSV, and InstanceLocator.GetSiteIdForInstanceAsync for found/not-found cases.

Resolution

Resolved 2026-05-16 (commit pending). Direct repository/service tests were added using the existing SqliteTestHelper pattern. InboundApiRepositoryTests.cs covers InboundApiRepository (API-key/method CRUD round-trips and the GetApprovedKeysForMethodAsync valid/malformed/empty-CSV cases — see CD-008). RepositoryCoverageTests.cs adds ExternalSystemRepositoryTests (definition/method CRUD, parent-filtered method query, database-connection delete), NotificationRepositoryTests (notification-list-with-recipients and SMTP-configuration round-trips, list delete), SiteRepositoryTests (site/identifier round-trip plus the stub-attach delete fallback exercised for both DeleteSiteAsync and DeleteDataConnectionAsync by clearing the ChangeTracker, and the site-filtered instance query), DeploymentManagerRepositoryTests (deployment-record CRUD and GetCurrentDeploymentStatusAsync ordering, the stub-attach DeleteDeploymentRecordAsync fallback, and DeleteInstanceAsync's explicit Restrict-FK deployment-record cleanup), and InstanceLocatorTests (GetSiteIdForInstanceAsync for the found and not-found cases). TemplateEngineRepository gained the CD-001 and CD-009 regression tests (TemplateEngineRepositoryTests.cs, SchemaConfigurationTests.cs). A constructor null-guard test was added for each of the five repositories/services covered, doubling as the CD-011 regression guard. The full module suite is green.

ConfigurationDatabase-011 — Inconsistent constructor null-guarding across repositories/services

Severity Low
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/ExternalSystemRepository.cs:11-14, Repositories/InboundApiRepository.cs:11-14, Repositories/NotificationRepository.cs:11-14, Services/InstanceLocator.cs:13-16

Description

SecurityRepository, CentralUiRepository, TemplateEngineRepository, DeploymentManagerRepository, SiteRepository, and AuditService all guard their injected ScadaBridgeDbContext with ?? throw new ArgumentNullException(...). ExternalSystemRepository, InboundApiRepository, NotificationRepository, and InstanceLocator assign the constructor argument directly with no guard. This is a minor consistency/maintainability issue: although the DI container will not normally supply null, the divergence makes the codebase look unfinished and means a future hand-constructed instance fails with a less informative NullReferenceException later.

Recommendation

Apply the same ?? throw new ArgumentNullException(nameof(context)) guard in the four inconsistent constructors so all data-access types behave uniformly.

Resolution

Resolved 2026-05-16 (commit pending). Root cause confirmed against source: ExternalSystemRepository, InboundApiRepository, NotificationRepository, and InstanceLocator assigned the injected ScadaBridgeDbContext directly with no null guard, diverging from SecurityRepository/CentralUiRepository/TemplateEngineRepository/ DeploymentManagerRepository/SiteRepository/AuditService. Applied the recommendation: all four constructors now use context ?? throw new ArgumentNullException(nameof(context)) (InboundApiRepository's guard was added as part of its CD-008 constructor change), so every data-access type behaves uniformly and a hand-constructed instance fails with an informative exception at construction rather than a later NullReferenceException. Regression: Constructor_NullContext_Throws tests were added for all four affected types (InboundApiRepositoryTests.cs, RepositoryCoverageTests.cs).

ConfigurationDatabase-012 — Inbound-API ApiKey.KeyValue bearer credential stored in plaintext

Severity Medium
Category Security
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Configurations/InboundApiConfiguration.cs:17-19

Description

ApiKey.KeyValue is the bearer credential presented in the X-API-Key header to authenticate Inbound API requests (HighLevelReqs §7.27.3). It is mapped as an ordinary nvarchar(500) column with a unique index and persisted verbatim. Anyone with read access to the configuration database — or to any AuditLogEntry.AfterStateJson into which an ApiKey entity is serialized — obtains live API credentials in cleartext.

ConfigurationDatabase-004 introduced encryption-at-rest for the other secret-bearing columns (SMTP credentials, external-system auth config, database connection strings) but explicitly scoped ApiKey.KeyValue out. The omission is genuine: the EncryptedStringConverter built for CD-004 is backed by ASP.NET Data Protection, which is non-deterministic — the same plaintext encrypts to different ciphertext each time — so it cannot be applied here, because GetApprovedKeysForMethodAsync and the authentication path resolve a key by its value (GetApiKeyByValueAsync does FirstOrDefaultAsync(k => k.KeyValue == keyValue)). A non-deterministic converter would break that equality lookup. The result is that the one credential most exposed to external callers is the one credential left unprotected.

Recommendation

Store a salted cryptographic hash of the key value instead of the plaintext (or ciphertext): hash on create, and authenticate by hashing the presented key and comparing. This keeps the equality lookup working (the hash is deterministic) while ensuring the database never holds a usable credential. The plaintext key would then only ever be shown once, at creation time, to the Admin who created it. This requires a coordinated change with the Inbound API / Security components and the ApiKey entity in Commons; record the chosen scheme in docs/requirements/Component-ConfigurationDatabase.md and the Inbound API design doc. At minimum, ensure ApiKey entities are never passed to IAuditService without the key value redacted.

Resolution

Resolved 2026-05-17 (commit <pending>) — approved cross-module change spanning Commons + ConfigurationDatabase + InboundAPI + ManagementService. The inbound-API bearer credential is now persisted as a deterministic HMAC-SHA256 keyed hash, never as plaintext, so a configuration-database dump no longer yields usable API keys while the constant-time by-value lookup still works.

Root cause confirmed against source: ApiKey.KeyValue was a plaintext nvarchar(500) column with a unique index, holding the live X-API-Key credential.

Design implemented — deterministic keyed hash (the recommendation's first option):

  • Hash scheme. HMAC-SHA256 over the UTF-8 key bytes, keyed with a server-side pepper; the digest is stored Base64-encoded. No per-row random salt — an API key is already a high-entropy random token, and a random salt would break the deterministic by-value lookup the authentication path relies on. The pepper instead binds every hash to the deployment. Implemented as IApiKeyHasher / ApiKeyHasher in ZB.MOM.WW.ScadaBridge.Commons (Types/InboundApi/ApiKeyHasher.cs); the constructor rejects a missing or weak (< 16-char) pepper with ArgumentException — fail-fast.
  • Where the pepper lives. InboundApiOptions.ApiKeyPepper, a component-owned Options class already bound from the ScadaBridge:InboundApi configuration section (Options pattern); it is never hard-coded. AddInboundAPI registers IApiKeyHasher via a factory that reads the bound options, so a missing/weak pepper fails the deployment fast rather than degrading silently. (Operators must supply the pepper via configuration / a secret store; appsettings*.json is intentionally left without a value.)
  • Commons. ApiKey drops the persisted KeyValue property and gains a persisted KeyHash — the plaintext is never a field on the entity. ApiKey.FromHash(name, keyHash) creates an entity from a precomputed hash (the production create-path); the existing ApiKey(name, keyValue) constructor is retained as a convenience that hashes its input with the unpeppered default hasher.
  • ConfigurationDatabase. EF mapping maps KeyHash (nvarchar(256), required, unique index IX_ApiKeys_KeyHash); migration 20260517073000_HashApiKeyValue drops KeyValue/IX_ApiKeys_KeyValue, adds KeyHash, and deletes existing ApiKeys rows (a one-way hash cannot recover plaintext). GetApiKeyByValueAsync now hashes the supplied value and looks up by KeyHash.
  • InboundAPI. ApiKeyValidator takes an IApiKeyHasher; FindKeyConstantTime hashes the presented candidate and compares hashes via CryptographicOperations.FixedTimeEquals — the comparison stays constant-time.
  • ManagementService. HandleCreateApiKey generates a random 32-byte key, stores only its peppered hash, and returns the plaintext to the caller exactly once; HandleListApiKeys no longer exposes the hash (returns id/name/enabled only).

TDD regression tests (written first, confirmed failing, then passing): ApiKeyHasherTests + ApiKeyTests (Commons — determinism, pepper fail-fast, no plaintext property), ApiKeyHashValidationTests (InboundAPI — auth succeeds via hash, wrong key / wrong-pepper fail), ApiKeyCreationTests (ManagementService — only the hash is persisted, plaintext returned once), and SchemaConfigurationTests additions (ConfigurationDatabase — KeyHash mapped/uniquely-indexed, no KeyValue column). dotnet build of all five touched modules is clean and their test suites are green.

Operational consequence: every existing API key must be re-issued. A hash is one-way, so the HashApiKeyValue migration deletes existing key rows; after applying it, operators must create new keys (CLI / Management API / Central UI), distribute the one-time plaintext to callers, and re-approve the new keys on their methods.

Follow-up (outside this task's editable scope): record the hash scheme in docs/requirements/Component-ConfigurationDatabase.md and the Inbound API design doc, and update the Central UI API-keys page, which previously displayed a masked KeyValue.

ConfigurationDatabase-013 — Secret-column encryption silently falls back to an ephemeral (throwaway) key

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/ScadaBridgeDbContext.cs:107-124

Description

ApplySecretColumnEncryption resolves the Data Protection provider as _dataProtectionProvider ?? new EphemeralDataProtectionProvider(). The ?? fallback is reached whenever the context is constructed via the single-argument ScadaBridgeDbContext(DbContextOptions) constructor — i.e. whenever no provider was injected. An EphemeralDataProtectionProvider generates a key ring that lives only in process memory and is discarded at process exit.

For design-time dotnet ef tooling this is harmless (the XML remark correctly notes it only emits schema). The risk is on a runtime write path. The runtime currently gets the provider-bearing context only because AddConfigurationDatabase adds an AddScoped factory registration that overrides EF's activator-based registration. That override is the single thing standing between correct behaviour and silent data corruption: any future change that resolves a ScadaBridgeDbContext through a path the override does not cover — an AddPooledDbContextFactory/IDbContextFactory<ScadaBridgeDbContext> registration, a second AddDbContext call, a hand-constructed context in server code — would construct the context with the single-arg constructor, encrypt secret columns with a throwaway key, and persist ciphertext that becomes permanently undecryptable the moment the process restarts. There is no exception, no warning; the failure only surfaces later as CryptographicException on read (mis-attributed by EncryptedStringConverter to "the stored value was not written by this system").

Recommendation

Do not silently substitute an ephemeral provider for write-capable contexts. Either: (a) require the provider unconditionally and have design-time tooling pass an explicit ephemeral provider so the intent is visible at the call site; or (b) keep the single-arg constructor but mark contexts built without a real provider as schema-only — e.g. record a flag and have the encrypting converter throw a clear InvalidOperationException ("secret columns cannot be written without a configured Data Protection key ring") on the first Protect, instead of producing throwaway ciphertext. Also harden the DI wiring so a ScadaBridgeDbContext cannot be resolved through the EF-activator registration at all (e.g. register only the factory, or use AddDbContextFactory with the explicit constructor).

Resolution

Resolved 2026-05-17 (commit pending). Root cause confirmed against source: ApplySecretColumnEncryption resolved the provider as _dataProtectionProvider ?? new EphemeralDataProtectionProvider(), so a context built via the single-argument constructor would silently encrypt secret columns with a throwaway key and persist permanently undecryptable ciphertext with no error.

Applied the recommendation — a combination of options (a) and (b). The silent ephemeral substitution was removed: a no-provider context now builds its model with an internal SchemaOnlyDataProtector that satisfies schema generation but throws a clear InvalidOperationException if it is ever asked to Protect/Unprotect. On top of that, SaveChanges/SaveChangesAsync were overridden with a GuardSecretWritesHaveAKeyRing check that fails fast — before any database round-trip — with a clear InvalidOperationException (naming the offending Entity.Property) whenever a no-provider context has a pending non-null write to one of the three secret-bearing columns. A custom SecretAwareModelCacheKeyFactory (registered via OnConfiguring/ReplaceService) folds the presence of a real provider into EF's model cache key, so a write-capable and a schema-only context can no longer share one cached model. Design-time dotnet ef tooling (which only emits schema) and the test fixture both keep working — SqliteTestDbContext now passes an explicit EphemeralDataProtectionProvider, making the test intent visible at the call site as recommendation (a) asks. Regression tests added in EphemeralEncryptionFallbackTests.cs: SingleArgConstructor_WritingSecretColumn_FailsFast_DoesNotPersistThrowawayCiphertext, SingleArgConstructor_WritingNonSecretColumn_Succeeds, and ProviderConstructor_WritingSecretColumn_StillSucceeds. The DI-hardening sub-point (register only the factory) was not adopted — AddConfigurationDatabase already overrides the EF-activator registration with a provider-bearing factory, and the fail-fast guard now closes the residual gap for any other resolution path.

ConfigurationDatabase-014 — Redundant, inconsistent cast on one HasConversion call

Severity Low
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/ScadaBridgeDbContext.cs:121-123

Description

ApplySecretColumnEncryption calls HasConversion(converter) three times. The first two (SmtpConfiguration.Credentials, ExternalSystemDefinition.AuthConfiguration) pass converter directly; the third (DatabaseConnectionDefinition.ConnectionString) casts it to (Microsoft.EntityFrameworkCore.Storage.ValueConversion.ValueConverter). EncryptedStringConverter already derives from ValueConverter<string?, string?> (itself a ValueConverter), and the first two call sites compile fine without the cast, so the cast is redundant. The inconsistency makes the code look as though the third call needs special handling when it does not, and the fully-qualified type name inline adds noise.

Recommendation

Remove the cast so all three calls read identically as HasConversion(converter). If a ValueConverter-typed reference is genuinely wanted, give it a local variable of that type once and use it for all three.

Resolution

Resolved 2026-05-17 (commit pending). Partially re-triaged: the finding is correct that the inline fully-qualified cast was inconsistent and noisy, but its premise that the cast is purely redundant is wrong. DatabaseConnectionDefinition.ConnectionString is a non-nullable string; EncryptedStringConverter is ValueConverter<string?, string?>, so calling HasConversion(converter) with converter typed as EncryptedStringConverter on that property raises a real CS8620 nullability error (verified by build). The cast to the non-generic ValueConverter base was suppressing that — load-bearing, not noise. Applied the recommendation's second option: the converter is now held once in a single ValueConverter-typed local (with an explanatory comment), and all three HasConversion(converter) calls read identically with no inline cast or fully-qualified name. Behaviour is unchanged, so no behavioural regression test is meaningful (cf. CD-005); a forward guard was added in SchemaConfigurationTests.csSecretColumns_AllHaveEncryptedStringConverterApplied (theory over all three secret columns) — asserting each column keeps an EncryptedStringConverter.

ConfigurationDatabase-015 — NotificationOutboxRepository.InsertIfNotExistsAsync is a check-then-act race with no duplicate-key catch

Severity High
Category Concurrency & thread safety
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs:33-45

Resolution — rewrote InsertIfNotExistsAsync as a single raw-SQL IF NOT EXISTS (...) INSERT matching the AuditLogRepository.InsertIfNotExistsAsync and SiteCallAuditRepository.UpsertAsync patterns, with a SqlException catch on numbers 2601 (unique-index violation) and 2627 (primary-key/unique-constraint violation) returning false. Concurrent losers are logged at Debug and treated as no-ops, eliminating the site-retry livelock. Two SQLite-targeted assertions in RepositoryCoverageTests were migrated to a new MS SQL-fixture file tests/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase.Tests/Repositories/NotificationOutboxRepositoryIntegrationTests.cs, which also adds a 50-way parallel race test verifying exactly one row lands and no exception bubbles.

Description

InsertIfNotExistsAsync does AnyAsync(x => x.NotificationId == n.NotificationId), then — if false — AddAsync + SaveChangesAsync. There is a check-then-act window between the two operations: two sessions can both pass the AnyAsync check and both attempt the INSERT, and the loser surfaces as a uniqueness violation on the NotificationId primary key wrapped in a DbUpdateException / SqlException (error 2627). The site→central handoff for notifications is documented as at-least-once with ack-after-persist plus insert-if-not-exists; collisions on the same NotificationId are therefore not a "should never happen" but the expected contention mode. As written, the second concurrent ack throws, fails the site→central acknowledgement, and the site retries the same row again on its next forward — a livelock if the contending pair keeps racing.

The sibling raw-SQL IF NOT EXISTS … INSERT paths in AuditLogRepository.InsertIfNotExistsAsync (see SqlErrorUniqueIndexViolation / SqlErrorPrimaryKeyViolation handling at AuditLogRepository.cs:74-89) and SiteCallAuditRepository.UpsertAsync (SiteCallAuditRepository.cs:87-96) explicitly catch errors 2601/2627 and treat the loser as a no-op — exactly the right pattern for "first-write-wins idempotent ingest". This repository alone does not.

Recommendation

Either (a) rewrite the body as a single raw-SQL IF NOT EXISTS … INSERT and apply the same 2601/2627 catch-and-log-Debug pattern the AuditLog and SiteCall repositories use, or (b) wrap the existing flow in a try/catch around SaveChangesAsync that inspects the inner SqlException.Number and returns false (i.e. "another writer won the race") on 2601/2627. Option (a) is preferable because it collapses the two round-trips to one and matches the established idempotent-ingest pattern used elsewhere in the module. Add a regression test that simulates two concurrent InsertIfNotExistsAsync calls (using two open contexts) for the same NotificationId and asserts neither call throws and exactly one row lands.

ConfigurationDatabase-016 — InboundApiRepository.GetApiKeyByValueAsync hashes the candidate with the unpeppered ApiKeyHasher.Default

Severity Medium
Category Security
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/InboundApiRepository.cs:35-39

Resolution (2026-05-28): Took option (a) — InboundApiRepository ctor now accepts Func<IApiKeyHasher>? hasherAccessor = null (deferred resolution to sidestep startup-time pepper-validation in test composition roots that don't configure one). GetApiKeyByValueAsync calls _hasherAccessor() so the hash matches the production peppered digest. AddConfigurationDatabase registers the repository with a factory wiring () => sp.GetService<IApiKeyHasher>() ?? ApiKeyHasher.Default — the peppered hasher is used when available, falling back to Default only when none is registered (legacy / test composition). Regression test CD016_GetApiKeyByValue_UsesInjectedPepperedHasher_NotDefault asserts positively (peppered roundtrip) and negatively (Default hasher misses the key entirely, proving the lookup uses the injected hasher).

Description

GetApiKeyByValueAsync resolves an API key by its presented plaintext value by hashing the candidate and looking up KeyHash. The hash, however, is computed with the static ApiKeyHasher.Default (the fixed, deployment-independent unpeppered hasher used for tests). Production key creation uses the DI-registered, peppered IApiKeyHasher constructed from InboundApiOptions.ApiKeyPepper (see CD-012 resolution and ApiKeyHasher.ctor(string pepper)), so the stored KeyHash of any real key was produced under the deployment pepper. Hashing the candidate with the unpeppered Default yields a different digest, and the WHERE KeyHash = @hash lookup will never match a real key.

The production ApiKeyValidator (InboundAPI module) deliberately does NOT call this method — it fetches all keys and runs a constant-time comparison via the DI-registered hasher (ApiKeyValidator.cs:53-64) — so the immediate authentication path is unaffected. But GetApiKeyByValueAsync remains a publicly exposed IInboundApiRepository member; any new caller (a future admin tool, a CLI command, a test) that uses it under a peppered configuration will silently get a null result for an existing, valid key, and almost certainly mis-route the failure as "key not found".

Recommendation

Either (a) take IApiKeyHasher via constructor injection — alongside the existing ScadaBridgeDbContext and optional ILogger — and use it here so the repository participates in the same peppered scheme as the rest of the system; or (b) delete the method from both the implementation and IInboundApiRepository (Commons) on the grounds that the production authentication path correctly avoids it for timing reasons and there is no remaining valid caller. Add a regression test that constructs the repository under a real ApiKeyHasher("a-strong-pepper-value"), inserts an ApiKey.FromHash(...) using the same hasher, and asserts GetApiKeyByValueAsync returns the row — under option (a) it should pass; under option (b) the method no longer exists.

ConfigurationDatabase-017 — Stub-attach delete on DeploymentRecord bypasses optimistic concurrency

Severity Medium
Category Concurrency & thread safety
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:83-97

Resolution (2026-05-28): IDeploymentManagerRepository.DeleteDeploymentRecordAsync now requires a byte[] expectedRowVersion argument. The repository seeds entry.OriginalValues["RowVersion"] on both the Local-tracked and stub-attach branches so EF emits DELETE ... WHERE Id = @id AND RowVersion = @prior and surfaces a concurrent edit as DbUpdateConcurrencyException. A new SQLite regression test DeleteDeploymentRecord_StaleRowVersion_ThrowsConcurrencyException in ConcurrencyTests (backed by a dedicated RowVersionConcurrencyTestDbContext that keeps RowVersion as a caller-managed concurrency token) asserts the exception fires on a stale token; the existing DeleteDeploymentRecord_ViaStubAttachPath_RemovesEntity was updated to pass the fresh RowVersion.

Description

DeploymentRecord carries a SQL Server rowversion concurrency token (declared in DeploymentConfiguration and confirmed by ConcurrencyTests), per the design doc's "Optimistic concurrency is used on deployment status records". When DeleteDeploymentRecordAsync falls into its stub-attach branch (no tracked entity in _dbContext.DeploymentRecords.Local for the given id), it constructs new DeploymentRecord("stub", "stub") { Id = id }, Attaches it, and Removes it. The stub's RowVersion is left at its default null (or byte[0]).

EF Core's SQL Server provider generates the delete as DELETE FROM DeploymentRecords WHERE Id = @id AND RowVersion = @stubRowVersion — and the stub rowversion is not the row's real rowversion, so on a real SQL Server (with IsRowVersion() auto-populating the column) the WHERE never matches and SaveChanges throws DbUpdateConcurrencyException. The path is exercised by RepositoryCoverageTests.DeleteDeploymentRecord_ViaStubAttachPath_RemovesEntity — but the test fixture remaps RowVersion as a nullable IsConcurrencyToken() column without auto-population (SqliteTestHelper.ConfigureForTests), so the stored RowVersion is null AND the stub's RowVersion is null AND the SQLite delete matches. Production-shape behaviour is the opposite.

The same stub-attach pattern is used on SystemArtifactDeploymentRecord, Site, and DataConnection. Those entities have no rowversion token, so the production behaviour is correct for them — the issue is specific to DeploymentRecord.

Recommendation

Replace the stub-attach branch in DeleteDeploymentRecordAsync with a real lookup — await _dbContext.DeploymentRecords.FindAsync([id], ct) then Remove if non-null — mirroring DeleteInstanceAttributeOverrideAsync and DeleteDeployedSnapshotAsync. This loses the "delete by id without a read" micro-optimisation (a real concern only in batched-delete loops) but restores the documented concurrency contract. If the optimisation is genuinely required, attach a DeploymentRecord with the caller's known RowVersion (the caller had to fetch the row at some point) and accept the DbUpdateConcurrencyException as the correct concurrency signal. Add a regression test under MS SQL (extend RepositoryCoverageTests with a SQL-Server-flavoured fixture, or use MsSqlMigrationFixture) that asserts the stub-attach delete works when the real RowVersion is supplied.

ConfigurationDatabase-018 — DateTime-typed *Utc columns on AuditEvent / SiteCall carry no DateTimeKind enforcement

Severity Medium
Category Code organization & conventions
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs, Configurations/SiteCallEntityTypeConfiguration.cs (mappings for OccurredAtUtc, IngestedAtUtc, CreatedAtUtc, UpdatedAtUtc, TerminalAtUtc)

Resolution (2026-05-28): Added two private static ValueConverter<DateTime, DateTime> / ValueConverter<DateTime?, DateTime?> UTC-enforcing converters to AuditLogEntityTypeConfiguration and applied them to AuditEvent.OccurredAtUtc and AuditEvent.IngestedAtUtc via HasConversion(...). The converter re-tags DateTimeKind.Utc on hydrate (where SQL Server's datetime2 provider strips the Kind flag) and on write (so a producer-supplied Kind=Unspecified literal still lands as UTC in the model cache). Coordinates with the sibling Commons-019 resolution (init-setter on AuditEvent re-tags Kind=Utc at construction). Regression test in tests/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase.Tests/Configurations/AuditLogEntityTypeConfigurationTests.cs::Configure_UtcConverter_HydratesOccurredAtUtcAsKindUtc inserts an Unspecified-Kind value, re-reads through a cleared change-tracker, and asserts Kind == Utc on both columns. The SiteCall mapping is out of scope for this close (sibling component task).

Description

AuditEvent.OccurredAtUtc / IngestedAtUtc and SiteCall.CreatedAtUtc / UpdatedAtUtc / TerminalAtUtc / IngestedAtUtc are declared as DateTime (not DateTimeOffset) per the Audit Log #23 spec, with a UTC suffix convention. SQL Server's datetime2 provider strips the Kind flag on the wire — values inserted with DateTimeKind.Utc round-trip as DateTimeKind.Unspecified on read. The EF mappings add no HasConversion(...) to normalise the kind. The sibling Commons module just flagged the same pattern as Commons-019; in this module the consequence is concrete:

  • AuditLogPartitionMaintenance.GetMaxBoundaryAsync already defends with an explicit DateTime.SpecifyKind(dt, DateTimeKind.Utc) (see AuditLogPartitionMaintenance.cs:103-104). That defence is necessary precisely because the EF mapping does not enforce it.
  • AuditLogRepository.GetPartitionBoundariesOlderThanAsync does NOT defend — it returns reader.GetDateTime(0) directly with Kind=Unspecified (separate finding CD-020).
  • Downstream comparisons like DateTime.UtcNow (Kind=Utc) against a re-read OccurredAtUtc (Kind=Unspecified) do not produce a runtime error, but any code path that converts via .ToLocalTime() or .ToUniversalTime() will silently interpret an unspecified-kind value as local time and produce wrong results.

Recommendation

Apply a value converter on every DateTime-typed *Utc column that re-tags the Kind to Utc on read (and asserts/SpecifyKind on write to defend against an accidental local-kind write). EF Core's built-in UtcValueConverter-style pattern is a single line per column:

builder.Property(e => e.OccurredAtUtc)
    .HasConversion(
        v => v,
        v => DateTime.SpecifyKind(v, DateTimeKind.Utc));

Apply uniformly to AuditEvent (OccurredAtUtc, IngestedAtUtc), SiteCall (CreatedAtUtc, UpdatedAtUtc, TerminalAtUtc, IngestedAtUtc), and any other DateTime *Utc columns added later. Add a regression test that inserts a UTC row, re-reads it in a fresh context, and asserts Kind == DateTimeKind.Utc. Coordinate with the sibling Commons-019 finding so the resolution is consistent across both modules.

ConfigurationDatabase-019 — EnsureLookaheadAsync swallows non-idempotent SPLIT failures and continues, creating partition holes

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Maintenance/AuditLogPartitionMaintenance.cs:181-199

Resolution (2026-05-28): Took option (a) — dropped the try/catch (SqlException) around the per-month SPLIT loop entirely (and the now-unused using Microsoft.Data.SqlClient). By class-doc construction the catch could never fire for "boundary already exists" (the loop pre-reads max-boundary and only issues SPLITs for strictly-greater months), so its only effect was to mask real failures (permission revoked, deadlock victim, log full, filegroup full) and let the next iteration split the following month — leaving a permanent partition hole. Now any SqlException propagates out of EnsureLookaheadAsync, surfaces to the central daily-tick hosted service as an Error, and the next tick retries from the same boundary (at-least-once, no holes). Replaced the catch block with an inline comment explaining the rationale so a future maintainer doesn't reintroduce it.

Description

EnsureLookaheadAsync loops one month at a time from next up to horizon and issues ALTER PARTITION SCHEME … NEXT USED + ALTER PARTITION FUNCTION … SPLIT RANGE per month. The class doc says idempotency is guaranteed by reading the max-boundary first and only issuing SPLITs for strictly-greater months — so "boundary already exists" (SQL Server msg 7708/7711) cannot occur by construction. Yet the loop wraps each iteration in catch (SqlException ex) { _logger.LogWarning(...); } and continues, with the rationale "the desired end state (boundary present) is satisfied by either path."

That rationale is correct only for an "already-exists" error — which the pre-check makes impossible. Any other SqlException — a permissions failure (the scadabridge_audit_purger role's ALTER ON SCHEMA::dbo revoked or not granted), a deadlock victim, a transient connection drop, a transaction log full, an underlying filegroup full — leaves the boundary genuinely not created, logs a Warning (quiet by default in most appenders), and the next iteration tries to SPLIT the following month. That split can succeed (it is a different range value), creating a permanent hole in the partition layout: month N never had a partition created, month N+1 does, so any future row in month N lands in the partition that previously spanned both months and partition-switch purge for month N becomes impossible.

The class is the central singleton's daily-tick partition roll-forward, so the hole persists until an operator notices it and rebuilds manually — by which point months of audit retention may be locked behind the unsplit range.

Recommendation

Either (a) drop the try/catch entirely so any SPLIT failure aborts the loop and surfaces to the hosted service (the next tick retries — at-least-once with no holes), or (b) keep the catch but narrow it to ONLY the "boundary-already-exists" errors (SQL Server msg 7708 and 7711) and log at Debug, mirroring how AuditLogRepository.InsertIfNotExistsAsync narrowly catches 2601/2627. Option (a) is preferable: by class-doc construction the catch should never fire, so its only effect is to mask the real-failure case. Add tests that simulate a SPLIT failure (e.g. a permission denial via a constrained test login) and assert the loop aborts after the first failure with no further SPLITs.

ConfigurationDatabase-020 — GetPartitionBoundariesOlderThanAsync returns DateTime with Kind=Unspecified

Severity Low
Category Correctness & logic bugs
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/AuditLogRepository.cs:378-387

Resolution (2026-05-28): Wrapped the reader.GetDateTime(0) read with DateTime.SpecifyKind(..., DateTimeKind.Utc) so each returned boundary now carries Kind=Utc, matching the explicit defensive pattern already in AuditLogPartitionMaintenance.GetMaxBoundaryAsync. Added an inline comment explaining the rationale (SQL Server datetime2 strips Kind through ADO.NET; boundary values are stored UTC). With sibling CD-018 also closed, the EF read path now enforces UTC at the column level — the raw-ADO defence here is belt-and-braces for this method, which bypasses EF entirely.

Description

GetPartitionBoundariesOlderThanAsync reads reader.GetDateTime(0) and adds the raw value to the returned list. SQL Server's datetime2 materialises as DateTimeKind.Unspecified on the ADO.NET side (see CD-019), so every returned boundary has Kind=Unspecified. The sibling AuditLogPartitionMaintenance.GetMaxBoundaryAsync (AuditLogPartitionMaintenance.cs:103-104) explicitly defends against this exact issue by calling DateTime.SpecifyKind(dt, DateTimeKind.Utc) — exactly because EF / ADO.NET strips the kind — but the repository method does not. Callers (the AuditLogPurgeActor) that compare a returned boundary to DateTime.UtcNow get a silently wrong comparison if they ever serialise to/from a string with a local-kind assumption in between.

Recommendation

Wrap the read with DateTime.SpecifyKind(reader.GetDateTime(0), DateTimeKind.Utc), matching the explicit defensive pattern already in AuditLogPartitionMaintenance.GetMaxBoundaryAsync. Better still: fix CD-019 (a value converter on the column) so the defence at the read site is no longer required.

ConfigurationDatabase-021 — SwitchOutPartitionAsync interpolates monthBoundary / staging table name into raw SQL

Severity Low
Category Security
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/AuditLogRepository.cs:192-338

Resolution (2026-05-28): Took the targeted (1) part of the recommendation — the monthBoundary format string is now "yyyy-MM-dd HH:mm:ss.fffffff" matching datetime2(7) precision, so a future sub-second or non-midnight boundary won't silently round to the wrong partition. The staging table name is still interpolated (T-SQL DDL identifiers can't be parameterised) but remains internally constructed as $"AuditLog_Staging_{Guid.NewGuid():N}" so SQL injection is structurally impossible. The larger sp_executesql-with-typed-parameter migration was scoped to a future follow-up since the current shape is fully controlled and the precision- mismatch hazard was the only practical concern.

Description

SwitchOutPartitionAsync builds two large SQL batches via interpolated strings (sampleSql and sql) that include {monthBoundaryStr} and {stagingTableName} directly in the SQL text, and executes them via ExecuteSqlRawAsync / cmd.ExecuteScalarAsync. Both values are constructed inside the method — monthBoundaryStr = monthBoundary.ToUniversalTime().ToString("yyyy-MM-dd HH:mm:ss") and stagingTableName = $"AuditLog_Staging_{Guid.NewGuid():N}" — and the formats are fully controlled. SQL injection is therefore not possible as the code stands.

Two related concerns:

  1. The format string "yyyy-MM-dd HH:mm:ss" truncates fractional seconds. The partition function is seeded at T00:00:00 exactly, so truncation happens to produce the right boundary value today. A future change that adds a sub-second boundary (or invokes SwitchOutPartitionAsync with a non-midnight value) would silently round to the wrong partition with no error — and SWITCH PARTITION would either fail loudly or succeed against the wrong month. Use "yyyy-MM-dd HH:mm:ss.fffffff" to match the precision the migration seeds at, and the rounding ambiguity disappears.
  2. The pattern of "build a multi-statement DDL batch by string concatenation" is robust today only by inspection. A code review tripwire — the CLAUDE.md note "the data-access layer must not concatenate SQL" — would catch the pattern earlier; converting the batch to a parameterised sp_executesql invocation (the inner EXEC sp_executesql @sql already exists for the SWITCH itself) is the textbook safe form even when the input is internally controlled.

Recommendation

(1) Switch monthBoundaryStr's format to "yyyy-MM-dd HH:mm:ss.fffffff". (2) Optionally migrate the two batches to fully parameterised sp_executesql form so the monthBoundary value flows as a typed @boundary datetime2(7) parameter rather than as interpolated text — the only piece that genuinely cannot be parameterised is the staging table identifier (DDL identifiers are not parameterisable in T-SQL), but a server-side QUOTENAME(@stagingTable) wrapper covers it. Add a regression test that supplies a non-midnight monthBoundary value and asserts the boundary lookup resolves to the expected partition.

ConfigurationDatabase-022 — Stale "WP-24 Stub level sufficient for diff/staleness support" XML comment on DeploymentManagerRepository

Severity Low
Category Documentation & comments
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Repositories/DeploymentManagerRepository.cs:8-14

Description

The class-level XML doc on DeploymentManagerRepository reads "WP-24: Stub level sufficient for diff/staleness support." WP-24 (Deployment Manager work-package) shipped long ago; the repository now covers full DeploymentRecord CRUD, SystemArtifactDeploymentRecord CRUD, DeployedConfigSnapshot CRUD, and an Instance deletion path with explicit Restrict-FK cleanup (DeleteInstanceAsync at line 210-229). The comment misleads a reader into thinking the repository is incomplete and tempts them not to investigate further before adding new behaviour. The same module-context observation was noted but not raised in the prior review.

Recommendation

Remove the WP-24 line and rewrite the class doc to describe what the repository actually does today: EF Core implementation of IDeploymentManagerRepository covering deployment records, system-artifact deployment records, deployed config snapshots, and the Restrict-FK-aware DeleteInstanceAsync for the deployment pipeline. Cross-reference the optimistic-concurrency contract on DeploymentRecord.RowVersion.

Resolution (2026-05-28):

Replaced the stale "WP-24: Stub level sufficient for diff/staleness support" XML doc with an accurate one-paragraph summary covering DeploymentRecord CRUD (plus the RowVersion optimistic-concurrency contract), SystemArtifactDeploymentRecord CRUD, DeployedConfigSnapshot CRUD, and the Restrict-FK-aware DeleteInstanceAsync. No behaviour change.

ConfigurationDatabase-023 — AuditLog correlation-index name drifts from design doc (IX_AuditLog_CorrelationId vs IX_AuditLog_Correlation)

Severity Low
Category Design-document adherence
Status Resolved
Location src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/Configurations/AuditLogEntityTypeConfiguration.cs:99-101, Migrations/20260520142214_AddAuditLogTable.cs:103-107

Description

The Component-ConfigurationDatabase design doc lists the AuditLog indexes by name — including IX_AuditLog_Correlation (CorrelationId) for the "drilldown from a single operation" use case. The implemented index name is IX_AuditLog_CorrelationId (the fluent-config HasDatabaseName call and the matching DDL in the migration both use the Id-suffixed form). The names are syntactically valid SQL Server index names and the index does the right work; the drift is cosmetic but it breaks scripted maintenance ops that grep for the documented name (e.g. a runbook reindex script).

The other four documented index names (IX_AuditLog_OccurredAtUtc, IX_AuditLog_Site_Occurred, IX_AuditLog_Channel_Status_Occurred, IX_AuditLog_Target_Occurred, plus the post-design additions IX_AuditLog_Execution, IX_AuditLog_ParentExecution, IX_AuditLog_Node_Occurred) agree with the code.

Recommendation

Pick one direction. Updating the design doc to match the code is cheap (one word) and preserves the existing migration; renaming the index in the database requires a new migration that does sp_rename. Document-aligning is the lower-cost option and matches the resolution pattern used for CD-005.

Resolution (2026-05-28):

Doc-aligns-to-code, preserving the existing migration. docs/requirements/Component-AuditLog.md already lists the index as IX_AuditLog_CorrelationId (line 109). The stale IX_AuditLog_Correlation reference in docs/requirements/Component-ConfigurationDatabase.md line 64 (AuditLog table prose) was updated to IX_AuditLog_CorrelationId. Code unchanged; index name remains IX_AuditLog_CorrelationId.

ConfigurationDatabase-024 — Missing test coverage for SPLIT-RANGE failure-continuation and production-shape rowversion delete

Severity Low
Category Testing coverage
Status Resolved
Location tests/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase.Tests/Maintenance/AuditLogPartitionMaintenanceTests.cs, tests/.../RepositoryCoverageTests.cs:855-869

Resolution (2026-05-28): (1) Added AuditLogPartitionMaintenanceTests.EnsureLookahead_SecondSplitThrows_LoopAborts_FirstBoundaryStillCommitted (Skippable, MS SQL fixture) — installs a DbCommandInterceptor that lets the 1st ALTER PARTITION FUNCTION pf_AuditLog_Month() SPLIT RANGE through and throws on the 2nd, asserts the exception propagates (CD-019's no-try/catch behaviour), counts exactly one successful split, and verifies the first boundary IS now persisted in pf_AuditLog_Month so the next tick resumes from N+1 with no holes. (2) Added DeploymentManagerRepositoryTests.DeleteDeploymentRecord_CurrentRowVersion_StubAttachPath_DeleteSucceeds — production-shape happy path: caller holds the current RowVersion, change-tracker cleared, delete completes without throwing DbUpdateConcurrencyException and the row is gone (1 row affected).

Description

AuditLogPartitionMaintenanceTests exercises the happy-path SPLIT-RANGE behaviour (no-op, single-month, three-month, already-exists idempotency) but never simulates a SPLIT failure — so the catch-and-continue behaviour flagged in CD-019 is behaviourally untested. The class is a central singleton driving daily audit purge; a regression that turned the failure path into a permanent hole would not surface in the test suite.

Separately, RepositoryCoverageTests.DeleteDeploymentRecord_ViaStubAttachPath_RemovesEntity covers the stub-attach delete path under the SQLite test fixture, but the fixture remaps RowVersion as a nullable concurrency token (SqliteTestHelper), so it does not exercise the production-shape IsRowVersion() auto-population — the actual concurrency-token bug flagged in CD-018 cannot show up. There is an MsSqlMigrationFixture in the test project already (used by the Audit Log migration tests); the stub-attach delete deserves a parallel MS-SQL-flavoured test.

Recommendation

(1) Add an AuditLogPartitionMaintenanceTests case that constructs a context against a constrained login (no ALTER ON SCHEMA::dbo), invokes EnsureLookaheadAsync for a three-month gap, and asserts: only the partition boundaries created BEFORE the permissions failure landed remain, and the call aborts cleanly without continuing to later months. This pins down the resolution of CD-019. (2) Add a RepositoryCoverageTests case that uses MsSqlMigrationFixture to insert a DeploymentRecord, clear the change tracker, call DeleteDeploymentRecordAsync, and assert the row is gone — pinning the resolution of CD-018. Both tests should be [SkippableFact] so the suite still passes when no MS SQL Server is available.