docs(components): reference docs batch 1/4 — Commons, ConfigurationDatabase, Communication, ClusterInfrastructure, Host, Security
@@ -0,0 +1,290 @@
|
||||
# Cluster Infrastructure
|
||||
|
||||
The Cluster Infrastructure component manages Akka.NET cluster formation, active/standby failover, split-brain resolution, and the singleton hosting that all other ScadaBridge components depend on. Every site and central cluster is a two-node active/standby pair governed by the same configuration contract and bootstrap logic.
|
||||
|
||||
## Overview
|
||||
|
||||
Cluster Infrastructure (#13) is a **design responsibility** spanning two projects rather than a single buildable project:
|
||||
|
||||
- **`src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/`** owns the cluster configuration contract: `ClusterOptions` (seed nodes, failure-detection timings, split-brain settings), `ClusterOptionsValidator`, and the `AddClusterInfrastructure` DI extension that registers the validator. It does not start an actor system.
|
||||
- **`src/ZB.MOM.WW.ScadaBridge.Host/`** owns the cluster bootstrap and runtime wiring: `AkkaHostedService` builds the Akka HOCON from `ClusterOptions` and `NodeOptions`, starts the `ActorSystem`, wires `CoordinatedShutdown`, and creates all role-specific actors including the cluster singletons.
|
||||
|
||||
This split is deliberate. The Host is the single deployable binary and the only project that performs Akka.NET bootstrap, so all cluster bring-up lives there. `ClusterInfrastructure` is the portable configuration contract that the Host consumes — it can be referenced by tests and other components without pulling in the Host.
|
||||
|
||||
Both central and site clusters run this same topology: two nodes, one active (cluster leader), one standby, with automatic failover and no manual intervention required for dual-node recovery.
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### Active/standby via cluster leadership
|
||||
|
||||
Akka.NET cluster leadership determines which node is "active". `ActiveNodeGate` (in the Host) exposes `IsActiveNode` by checking that `cluster.SelfMember.Status == MemberStatus.Up` and `cluster.State.Leader == cluster.SelfAddress`. Cluster singletons run on the oldest `Up` member and migrate automatically to the surviving node on failover. Akka.NET picks the leader by member-address ordering rather than by age, so while both nodes are up the leader and the oldest member are not guaranteed to be the same node; after a failover the survivor is both.
|
||||
|
||||
### Configuration contract vs. bootstrap split
|
||||
|
||||
`ClusterOptions` holds the cluster-wide formation and failure-detection settings. Node-identity settings — remoting hostname/port, role (`Central` or `Site`), site identifier, gRPC port — live in `NodeOptions` (`ScadaBridge:Node` section), owned by the Host. This split prevents the configuration contract from acquiring a hard dependency on Host-specific concerns.
|
||||
|
||||
### Singleton hosting
|
||||
|
||||
Cluster Infrastructure provides the hosting platform; each singleton is owned and created by the component responsible for it. The Host's `RegisterCentralActors` and `RegisterSiteActorsAsync` methods wire every singleton via `ClusterSingletonManager` and a companion `ClusterSingletonProxy` so other actors can address it through a stable path regardless of which node currently hosts it.
|
||||
|
||||
## Architecture
|
||||
|
||||
### HOCON assembly
|
||||
|
||||
`AkkaHostedService.BuildHocon` constructs the Akka HOCON document from the bound options at startup. All interpolated values pass through `QuoteHocon` (string escaping) and `DurationHocon` (millisecond rendering) so the document is never corrupted by hostnames or timing values containing special characters or sub-second precision:
|
||||
|
||||
```csharp
public static string BuildHocon(
    NodeOptions nodeOptions,
    ClusterOptions clusterOptions,
    IEnumerable<string> roles,
    TimeSpan transportHeartbeat,
    TimeSpan transportFailure)
{
    var seedNodesStr = string.Join(",",
        clusterOptions.SeedNodes.Select(QuoteHocon));
    var rolesStr = string.Join(",", roles.Select(QuoteHocon));

    return $@"
audit-telemetry-dispatcher {{
  type = ForkJoinDispatcher
  throughput = 100
  dedicated-thread-pool {{
    thread-count = 2
  }}
}}
akka {{
  actor {{
    provider = cluster
  }}
  cluster {{
    seed-nodes = [{seedNodesStr}]
    roles = [{rolesStr}]
    min-nr-of-members = {clusterOptions.MinNrOfMembers}
    split-brain-resolver {{
      active-strategy = {QuoteHocon(clusterOptions.SplitBrainResolverStrategy)}
      stable-after = {DurationHocon(clusterOptions.StableAfter)}
      keep-oldest {{
        down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}
      }}
    }}
    failure-detector {{
      heartbeat-interval = {DurationHocon(clusterOptions.HeartbeatInterval)}
      acceptable-heartbeat-pause = {DurationHocon(clusterOptions.FailureDetectionThreshold)}
    }}
    run-coordinated-shutdown-when-down = on
  }}
  coordinated-shutdown {{
    run-by-clr-shutdown-hook = on
  }}
}}";
}
```
|
||||
|
||||
The HOCON also defines the `audit-telemetry-dispatcher` (a two-thread `ForkJoinDispatcher`) so `SiteAuditTelemetryActor`'s SQLite reads and gRPC pushes never contend with the default dispatcher used by hot-path actors.
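`QuoteHocon` and `DurationHocon` are referenced by `BuildHocon` but not shown here. The following is a minimal sketch of what such helpers could look like, assuming `QuoteHocon` double-quotes a value and escapes backslashes and embedded quotes, and `DurationHocon` renders a whole number of milliseconds with an `ms` suffix. The class name `HoconFormatting` and both method bodies are illustrative assumptions, not the Host's actual implementation.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-ins for the helpers used by BuildHocon.
public static class HoconFormatting
{
    // Wraps the value in double quotes so characters such as ':' '@' or '/'
    // in hostnames and seed-node URIs cannot be parsed as HOCON syntax.
    public static string QuoteHocon(string value)
    {
        if (value is null) throw new ArgumentNullException(nameof(value));
        // Escape backslashes first, then quotes, so escapes are not doubled.
        var escaped = value.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Renders a TimeSpan as whole milliseconds, which keeps sub-second
    // values exact and avoids culture-specific decimal separators.
    public static string DurationHocon(TimeSpan value)
    {
        var milliseconds = (long)Math.Round(value.TotalMilliseconds);
        return milliseconds.ToString(CultureInfo.InvariantCulture) + "ms";
    }
}
```

Rendering every duration in one unit means a value such as `00:00:01.500` never has to be expressed as a fractional number of seconds.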
|
||||
|
||||
### Split-brain resolution
|
||||
|
||||
The keep-oldest strategy is the only strategy `ClusterOptionsValidator` permits for ScadaBridge's two-node clusters. Quorum strategies (`keep-majority`, `static-quorum`) are a poor fit for two nodes: a lone survivor cannot tell a crashed partner from a partition and can never hold a strict majority, so losing one node risks shutting down the other as well. Keep-oldest with `down-if-alone = on` ensures at most one node runs the cluster at any time:
|
||||
|
||||
- On a network partition, the older node stays active; the younger node downs itself.
|
||||
- If the oldest node finds itself alone (no reachable members), it downs itself rather than running in isolation. Without `down-if-alone`, the oldest node could run as a single-node cluster while the younger node forms its own — producing two live clusters with divergent singleton state.
|
||||
|
||||
### Failure detection and failover timeline
|
||||
|
||||
Detection uses two independent Akka heartbeat channels:
|
||||
|
||||
- **Cluster failure detector** (`akka.cluster.failure-detector`): monitors membership, triggers `Unreachable` events that the split-brain resolver acts on.
|
||||
- **Transport failure detector** (`akka.remote.transport-failure-detector`): monitors the underlying TCP transport between nodes; its timings are configured separately and come from `CommunicationOptions.TransportHeartbeatInterval` and `CommunicationOptions.TransportFailureThreshold`.
|
||||
|
||||
With the defaults in `ClusterOptions`, the total failover budget is approximately 25 seconds:
|
||||
|
||||
| Phase | Duration | Source |
|-------|----------|--------|
| Failure detection (`acceptable-heartbeat-pause`) | 10 s | `ClusterOptions.FailureDetectionThreshold` |
| Split-brain stable-after | 15 s | `ClusterOptions.StableAfter` |
| Singleton restart | < 1 s | Actor `PreStart` |
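The 25-second figure is the sum of the two configurable waits; singleton restart adds under a second on top. A small sketch of that arithmetic follows, where `FailoverBudget` is a hypothetical helper and not part of the codebase:

```csharp
using System;

// Hypothetical helper illustrating how the failover budget is composed.
public static class FailoverBudget
{
    // The resolver can only act once the failure detector has marked the peer
    // unreachable AND the cluster has then been stable for StableAfter,
    // so the two durations add rather than overlap.
    public static TimeSpan Estimate(TimeSpan failureDetectionThreshold, TimeSpan stableAfter)
        => failureDetectionThreshold + stableAfter;
}
```

With the documented defaults (`00:00:10` and `00:00:15`) the estimate is `00:00:25`. Lowering either value shortens failover but makes a transient network pause more likely to be treated as a node failure.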
|
||||
|
||||
### Graceful shutdown and singleton handover
|
||||
|
||||
When a node is stopped cleanly, `CoordinatedShutdown` runs before the CLR exits (`run-by-clr-shutdown-hook = on`). The cluster-leave phase signals Akka to migrate singletons before the actor system terminates, so handover happens in seconds rather than waiting for the full failure-detection timeout. `SiteCallAuditActor` has an explicit graceful-stop task registered on `PhaseClusterLeave` with a 10-second timeout, so that any in-flight EF Core upsert drains before handover begins:
|
||||
|
||||
```csharp
siteCallAuditShutdown.AddTask(
    Akka.Actor.CoordinatedShutdown.PhaseClusterLeave,
    "drain-site-call-audit-singleton",
    async () =>
    {
        try
        {
            await siteCallAuditSingletonManager.GracefulStop(TimeSpan.FromSeconds(10));
        }
        catch (Exception ex)
        {
            _logger.LogWarning(ex,
                "SiteCallAudit singleton did not drain within the graceful-stop "
                + "timeout; falling through to PoisonPill handover");
        }
        return Akka.Done.Instance;
    });
```
|
||||
|
||||
### Cluster roles and singleton scoping
|
||||
|
||||
Each node carries one or more cluster roles set in the HOCON `roles` list. Site nodes carry both a base `"Site"` role and a site-specific role (`"site-{SiteId}"`, e.g. `"site-site-a"`). Singletons on site clusters are scoped to the site-specific role so each site's singleton runs on exactly one node of that site's cluster, not on any other site's nodes. Central singletons use no role scope — all central nodes share the `"Central"` role.
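As a rough illustration of the role scheme described above, the sketch below derives the roles list from a node's role and site identifier. `ClusterRoles.For` is a hypothetical helper; how the Host actually builds the list passed to `BuildHocon` is not shown in this document.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper: derives the HOCON roles list for a node.
public static class ClusterRoles
{
    public static IReadOnlyList<string> For(string role, string? siteId)
    {
        if (role == "Site")
        {
            if (string.IsNullOrWhiteSpace(siteId))
                throw new ArgumentException("SiteId is required for Site nodes.", nameof(siteId));

            // Base role plus the site-specific role used to scope singletons.
            return new[] { "Site", $"site-{siteId}" };
        }

        // Central nodes carry only the shared role.
        return new[] { role };
    }
}
```

For a `SiteId` of `site-a` this yields `Site` and `site-site-a`, matching the example above.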
|
||||
|
||||
### Dual-node recovery
|
||||
|
||||
Because both nodes are configured as seed nodes, whichever node starts first after a simultaneous failure forms a new cluster, and the second joins when it comes up. No startup ordering dependency exists, and no manual intervention is required. This relies on each node listing its own address first in `SeedNodes`, because Akka.NET lets only the first listed seed node form a new cluster by joining itself. The keep-oldest resolver handles the "both starting fresh" case naturally, since there is no pre-existing cluster to conflict with.
|
||||
|
||||
### Cluster singletons hosted
|
||||
|
||||
The Host wires the following singletons. Cluster Infrastructure provides the `ClusterSingletonManager` / `ClusterSingletonProxy` pattern; each singleton's behaviour is documented in the owning component.
|
||||
|
||||
**Central singletons (active central node, no role scope):**
|
||||
|
||||
| Singleton name | Actor class | Owner |
|----------------|-------------|-------|
| `notification-outbox` | `NotificationOutboxActor` | Notification Outbox (#21) |
| `audit-log-ingest` | `AuditLogIngestActor` | Audit Log (#23) |
| `site-call-audit` | `SiteCallAuditActor` | Site Call Audit (#22) |
|
||||
|
||||
**Site singletons (active site node, scoped to `"site-{SiteId}"` role):**
|
||||
|
||||
| Singleton name | Actor class | Owner |
|----------------|-------------|-------|
| `deployment-manager` | `DeploymentManagerActor` | Site Runtime (#3) |
| `event-log-handler` | `EventLogHandlerActor` | Site Event Logging (#12) |
|
||||
|
||||
`SiteAuditTelemetryActor` (Audit Log #23) is **not** a singleton — it runs on every site node and reads node-local SQLite. It is created directly with `ActorOf` and bound to the `audit-telemetry-dispatcher`.
|
||||
|
||||
## Usage
|
||||
|
||||
### Registering the configuration contract
|
||||
|
||||
Every host calls `AddClusterInfrastructure` to register `ClusterOptionsValidator`:
|
||||
|
||||
```csharp
services.AddClusterInfrastructure();
```
|
||||
|
||||
This registers `ClusterOptionsValidator` as an `IValidateOptions<ClusterOptions>` singleton. Because the Host binds `ClusterOptions` with `ValidateOnStart`, a misconfigured `ScadaBridge:Cluster` section (wrong strategy, `MinNrOfMembers != 1`, `DownIfAlone = false`, fewer than two seed nodes) throws an `OptionsValidationException` at startup rather than booting into a broken cluster.
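The rules listed above can be sketched as a plain function. This is an assumption-laden illustration: `ClusterOptionsSketch` and `ClusterOptionsRules` are hypothetical names, the real validator implements `IValidateOptions<ClusterOptions>` rather than returning a list, and whether it also enforces `HeartbeatInterval < FailureDetectionThreshold` is assumed from the configuration table rather than confirmed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mirror of the documented ClusterOptions keys and defaults.
public sealed class ClusterOptionsSketch
{
    public List<string> SeedNodes { get; set; } = new List<string>();
    public string SplitBrainResolverStrategy { get; set; } = "keep-oldest";
    public TimeSpan HeartbeatInterval { get; set; } = TimeSpan.FromSeconds(2);
    public TimeSpan FailureDetectionThreshold { get; set; } = TimeSpan.FromSeconds(10);
    public int MinNrOfMembers { get; set; } = 1;
    public bool DownIfAlone { get; set; } = true;
}

public static class ClusterOptionsRules
{
    // Returns one message per violated rule; an empty list means the options are valid.
    public static IReadOnlyList<string> Validate(ClusterOptionsSketch options)
    {
        var errors = new List<string>();

        if (options.SeedNodes.Count < 2)
            errors.Add("SeedNodes must contain at least 2 entries.");
        if (options.SplitBrainResolverStrategy != "keep-oldest")
            errors.Add("SplitBrainResolverStrategy must be 'keep-oldest'.");
        if (options.MinNrOfMembers != 1)
            errors.Add("MinNrOfMembers must be 1.");
        if (!options.DownIfAlone)
            errors.Add("DownIfAlone must be true.");
        // Assumed rule, taken from the configuration table's description.
        if (options.HeartbeatInterval >= options.FailureDetectionThreshold)
            errors.Add("HeartbeatInterval must be less than FailureDetectionThreshold.");

        return errors;
    }
}
```

Collecting every violation before failing lets a single startup attempt report all configuration mistakes at once.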
|
||||
|
||||
### Checking active-node status
|
||||
|
||||
Components that must run only on the active node resolve `IActiveNodeGate` (registered by the Host's Central composition root):
|
||||
|
||||
```csharp
public bool IsActiveNode
{
    get
    {
        var system = _akkaService.ActorSystem;
        if (system == null) return false;
        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up) return false;
        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}
```
|
||||
|
||||
This returns `false` while the actor system is warming up — the safe-by-default answer matching the standby case. The Inbound API uses this gate to return HTTP 503 on standby nodes.
|
||||
|
||||
## Configuration
|
||||
|
||||
`ClusterOptions` is bound from `ScadaBridge:Cluster`. `NodeOptions` is bound from `ScadaBridge:Node`.
|
||||
|
||||
### `ScadaBridge:Cluster`
|
||||
|
||||
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `SeedNodes` | `List<string>` | (required) | Akka seed-node URIs. Must contain at least 2 entries; both nodes list both themselves and their partner. |
| `SplitBrainResolverStrategy` | `string` | `"keep-oldest"` | Must be `"keep-oldest"`. Quorum strategies are rejected by `ClusterOptionsValidator`. |
| `StableAfter` | `TimeSpan` | `00:00:15` | Cluster must be stable for this duration before the resolver acts to down unreachable nodes. |
| `HeartbeatInterval` | `TimeSpan` | `00:00:02` | Cluster failure-detector heartbeat frequency. Must be less than `FailureDetectionThreshold`. |
| `FailureDetectionThreshold` | `TimeSpan` | `00:00:10` | `acceptable-heartbeat-pause` for the cluster failure detector. |
| `MinNrOfMembers` | `int` | `1` | Must be `1`. A value of `2` blocks the cluster singleton after failover. |
| `DownIfAlone` | `bool` | `true` | Must be `true`. See split-brain resolution above. |
|
||||
|
||||
### `ScadaBridge:Node`
|
||||
|
||||
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `Role` | `string` | (required) | `"Central"` or `"Site"`. |
| `NodeHostname` | `string` | (required) | Hostname this node advertises to the Akka cluster remoting layer. |
| `NodeName` | `string` | `""` | Semantic label stamped on audit rows (`SourceNode`). Conventional values: `node-a`/`node-b` for sites, `central-a`/`central-b` for central. |
| `SiteId` | `string?` | (required for Site) | Site identifier; appended to the site-specific cluster role (`site-{SiteId}`). |
| `RemotingPort` | `int` | `8081` | Akka.NET TCP remoting port. Default `8081` for central, `8082` for site. |
| `GrpcPort` | `int` | `8083` | Kestrel HTTP/2 port for `SiteStreamGrpcServer` (site nodes only). Must differ from `RemotingPort`. |
| `MetricsPort` | `int` | `8084` | Kestrel HTTP/1.1 port for the Prometheus `/metrics` scrape endpoint (site nodes only). Must differ from `RemotingPort` and `GrpcPort`. |
|
||||
|
||||
### Representative docker configuration (central node A)
|
||||
|
||||
```json
{
  "ScadaBridge": {
    "Node": {
      "Role": "Central",
      "NodeName": "central-a",
      "NodeHostname": "scadabridge-central-a",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@scadabridge-central-a:8081",
        "akka.tcp://scadabridge@scadabridge-central-b:8081"
      ],
      "SplitBrainResolverStrategy": "keep-oldest",
      "StableAfter": "00:00:15",
      "HeartbeatInterval": "00:00:02",
      "FailureDetectionThreshold": "00:00:10",
      "MinNrOfMembers": 1
    }
  }
}
```
|
||||
|
||||
`DownIfAlone` is not present in the docker files because its default value of `true` is correct and `ClusterOptionsValidator` rejects `false`.
|
||||
|
||||
## Dependencies & Interactions
|
||||
|
||||
- [Host (#15)](./Host.md) — owns the Akka.NET bootstrap. `AkkaHostedService` consumes `ClusterOptions` and `NodeOptions`, assembles the HOCON, starts the `ActorSystem`, creates all role-specific actors, and wires `CoordinatedShutdown`. The `ClusterInfrastructure` project has no compile-time dependency on the Host; the reference runs the other way, with the Host consuming `ClusterInfrastructure`.
|
||||
- [Commons (#16)](./Commons.md) — provides `INodeIdentityProvider` (implemented by `NodeIdentityProvider` in the Host), which supplies the `NodeName` label that audit writers stamp on the `SourceNode` column. Also provides `IClusterNodeProvider` (implemented by `AkkaClusterNodeProvider` in the Host), which the Health Monitoring component uses to report per-node up/down status.
|
||||
- [Health Monitoring (#11)](./HealthMonitoring.md) — uses `IClusterNodeProvider` to list cluster members and determine whether the local node is primary; uses `IActiveNodeGate` (central only) to gate active-node-only health paths. The active/standby distinction reported to central health originates here.
|
||||
- [Site Runtime (#3)](./SiteRuntime.md) — the Deployment Manager singleton is the most operationally critical singleton this infrastructure hosts. It re-creates the full Instance Actor hierarchy from local SQLite on failover. Staggered Instance Actor startup after failover is Site Runtime's responsibility; this component provides the singleton placement guarantee.
|
||||
- [Notification Outbox (#21)](./NotificationOutbox.md), [Site Call Audit (#22)](./SiteCallAudit.md), [Audit Log (#23)](./AuditLog.md) — each hosts one or more central singletons wired by `RegisterCentralActors`. Cluster Infrastructure provides the `ClusterSingletonManager`/`ClusterSingletonProxy` boilerplate and the graceful-shutdown hooks; the business logic lives in the owning component.
|
||||
- [Central–Site Communication (#5)](./Communication.md) — `CentralCommunicationActor` and `SiteCommunicationActor` are created and registered with `ClusterClientReceptionist` inside the same `AkkaHostedService` startup, making them addressable by remote `ClusterClient` instances. The transport-level heartbeat (`TransportHeartbeatInterval`, `TransportFailureThreshold`) is configured separately from the cluster failure-detector and comes from `CommunicationOptions`.
|
||||
- [Inbound API (#14)](./InboundApi.md) — resolves `IActiveNodeGate` to return HTTP 503 on standby central nodes. The gate returns `false` until the actor system has started, this node's member status is `Up`, and this node is the cluster leader.
|
||||
- Design spec: [Component-ClusterInfrastructure.md](../requirements/Component-ClusterInfrastructure.md).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Node fails to join cluster on startup
|
||||
|
||||
`ClusterOptionsValidator` rejects fewer than two seed nodes, a non-`keep-oldest` strategy, `MinNrOfMembers != 1`, or `DownIfAlone = false` at startup with an `OptionsValidationException`. Check that both seed-node URIs reference the Akka remoting port, not the gRPC port (8083) or metrics port (8084) — `StartupValidator` explicitly rejects seed entries whose port matches `GrpcPort`.
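When checking seed-node ports by hand, a small helper can flag entries that point at the wrong port. The sketch below is an illustration only: `SeedNodeChecks` is a hypothetical name, and it assumes seed-node URIs of the `akka.tcp://system@host:port` form can be parsed with `System.Uri`; it is not the `StartupValidator` implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical diagnostic helper, not the real StartupValidator.
public static class SeedNodeChecks
{
    // Returns every seed-node URI whose port equals the given port,
    // for example the node's GrpcPort, which must never appear in SeedNodes.
    public static IReadOnlyList<string> SeedsOnPort(IEnumerable<string> seedNodes, int port) =>
        seedNodes.Where(seed => new Uri(seed).Port == port).ToList();
}
```

Calling `SeedsOnPort(options.SeedNodes, nodeOptions.GrpcPort)` and treating a non-empty result as a misconfiguration reproduces the spirit of the documented check.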
|
||||
|
||||
### Singleton not starting after failover
|
||||
|
||||
If the surviving node is `Up` but singletons do not start, `MinNrOfMembers` is the first thing to check. A value of `2` keeps the surviving node waiting for a second member indefinitely. The validator enforces `1`, but a manually patched `appsettings.json` that bypasses the validator could produce this.
|
||||
|
||||
### Two live clusters (split-brain)
|
||||
|
||||
If `DownIfAlone = false` were accepted (the validator rejects it), the oldest node could run alone while the younger forms its own cluster, producing two live clusters with divergent singleton state and dual MS SQL writers on central. `ClusterOptionsValidator` makes this configuration impossible to boot.
|
||||
|
||||
### Graceful shutdown takes longer than expected
|
||||
|
||||
If a clean node stop takes up to 25 seconds instead of a few seconds, `CoordinatedShutdown` may not be running. Check that `run-by-clr-shutdown-hook = on` is present in the assembled HOCON (it is emitted unconditionally by `BuildHocon`) and that the Windows Service stop signal reaches the process rather than the process being killed outright. A `SIGKILL` or `TerminateProcess` bypasses `CoordinatedShutdown` entirely; the surviving node then has to wait the full failure-detection window.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Cluster Infrastructure design specification](../requirements/Component-ClusterInfrastructure.md)
|
||||
- [Host](./Host.md)
|
||||
- [Site Runtime](./SiteRuntime.md)
|
||||
- [Health Monitoring](./HealthMonitoring.md)
|
||||
- [Central–Site Communication](./Communication.md)
|
||||
- [Notification Outbox](./NotificationOutbox.md)
|
||||
- [Site Call Audit](./SiteCallAudit.md)
|
||||
- [Audit Log](./AuditLog.md)
|
||||
- [Commons](./Commons.md)
|
||||
@@ -0,0 +1,320 @@
|
||||
# Commons
|
||||
|
||||
Commons is the foundational shared library that all other ScadaBridge components depend on — it defines the POCO entity classes, repository interfaces, service interfaces, message contracts, shared enums, and utility types that the system builds on top of.
|
||||
|
||||
## Overview
|
||||
|
||||
Commons (#16) is not a runtime component. It has no actors, no hosted services, and no DI registrations of its own. Its single role is to hold the shared type vocabulary — entity shapes, interface contracts, and message definitions — so that every component agrees on the same types without depending on each other.
|
||||
|
||||
The project enforces minimal dependencies by design: only the core .NET SDK. It must not reference Akka.NET, ASP.NET Core, Entity Framework Core, or any persistence or framework library, because it is referenced by all other projects and a framework dependency here becomes a transitive constraint on everything.
|
||||
|
||||
Source lives in `src/ZB.MOM.WW.ScadaBridge.Commons/`, organized into four main top-level folders (`Types/`, `Interfaces/`, `Entities/`, and `Messages/`) alongside the smaller `Observability/`, `Serialization/`, and `Validators/` folders.
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### Persistence-ignorant entity classes
|
||||
|
||||
All configuration database entity classes live in `Entities/` as plain C# classes with no EF attributes, no EF base classes, and no persistence-framework annotations. Navigation properties (for example `Template.Attributes`) are plain `ICollection<T>` — EF Fluent API configuration is the Configuration Database component's job, not Commons'. The entities may include constructors that enforce required fields:
|
||||
|
||||
```csharp
// Entities/Templates/Template.cs
public class Template
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string? Description { get; set; }
    public int? ParentTemplateId { get; set; }
    public int? FolderId { get; set; }
    public ICollection<TemplateAttribute> Attributes { get; set; } = new List<TemplateAttribute>();
    public ICollection<TemplateAlarm> Alarms { get; set; } = new List<TemplateAlarm>();
    public ICollection<TemplateScript> Scripts { get; set; } = new List<TemplateScript>();
    public ICollection<TemplateComposition> Compositions { get; set; } = new List<TemplateComposition>();
    public ICollection<TemplateNativeAlarmSource> NativeAlarmSources { get; set; } = new List<TemplateNativeAlarmSource>();
    public bool IsDerived { get; set; }
    public int? OwnerCompositionId { get; set; }

    public Template(string name)
    {
        Name = name ?? throw new ArgumentNullException(nameof(name));
    }
}
```
|
||||
|
||||
### Repository interfaces
|
||||
|
||||
Commons defines one repository interface per consuming component. Implementations live entirely in the Configuration Database component. Each interface accepts and returns the POCO entity classes from Commons, and every interface includes `SaveChangesAsync()` to support the unit-of-work pattern without requiring a dependency on EF Core.
|
||||
|
||||
### Message contracts and additive-only evolution
|
||||
|
||||
Messages in `Messages/` are `record` types or immutable classes. Because sites and central may temporarily run different software versions, the rule is additive-only: new fields may be added with defaults; existing fields must not be removed or have their types changed. Contracts that cross the site→central gRPC boundary — `CachedCallTelemetry`, `AuditTelemetryEnvelope`, `NotificationSubmit`, and the pull reconciliation messages — are the most version-sensitive and have this rule explicitly called out in their XML docs.
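The additive-only rule can be illustrated with two versions of a record. This is a sketch under stated assumptions: `StatusUpdateV1` and `StatusUpdateV2` are hypothetical contracts, and JSON via `System.Text.Json` stands in for the real wire format purely to keep the example self-contained.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical contract as an older node would send it.
public sealed record StatusUpdateV1(Guid OperationId, string Status);

// Newer contract: one field added with a default, nothing removed or retyped.
public sealed record StatusUpdateV2(Guid OperationId, string Status, string? SourceNode = null);

public static class AdditiveEvolutionDemo
{
    // Serializes with the old shape and reads it back with the new shape,
    // mimicking an upgraded receiver talking to a not-yet-upgraded sender.
    public static StatusUpdateV2 ReadOldPayloadWithNewContract()
    {
        var old = new StatusUpdateV1(
            Guid.Parse("11111111-1111-1111-1111-111111111111"), "Delivered");
        string json = JsonSerializer.Serialize(old);
        return JsonSerializer.Deserialize<StatusUpdateV2>(json)!;
    }
}
```

The old payload still deserializes because the added field is optional; removing `Status` or changing its type would break that reader instead.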
|
||||
|
||||
### Pure-helper carve-out
|
||||
|
||||
Commons may contain stateless, side-effect-free helper types that transform or validate the data types it already defines. Anything that would require I/O, shared mutable state across calls beyond a self-contained instance, or knowledge of another component is excluded. Current examples: `Result<T>`, `ScriptParameters`, `ValueFormatter`, `DynamicJsonElement`, `StaleTagMonitor`, `OpcUaEndpointConfigSerializer`, and `OpcUaEndpointConfigValidator`.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Namespace and folder structure
|
||||
|
||||
```
ZB.MOM.WW.ScadaBridge.Commons/
├── Types/           # Enums/, Alarms/, Audit/, DataConnections/,
│                    # Flattening/, InboundApi/, Notifications/,
│                    # Transport/, Scripts/ + top-level utility types
├── Interfaces/      # Protocol/, Repositories/, Services/, Transport/,
│                    # Security/
├── Entities/        # Templates/, Instances/, Sites/, ExternalSystems/,
│                    # Notifications/, InboundApi/, Security/,
│                    # Deployment/, Scripts/, Audit/
├── Messages/        # Deployment/, Lifecycle/, Health/, Communication/,
│                    # Streaming/, DebugView/, ScriptExecution/,
│                    # Artifacts/, DataConnection/, Instance/,
│                    # Integration/, Notification/, InboundApi/,
│                    # RemoteQuery/, Audit/, Management/
├── Observability/   # ScadaBridgeTelemetry (meter + instrument definitions)
├── Serialization/   # OpcUaEndpointConfigSerializer, MxGatewayEndpointConfigSerializer
└── Validators/      # OpcUaEndpointConfigValidator, MxGatewayEndpointConfigValidator
```
|
||||
|
||||
Namespaces mirror folders: `ZB.MOM.WW.ScadaBridge.Commons.Entities.Templates`, `ZB.MOM.WW.ScadaBridge.Commons.Interfaces.Repositories`, and so on.
|
||||
|
||||
### Entity classes by domain area
|
||||
|
||||
| Folder | Classes |
|---|---|
| `Entities/Templates/` | `Template`, `TemplateAttribute`, `TemplateAlarm`, `TemplateNativeAlarmSource`, `TemplateScript`, `TemplateComposition`, `TemplateFolder` |
| `Entities/Instances/` | `Instance`, `InstanceAttributeOverride`, `InstanceConnectionBinding`, `InstanceAlarmOverride`, `InstanceNativeAlarmSourceOverride`, `Area` |
| `Entities/Sites/` | `Site`, `DataConnection` |
| `Entities/ExternalSystems/` | `ExternalSystemDefinition`, `ExternalSystemMethod`, `DatabaseConnectionDefinition` |
| `Entities/Notifications/` | `NotificationList`, `NotificationRecipient`, `SmtpConfiguration`, `Notification` |
| `Entities/InboundApi/` | `ApiKey`, `ApiMethod` |
| `Entities/Security/` | `LdapGroupMapping`, `SiteScopeRule` |
| `Entities/Deployment/` | `DeploymentRecord`, `SystemArtifactDeploymentRecord`, `DeployedConfigSnapshot` |
| `Entities/Scripts/` | `SharedScript` |
| `Entities/Audit/` | `AuditLogEntry` (config-change audit), `SiteCall` (SiteCalls operational mirror) |
|
||||
|
||||
The `Instance` entity illustrates the typical POCO shape — required fields enforced by a constructor, navigation collections as plain `ICollection<T>`, and no persistence annotations:
|
||||
|
||||
```csharp
// Entities/Instances/Instance.cs
public class Instance
{
    public int Id { get; set; }
    public int TemplateId { get; set; }
    public int SiteId { get; set; }
    public int? AreaId { get; set; }
    public string UniqueName { get; set; }
    public InstanceState State { get; set; }
    public ICollection<InstanceAttributeOverride> AttributeOverrides { get; set; } = new List<InstanceAttributeOverride>();
    public ICollection<InstanceAlarmOverride> AlarmOverrides { get; set; } = new List<InstanceAlarmOverride>();
    public ICollection<InstanceConnectionBinding> ConnectionBindings { get; set; } = new List<InstanceConnectionBinding>();
    public ICollection<InstanceNativeAlarmSourceOverride> NativeAlarmSourceOverrides { get; set; } = new List<InstanceNativeAlarmSourceOverride>();

    public Instance(string uniqueName)
    {
        UniqueName = uniqueName ?? throw new ArgumentNullException(nameof(uniqueName));
    }
}
```
|
||||
|
||||
### Repository interfaces by consuming component
|
||||
|
||||
| Interface | Consuming component | Scope |
|---|---|---|
| `ITemplateEngineRepository` | Template Engine | Templates, attributes, alarms, native alarm sources, scripts, compositions, folders, instances, overrides, connection bindings, areas, shared scripts |
| `IDeploymentManagerRepository` | Deployment Manager | Deployment records, snapshots, system artifact deployments |
| `ISecurityRepository` | Security & Auth | LDAP group mappings, site scope rules |
| `IInboundApiRepository` | Inbound API | API keys, API method definitions |
| `IExternalSystemRepository` | External System Gateway | External system definitions, methods, database connection definitions |
| `INotificationRepository` | Notification Service | Notification lists, recipients, SMTP configuration |
| `INotificationOutboxRepository` | Notification Outbox | `Notifications` table: ingest, due-row polling, status transitions, KPI queries, bulk purge |
| `ISiteCallAuditRepository` | Site Call Audit | `SiteCalls` table: ingest, upsert-on-newer-status, KPI queries, bulk purge |
| `IAuditLogRepository` | Audit Log | `AuditLog` table: idempotent ingest, keyset-paged query, partition switch-out, KPI snapshots, execution tree walk |
| `ISiteRepository` | Central UI, Site Runtime | Sites, data connections, site assignments |
| `ICentralUiRepository` | Central UI | Read-spanning queries for display |
|
||||
|
||||
`IAuditLogRepository` enforces the append-only contract at the API level — it exposes no Update and no single-row Delete. Bulk purge is `SwitchOutPartitionAsync` only. Ingest is idempotent on `EventId`:
|
||||
|
||||
```csharp
// Interfaces/Repositories/IAuditLogRepository.cs
public interface IAuditLogRepository
{
    Task InsertIfNotExistsAsync(AuditEvent evt, CancellationToken ct = default);

    Task<IReadOnlyList<AuditEvent>> QueryAsync(
        AuditLogQueryFilter filter,
        AuditLogPaging paging,
        CancellationToken ct = default);

    Task<long> SwitchOutPartitionAsync(DateTime monthBoundary, CancellationToken ct = default);

    Task<IReadOnlyList<DateTime>> GetPartitionBoundariesOlderThanAsync(
        DateTime threshold,
        CancellationToken ct = default);

    Task<AuditLogKpiSnapshot> GetKpiSnapshotAsync(
        TimeSpan window,
        DateTime? nowUtc = null,
        CancellationToken ct = default);

    Task<IReadOnlyList<ExecutionTreeNode>> GetExecutionTreeAsync(
        Guid executionId,
        CancellationToken ct = default);

    Task<IReadOnlyList<string>> GetDistinctSourceNodesAsync(CancellationToken ct = default);
}
```
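To make "idempotent on `EventId`" concrete, here is a minimal in-memory sketch of insert-if-not-exists semantics. `AuditEventSketch` and `InMemoryAuditLog` are hypothetical types for illustration, for example as a test double; the real implementation lives in the Configuration Database component and targets the `AuditLog` table.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical, trimmed-down stand-in for the real audit event type.
public sealed record AuditEventSketch(Guid EventId, string Action);

// In-memory illustration of insert-if-not-exists keyed on EventId.
public sealed class InMemoryAuditLog
{
    private readonly ConcurrentDictionary<Guid, AuditEventSketch> _rows = new();

    // Returns true when the row was stored, false when a row with the same
    // EventId already existed. A replayed event is ignored, never overwritten.
    public bool InsertIfNotExists(AuditEventSketch evt) => _rows.TryAdd(evt.EventId, evt);

    public int Count => _rows.Count;
}
```

Because a repeated delivery is a no-op, a sender can safely retry after a timeout without creating duplicate audit rows.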
|
||||
|
||||
### Cross-cutting service interfaces
|
||||
|
||||
`Interfaces/Services/` holds service interfaces for cross-cutting concerns that multiple components consume but do not implement in Commons itself.
|
||||
|
||||
| Interface | Purpose | Implemented by |
|---|---|---|
| `IAuditService` | Configuration-change audit log entry (`LogAsync`). Central components call this through the UoW. | Configuration Database |
| `IAuditWriter` | Site hot-path audit write (`WriteAsync`). Best-effort; must never throw back at the caller. | Audit Log |
| `ICentralAuditWriter` | Central direct-write for central-originated audit rows; insert-if-not-exists on `EventId`. | Audit Log |
| `ISiteAuditQueue` | Hands off site audit rows to the gRPC telemetry forwarder. | Audit Log |
| `ICachedCallLifecycleObserver` / `ICachedCallTelemetryForwarder` | Bridge between the S&F Engine's lifecycle transitions and the `CachedCallTelemetry` packet. | Audit Log |
| `INodeIdentityProvider` | Resolves the current node's `SourceNode` label (`node-a`, `central-b`, etc.). | Host |
| `IOperationTrackingStore` | Site-local SQLite tracking status store for `Tracking.Status(id)`. | Site Runtime |
| `IPartitionMaintenance` | Central partition-switch / retention purge hook. | Audit Log |
| `IDatabaseGateway` | Script-facing ADO.NET database access via named connections. | External System Gateway |
| `IExternalSystemClient` | Script-facing `ExternalSystem.Call()` / `CachedCall()` invocation. | External System Gateway |
| `IInstanceLocator` | Resolves instance unique name to site identifier for `Route.To()`. | Management Service |
| `INotificationDeliveryService` | Script-facing `Notify.Send()` routing with S&F fallback. | Notification Service |
|
||||
|
||||
Transport bundle interfaces (`IBundleExporter`, `IBundleImporter`, `IBundleSessionStore`, `IAuditCorrelationContext`) live in `Interfaces/Transport/` and are defined in Commons so the Configuration Database and Central UI can depend on the abstraction without taking a Transport component dependency.
|
||||
|
||||
### Key shared types
|
||||
|
||||
**`Result<T>`** is the system-wide discriminated result type. A failed result always carries a non-blank error message; callers pattern-match via `Match`:
|
||||
|
||||
```csharp
// Types/Result.cs
public sealed class Result<T>
{
    public bool IsSuccess { get; }
    public bool IsFailure => !IsSuccess;
    public T Value => IsSuccess ? _value! : throw new InvalidOperationException("...");
    public string Error => IsFailure ? _error! : throw new InvalidOperationException("...");

    public static Result<T> Success(T value) => new(value);
    public static Result<T> Failure(string error) => new(error);

    public TResult Match<TResult>(Func<T, TResult> onSuccess, Func<string, TResult> onFailure) =>
        IsSuccess ? onSuccess(_value!) : onFailure(_error!);
}
```
|
||||
|
||||
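As an illustration of the `Match` call shape, the sketch below uses a pared-down stand-in for `Result<T>` rather than the real type in `Types/Result.cs`. The `ResultSketch` namespace, the `PortParser` class, and its methods are hypothetical and exist only for this example; the blank-error guard mirrors the "non-blank error message" rule stated above but is an assumption about how it might be enforced.

```csharp
#nullable enable
using System;

namespace ResultSketch
{
    // Pared-down stand-in for Commons' Result<T>; just enough surface to show Match.
    public sealed class Result<T>
    {
        private readonly T? _value;
        private readonly string? _error;

        private Result(T? value, string? error, bool isSuccess)
        {
            _value = value;
            _error = error;
            IsSuccess = isSuccess;
        }

        public bool IsSuccess { get; }

        public static Result<T> Success(T value) => new(value, null, true);

        public static Result<T> Failure(string error)
        {
            // Assumed enforcement of the "failed result carries a non-blank error" rule.
            if (string.IsNullOrWhiteSpace(error))
                throw new ArgumentException("A failed result needs a non-blank error.", nameof(error));
            return new(default, error, false);
        }

        public TResult Match<TResult>(Func<T, TResult> onSuccess, Func<string, TResult> onFailure) =>
            IsSuccess ? onSuccess(_value!) : onFailure(_error!);
    }

    // Hypothetical caller: parses a TCP port and folds both outcomes into one string.
    public static class PortParser
    {
        public static Result<int> ParsePort(string text) =>
            int.TryParse(text, out var port) && port is > 0 and <= 65535
                ? Result<int>.Success(port)
                : Result<int>.Failure($"'{text}' is not a valid TCP port.");

        public static string Describe(string text) =>
            ParsePort(text).Match(
                port => $"listening on {port}",
                error => $"rejected: {error}");
    }
}
```

Because `Match` forces the caller to supply both branches, neither `Value` nor `Error` is ever read on the wrong side of the result.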
**`TrackedOperationId`** is the strongly-typed GUID that identifies a cached outbound operation end-to-end — it is the idempotency key on every `AuditLog` row for that lifecycle and the primary key on the central `SiteCalls` row:
|
||||
|
||||
```csharp
// Types/TrackedOperationId.cs
public readonly record struct TrackedOperationId(Guid Value)
{
    public static TrackedOperationId New() => new(Guid.NewGuid());
    public static TrackedOperationId Parse(string s) => new(Guid.Parse(s));
    public static bool TryParse(string? s, out TrackedOperationId result) { ... }
    public override string ToString() => Value.ToString("D");
}
```
|
||||
|
||||
**`AlarmConditionState`** is the unified, read-only alarm condition model shared by computed and native alarms. Computed alarms populate it from `State` + `Priority`; native alarms mirror it from the OPC UA or MxAccess source:
|
||||
|
||||
```csharp
// Types/Alarms/AlarmConditionState.cs
public record AlarmConditionState(
    bool Active,
    bool Acknowledged,
    bool? Confirmed,          // null when the condition is not confirmable
    AlarmShelveState Shelve,
    bool Suppressed,
    int Severity);            // 0–1000 unified scale
```
|
||||
|
||||
**`ScadaBridgeAuditEventFactory`** is the single construction point for a canonical `AuditEvent`. Every audit emit site builds its row through `Create` so the domain-vocabulary-to-canonical-field mapping (`Channel`/`Kind`/`Status` → `Action`/`Category`/`Outcome`; all other ScadaBridge domain fields → `DetailsJson`) is applied identically everywhere with no per-site drift.
|
||||
|
||||
### Protocol abstraction
|
||||
|
||||
`Interfaces/Protocol/` defines the Data Connection Layer's protocol-neutral interfaces.
|
||||
|
||||
`IDataConnection` is the base interface for reading, writing, and subscribing to device data regardless of protocol. `IBrowsableDataConnection` is an optional capability interface for address-space browsing. `IAlarmSubscribableConnection` is an optional capability interface for connections that can mirror native alarms — implementations expose `SubscribeAlarmsAsync` and `UnsubscribeAlarmsAsync`, delivering transitions as protocol-neutral `NativeAlarmTransition` records via `AlarmTransitionCallback`. The `DataConnectionActor` consumes these via capability checks (runtime `is` cast), keeping protocol knowledge out of the core actor logic.
|
||||
|
||||
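The runtime capability check described above can be sketched as follows. The interface shapes here are deliberately trimmed and hypothetical (the real `IDataConnection` and `IAlarmSubscribableConnection` carry more members, and the real callback delivers `NativeAlarmTransition` records rather than strings); the `CapabilitySketch` namespace and the `AlarmWiring` helper are assumptions made for illustration.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

namespace CapabilitySketch
{
    // Hypothetical, trimmed-down shapes of the protocol-neutral interfaces.
    public interface IDataConnection
    {
        string Name { get; }
    }

    public interface IAlarmSubscribableConnection
    {
        Task SubscribeAlarmsAsync(Action<string> onTransition, CancellationToken ct = default);
    }

    // A connection without native alarm support.
    public sealed class PlainConnection : IDataConnection
    {
        public string Name => "plain";
    }

    // A connection that opts in to the optional alarm capability.
    public sealed class AlarmCapableConnection : IDataConnection, IAlarmSubscribableConnection
    {
        public string Name => "alarm-capable";

        public Task SubscribeAlarmsAsync(Action<string> onTransition, CancellationToken ct = default)
        {
            onTransition("subscribed");
            return Task.CompletedTask;
        }
    }

    public static class AlarmWiring
    {
        // Returns true when the connection advertised the capability and was subscribed.
        public static async Task<bool> TrySubscribeAlarmsAsync(
            IDataConnection connection, Action<string> onTransition, CancellationToken ct = default)
        {
            // Capability check: no protocol-specific type appears here, only the optional interface.
            if (connection is not IAlarmSubscribableConnection alarms)
                return false;

            await alarms.SubscribeAlarmsAsync(onTransition, ct);
            return true;
        }
    }
}
```

The caller never names OPC UA or MxAccess; a new protocol gains alarm mirroring by implementing the optional interface, with no change to the consuming code.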
### Message contracts
|
||||
|
||||
`Messages/` organizes contracts by concern rather than by sender/receiver:
|
||||
|
||||
| Folder | Key types |
|---|---|
| `Deployment/` | `DeployInstanceCommand`, `DeploymentStatusResponse`, `FlattenedConfigurationSnapshot` |
| `Lifecycle/` | `DisableInstanceCommand`, `EnableInstanceCommand`, `DeleteInstanceCommand`, `InstanceLifecycleResponse` |
| `Health/` | `SiteHealthReport`, `HeartbeatMessage`, `NodeStatus`, `TagQualityCounts` |
| `Streaming/` | `AttributeValueChanged`, `AlarmStateChanged` (additively enriched for both computed and native alarms) |
| `Integration/` | `CachedCallTelemetry`, `AuditTelemetryEnvelope`, `PullAuditEventsRequest/Response` |
| `Notification/` | `NotificationSubmit`, `NotificationSubmitAck`, `NotificationStatusQuery/Response` |
| `Audit/` | `IngestAuditEventsCommand/Reply`, `IngestCachedTelemetryCommand/Reply`, `UpsertSiteCallCommand/Reply` |
| `RemoteQuery/` | Event log queries, parked-message queries, `ParkedOperationRelayMessages` |
| `Management/` | All HTTP Management API commands per domain area, `ManagementEnvelope`, `TransportCommands` |
|
||||
|
||||
`CachedCallTelemetry` carries one combined packet per lifecycle event so central can write the `AuditLog` row and the `SiteCalls` upsert in a single MS SQL transaction:
|
||||
|
||||
```csharp
// Messages/Integration/CachedCallTelemetry.cs
public sealed record CachedCallTelemetry(
    AuditEvent Audit,
    SiteCallOperational Operational);
```
|
||||
|
||||
`AlarmStateChanged` demonstrates the additive-only evolution rule in practice — the original positional constructor still compiles; native alarm fields are `init` properties with safe defaults, so existing computed-alarm emitters need no change:
|
||||
|
||||
```csharp
// Messages/Streaming/AlarmStateChanged.cs
public record AlarmStateChanged(
    string InstanceUniqueName,
    string AlarmName,
    AlarmState State,
    int Priority,
    DateTimeOffset Timestamp) : ISiteStreamEvent
{
    public AlarmLevel Level { get; init; } = AlarmLevel.None;
    public AlarmKind Kind { get; init; } = AlarmKind.Computed;
    public AlarmConditionState Condition { get; init; } = ...; // defaults to computed projection
    public string SourceReference { get; init; } = string.Empty;
    // ... additional native-alarm fields with empty defaults
}
```
|
||||
|
||||
### Observability
|
||||
|
||||
`Observability/ScadaBridgeTelemetry` defines the singleton `Meter` named `ZB.MOM.WW.ScadaBridge` and the application-wide instrument definitions. Components call the static emit helpers (`RecordDeploymentApplied`, `SiteConnectionOpened`, etc.) rather than creating their own meters. Instruments are no-ops until an OTel listener attaches, so uninstrumented hosts pay no overhead.
|
||||
|
||||
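The shared-meter pattern can be sketched with `System.Diagnostics.Metrics`. Only the meter name comes from the text above; the instrument name `scadabridge.deployments.applied`, the `TelemetrySketch` namespace, and the `ListenerDemo` helper are assumptions for illustration and are not the actual definitions in `ScadaBridgeTelemetry`.

```csharp
using System.Diagnostics.Metrics;

namespace TelemetrySketch
{
    public static class SketchTelemetry
    {
        // Meter name taken from the text; everything else in this class is assumed.
        private static readonly Meter Meter = new("ZB.MOM.WW.ScadaBridge");

        private static readonly Counter<long> DeploymentsApplied =
            Meter.CreateCounter<long>("scadabridge.deployments.applied");

        // Static emit helper: components call this instead of creating their own meter.
        public static void RecordDeploymentApplied() => DeploymentsApplied.Add(1);
    }

    public static class ListenerDemo
    {
        // Attaches a MeterListener, emits n measurements, and returns the total it observed.
        public static long CountWithListener(int n)
        {
            long total = 0;
            using var listener = new MeterListener();

            // Opt in only to instruments published by the shared meter.
            listener.InstrumentPublished = (instrument, l) =>
            {
                if (instrument.Meter.Name == "ZB.MOM.WW.ScadaBridge")
                    l.EnableMeasurementEvents(instrument);
            };
            listener.SetMeasurementEventCallback<long>(
                (instrument, value, tags, state) => total += value);
            listener.Start();

            for (var i = 0; i < n; i++)
                SketchTelemetry.RecordDeploymentApplied();

            return total;
        }
    }
}
```

In a real host the listener role is played by the OpenTelemetry SDK; `MeterListener` is used here only so the sketch can observe its own measurements without extra packages.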
## Usage
|
||||
|
||||
Commons is consumed through direct project references — all other components in the solution reference it. There is nothing to register or configure; the types are available as soon as the project reference is in place.
|
||||
|
||||
When adding a new entity class: add the POCO to the appropriate `Entities/<DomainArea>/` subfolder with no EF attributes, then add the corresponding repository method signature to the relevant interface in `Interfaces/Repositories/`. The Configuration Database component owns the EF mapping and the implementation.
|
||||
|
||||
When adding a new message contract: add an immutable `record` to the appropriate `Messages/<Concern>/` subfolder. If the message will cross the site→central version-skew boundary, apply the additive-only rule immediately — use `init` properties with defaults for any fields beyond the initial set so older receivers can safely ignore unknown fields.
|
||||
|
||||
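A minimal sketch of the additive-only rule for a new message contract follows. `PumpStatusChanged`, its fields, and the `Emitters` helper are hypothetical names invented for this example; the point is only the split between the initial positional parameters and later `init` properties with defaults.

```csharp
using System;

namespace MessageSketch
{
    // Hypothetical contract: the three positional fields are the "initial set".
    public record PumpStatusChanged(
        string InstanceUniqueName,
        bool Running,
        DateTimeOffset Timestamp)
    {
        // Added in a later version: init-only with a safe default, so emitters
        // written against the original constructor compile and behave unchanged.
        public string Reason { get; init; } = string.Empty;
    }

    public static class Emitters
    {
        // An "old" emitter that knows nothing about Reason.
        public static PumpStatusChanged Old(DateTimeOffset at) =>
            new("Pump-01", true, at);

        // A "new" emitter that populates the additive field.
        public static PumpStatusChanged New(DateTimeOffset at) =>
            new("Pump-01", false, at) { Reason = "operator stop" };
    }
}
```

Adding `Reason` as a fourth positional parameter instead would break every existing `new PumpStatusChanged(...)` call site, which is exactly what the rule is meant to avoid.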
## Dependencies & Interactions
|
||||
|
||||
- **No runtime dependencies** — Commons references only the core .NET SDK. It does not reference Akka.NET, ASP.NET Core, Entity Framework Core, or any third-party library.
|
||||
- [Configuration Database (#17)](./ConfigurationDatabase.md) — implements every repository interface defined here (`ITemplateEngineRepository`, `IAuditLogRepository`, etc.) via EF Core Fluent API; maps the POCO entity classes to the underlying MS SQL schema.
|
||||
- **All other components** — reference Commons for shared types, entity classes, interface contracts, and message definitions. The dependency graph is strictly one-way: Commons knows nothing about its consumers.
|
||||
- [Audit Log (#23)](./AuditLog.md) — implements `IAuditWriter`, `ICentralAuditWriter`, `ISiteAuditQueue`, `ICachedCallLifecycleObserver`, and `ICachedCallTelemetryForwarder`; consumes `ScadaBridgeAuditEventFactory`, `AuditDetailsCodec`, `AuditRowProjection`, and the audit message contracts defined here.
|
||||
- [Site Call Audit (#22)](./SiteCallAudit.md) — consumes `ISiteCallAuditRepository` and the `CachedCallTelemetry` / `UpsertSiteCallCommand` message types.
|
||||
- [Notification Outbox (#21)](./NotificationOutbox.md) — consumes `INotificationOutboxRepository` and the `NotificationSubmit` / `NotificationSubmitAck` contracts.
|
||||
- [Transport (#24)](./Transport.md) — its interfaces (`IBundleExporter`, `IBundleImporter`, `IBundleSessionStore`, `IAuditCorrelationContext`) and value objects (`BundleManifest`, `ImportPreview`, etc.) are defined in Commons so Configuration Database and Central UI can depend on the abstraction without a Transport project reference.
|
||||
- Design spec: [Component-Commons.md](../requirements/Component-Commons.md).
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Commons design specification](../requirements/Component-Commons.md)
|
||||
- [Configuration Database](./ConfigurationDatabase.md)
|
||||
- [Audit Log](./AuditLog.md)
|
||||
- [Site Call Audit](./SiteCallAudit.md)
|
||||
- [Notification Outbox](./NotificationOutbox.md)
|
||||
- [Transport](./Transport.md)
|
||||
- [Data Connection Layer](./DataConnectionLayer.md)
|
||||
- [Health Monitoring](./HealthMonitoring.md)
|
||||
@@ -0,0 +1,277 @@
|
||||
# Central–Site Communication
|
||||
|
||||
The Central–Site Communication component is the transport layer that connects the central cluster to every site cluster. It provides two independent transports — Akka.NET `ClusterClient` for command/control and gRPC server-streaming for real-time data — wired together through a pair of actors that each cluster registers with the `ClusterClientReceptionist`.
|
||||
|
||||
## Overview
|
||||
|
||||
Communication (#5) runs on every node in every cluster. The component code lives in `src/ZB.MOM.WW.ScadaBridge.Communication/`, organised as follows:
|
||||
|
||||
- `Actors/` — `CentralCommunicationActor`, `SiteCommunicationActor`, `DebugStreamBridgeActor`, `StreamRelayActor`.
|
||||
- `Grpc/` — `SiteStreamGrpcServer`, `SiteStreamGrpcClient`, `SiteStreamGrpcClientFactory`, `ISiteStreamSubscriber`, and the proto DTO mappers.
|
||||
- `Protos/` — `sitestream.proto` (proto source; generated C# is vendored in `SiteStreamGrpc/`).
|
||||
- `CommunicationService.cs` — typed Ask-pattern façade used by callers on the central side.
|
||||
- `DebugStreamService.cs` — session manager for debug stream bridge actors.
|
||||
- `CommunicationOptions.cs` — configuration options class.
|
||||
- `ServiceCollectionExtensions.cs` — DI registration (`AddCommunication`).
|
||||
|
||||
DI registration is called from the Host composition root via `AddCommunication`. The actors themselves are created inside `AkkaHostedService.RegisterCentralActors` / `RegisterSiteActors` because they must be created within the actor system, not by the DI container.
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### Two transports, two concerns
|
||||
|
||||
| Transport | Direction | Purpose |
|-----------|-----------|---------|
| Akka.NET `ClusterClient` | bidirectional (command/control) | Deployments, lifecycle, subscribe/unsubscribe handshake, snapshots, heartbeats, health reports, telemetry, notifications |
| gRPC server-streaming (`SiteStreamService`) | site → central | Real-time attribute value and alarm state changes |
|
||||
|
||||
The transports are independent. A gRPC stream interruption does not affect in-flight `ClusterClient` commands, and vice versa.
|
||||
|
||||
### Hub-and-spoke topology
|
||||
|
||||
Sites do not communicate with each other. All inter-cluster traffic flows through central. Central maintains one `ClusterClient` per site; each site maintains a single `ClusterClient` pointed at both central nodes.
|
||||
|
||||
### `SiteEnvelope` routing
|
||||
|
||||
Central-side callers wrap outbound messages in a `SiteEnvelope(SiteId, Message)`. `CentralCommunicationActor` resolves the site's `ClusterClient` by `SiteId` and forwards the inner message to `/user/site-communication` on the site:
|
||||
|
||||
```csharp
// CommunicationService.cs: deployment pattern
public async Task<DeploymentStatusResponse> DeployInstanceAsync(
    string siteId, DeployInstanceCommand command, CancellationToken cancellationToken = default)
{
    var envelope = new SiteEnvelope(siteId, command);
    return await GetActor().Ask<DeploymentStatusResponse>(
        envelope, _options.DeploymentTimeout, cancellationToken);
}
```
|
||||
|
||||
`CentralCommunicationActor.HandleSiteEnvelope` extracts the inner message and routes it via the cached `ClusterClient`:
|
||||
|
||||
```csharp
private void HandleSiteEnvelope(SiteEnvelope envelope)
{
    if (!_siteClients.TryGetValue(envelope.SiteId, out var entry))
    {
        _log.Warning("No ClusterClient for site {0}, cannot route message {1}",
            envelope.SiteId, envelope.Message.GetType().Name);
        return; // no reply is sent: the caller's Ask times out (no central buffering)
    }

    entry.Client.Tell(
        new ClusterClient.Send("/user/site-communication", envelope.Message),
        Sender);
}
```
|
||||
|
||||
### No central buffering
|
||||
|
||||
If a site is unreachable when a command arrives, the caller's Ask times out. Central never queues command/control messages on behalf of a site. This is deliberate: it keeps the central coordinator stateless with respect to site availability and pushes retry responsibility to the operator or to the Store-and-Forward Engine for messages that tolerate it.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Central-side: `CentralCommunicationActor`
|
||||
|
||||
`CentralCommunicationActor` is a `ReceiveActor` created at `/user/central-communication` and registered with `ClusterClientReceptionist` so the site's `ClusterClient` can locate it. It owns:
|
||||
|
||||
- A `Dictionary<string, (IActorRef Client, ImmutableHashSet<string> ContactAddresses)>` keyed by site identifier — one `ClusterClient` per site.
|
||||
- A `RefreshSiteAddresses` periodic timer (60-second cadence, starting immediately). Each tick fires `LoadSiteAddressesFromDb`, which reads every `Site` row from the database, parses `NodeAAddress` and `NodeBAddress` into Akka receptionist paths (`{addr}/system/receptionist`), and pipes a `SiteAddressCacheLoaded` message back to Self. `HandleSiteAddressCacheLoaded` creates, updates, or stops `ClusterClient` actors based on the diff.
|
||||
- Proxy references to `NotificationOutboxActor` and `AuditLogIngestActor` cluster singletons, injected post-construction via `RegisterNotificationOutbox` / `RegisterAuditIngest` messages from the Host. Messages that arrive before the proxy is registered are answered with a non-accepted ack (notifications) or an empty reply (audit), so the site retries without data loss.
|
||||
- Fanout of `SiteHealthReport` to the peer central node via `DistributedPubSub`, keyed on the `site-health-replica` topic, so both central nodes' aggregators stay in sync regardless of which central node the site's `ClusterClient` load-balanced the report to.
|
||||
|
||||
`ISiteClientFactory` / `DefaultSiteClientFactory` abstract `ClusterClient` construction for testability.
|
||||
|
||||
### Site-side: `SiteCommunicationActor`
|
||||
|
||||
`SiteCommunicationActor` is a `ReceiveActor` created at `/user/site-communication` and registered with `ClusterClientReceptionist`. It owns:
|
||||
|
||||
- An `IActorRef? _centralClient` — the site's outbound `ClusterClient` to central. Injected post-construction via `RegisterCentralClient`.
|
||||
- A `Timers`-based heartbeat (default 5-second interval, first tick after 1 second). Each tick sends a `HeartbeatMessage` with `IsActive` stamped from the Akka `Cluster` leader check — the node is active when its `MemberStatus` is `Up` and it holds cluster leadership.
|
||||
- Dispatch to local handlers for every inbound command pattern. Handlers for event-log, parked-message, integration, and artifact patterns are registered post-construction via `RegisterLocalHandler`; unregistered patterns receive an inline error reply so the central Ask does not stall.
|
||||
|
||||
Site-to-central messages (health reports, audit batches, notification submissions) are sent via:
|
||||
|
||||
```csharp
_centralClient.Tell(
    new ClusterClient.Send("/user/central-communication", msg), Sender);
```
|
||||
|
||||
The original `Sender` is forwarded as the `ClusterClient.Send` sender so any reply from central routes straight back to the waiting Ask on the site, not through `SiteCommunicationActor`.
|
||||
|
||||
### Address loading and the 60-second refresh
|
||||
|
||||
`CentralCommunicationActor` calls `ISiteRepository.GetAllSitesAsync` inside a background `Task.Run` (to avoid blocking the actor thread on a database round-trip) and pipes the result as `SiteAddressCacheLoaded`. The actor-lifecycle `CancellationTokenSource` is threaded into the repository call so a slow MS SQL query is cancelled when the actor stops.
|
||||
|
||||
A malformed address for one site does not abort the refresh loop — the actor catches the parse failure, logs a warning, skips that site, and processes the rest. The refresh also runs immediately on startup (`TimeSpan.Zero` initial delay) so the cache is populated before the first command arrives.
|
||||
|
||||
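The create/update/stop diff and the per-site skip on a bad address can be sketched as below. This is an illustration under stated assumptions, not the actor's code: `SiteRow`, `RefreshPlan`, and `AddressRefresh` are hypothetical names, the address check is a simple string test standing in for `ActorPath.Parse`, and treating a skipped site that already has a client as "absent" is a guess, since the text does not say what happens in that case.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Linq;

namespace AddressRefreshSketch
{
    public sealed record SiteRow(string SiteId, string? NodeAAddress, string? NodeBAddress);

    public sealed record RefreshPlan(
        IReadOnlyList<string> Create, IReadOnlyList<string> Update, IReadOnlyList<string> Stop);

    public static class AddressRefresh
    {
        // Returns null when the site must be skipped (no addresses, or any malformed one).
        private static ImmutableHashSet<string>? TryBuildContacts(SiteRow site)
        {
            var contacts = ImmutableHashSet.CreateBuilder<string>();
            foreach (var address in new[] { site.NodeAAddress, site.NodeBAddress })
            {
                if (string.IsNullOrWhiteSpace(address)) continue; // node address not set
                // Stand-in for the real Akka path parsing.
                if (!address.StartsWith("akka.tcp://", StringComparison.Ordinal) || address.Contains(' '))
                    return null; // malformed: skip this site only
                contacts.Add(address.TrimEnd('/') + "/system/receptionist");
            }
            return contacts.Count == 0 ? null : contacts.ToImmutable();
        }

        // Compares the cached contact sets with freshly loaded rows.
        public static RefreshPlan Plan(
            IReadOnlyDictionary<string, ImmutableHashSet<string>> current,
            IEnumerable<SiteRow> loaded)
        {
            var desired = new Dictionary<string, ImmutableHashSet<string>>();
            foreach (var site in loaded)
            {
                var contacts = TryBuildContacts(site);
                if (contacts is null) continue; // a real actor would log a warning here
                desired[site.SiteId] = contacts;
            }

            var create = desired.Keys.Where(id => !current.ContainsKey(id)).OrderBy(id => id).ToList();
            var update = desired
                .Where(kv => current.TryGetValue(kv.Key, out var old) && !old.SetEquals(kv.Value))
                .Select(kv => kv.Key).OrderBy(id => id).ToList();
            var stop = current.Keys.Where(id => !desired.ContainsKey(id)).OrderBy(id => id).ToList();
            return new RefreshPlan(create, update, stop);
        }
    }
}
```

Keeping the contact addresses as a set makes the "did anything change" test a single `SetEquals` call, which is presumably why the cache stores an `ImmutableHashSet<string>` next to each client reference.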
`CommunicationService.RefreshSiteAddresses()` triggers an on-demand refresh when a site record is added, edited, or deleted from the Central UI or CLI.
|
||||
|
||||
### gRPC real-time data transport
|
||||
|
||||
Real-time attribute value and alarm state changes are delivered over `SiteStreamService`, a gRPC server-streaming service defined in `sitestream.proto`.
|
||||
|
||||
**Site-side** — `SiteStreamGrpcServer` (Kestrel HTTP/2, port 8083):
|
||||
|
||||
- Implements `SiteStreamService.SiteStreamServiceBase`.
|
||||
- For each `SubscribeInstance` call, creates a `StreamRelayActor` (named `stream-relay-{correlationId}-{seq}`) and subscribes it to `ISiteStreamSubscriber` (implemented by `SiteStreamManager` in the Site Runtime project — `SiteStreamGrpcServer` holds only the interface so it does not reference `SiteRuntime` directly).
|
||||
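- Bridges events from the actor thread to the async gRPC write loop via a bounded `Channel<SiteStreamEvent>` (capacity 1000, `BoundedChannelFullMode.DropOldest`), so a slow consumer loses the oldest queued events instead of blocking the actor.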
|
||||
- Enforces a `GrpcMaxConcurrentStreams` limit (default 100) and a `GrpcMaxStreamLifetime` session timeout (default 4 hours) to evict zombie streams.
|
||||
- Validates `correlation_id` against `ActorPath.IsValidPathElement` before use in an actor name, rejecting invalid values with `StatusCode.InvalidArgument`.
|
||||
- During `CoordinatedShutdown`, `CancelAllStreams()` flips `_shuttingDown`, refuses new subscriptions with `StatusCode.Unavailable`, and cancels all active `CancellationTokenSource`s.
|
||||
|
||||
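The drop-oldest behaviour of the bridging channel can be seen in a small standalone sketch. The capacity and element type are shrunk for readability (the server uses capacity 1000 and `SiteStreamEvent`); the `ChannelSketch` namespace and the `DropOldestDemo` helper are hypothetical.

```csharp
using System.Collections.Generic;
using System.Threading.Channels;

namespace ChannelSketch
{
    public static class DropOldestDemo
    {
        // Writes 0..writes-1 into a bounded channel, then drains whatever survived.
        public static List<int> WriteThenDrain(int capacity, int writes)
        {
            var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(capacity)
            {
                FullMode = BoundedChannelFullMode.DropOldest,
                SingleReader = true
            });

            for (var i = 0; i < writes; i++)
                channel.Writer.TryWrite(i); // never blocks; evicts the oldest item when full

            channel.Writer.Complete();

            var drained = new List<int>();
            while (channel.Reader.TryRead(out var item))
                drained.Add(item);
            return drained;
        }
    }
}
```

With a capacity of 3 and five writes, the drained list is `2, 3, 4`: the producer is never blocked, and the consumer sees only the most recent values.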
`StreamRelayActor` is a lightweight `ReceiveActor` that converts `AttributeValueChanged` and `AlarmStateChanged` domain events to proto `SiteStreamEvent` messages and writes them to the channel writer:
|
||||
|
||||
```csharp
// StreamRelayActor.cs
private void HandleAttributeValueChanged(AttributeValueChanged msg)
{
    var protoEvent = new SiteStreamEvent
    {
        CorrelationId = _correlationId,
        AttributeChanged = new AttributeValueUpdate
        {
            InstanceUniqueName = msg.InstanceUniqueName,
            AttributePath = msg.AttributePath,
            AttributeName = msg.AttributeName,
            Value = ValueFormatter.FormatDisplayValue(msg.Value),
            Quality = MapQuality(msg.Quality),
            Timestamp = Timestamp.FromDateTimeOffset(msg.Timestamp)
        }
    };
    WriteToChannel(protoEvent);
}
```
|
||||
|
||||
**Central-side** — `SiteStreamGrpcClient` / `SiteStreamGrpcClientFactory`:
|
||||
|
||||
- `SiteStreamGrpcClientFactory` (singleton) caches one `SiteStreamGrpcClient` per site identifier. On `GetOrCreate`, it compares the cached client's `Endpoint` to the requested endpoint and atomically replaces a stale client (different endpoint — NodeA→NodeB failover flip, or an edited address) with a fresh one.
|
||||
- `SiteStreamGrpcClient` opens a `GrpcChannel` with HTTP/2 keepalive (`KeepAlivePingDelay` default 15 s, `KeepAlivePingTimeout` default 10 s, `KeepAlivePingPolicy` set to `HttpKeepAlivePingPolicy.Always`). It calls `SubscribeInstance` and reads the response stream in a background `Task.Run`, invoking `onEvent` for each received event and `onError` on any non-cancellation exception.
|
||||
|
||||
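The keepalive settings named above map onto `SocketsHttpHandler` properties. The sketch below builds only the handler, using the BCL alone; the `KeepAliveSketch` namespace and the factory are hypothetical, and how `SiteStreamGrpcClient` actually wires the handler into its channel is not shown in the text.

```csharp
using System;
using System.Net.Http;

namespace KeepAliveSketch
{
    public static class KeepAliveHandlerFactory
    {
        // Builds an HTTP handler that sends HTTP/2 PING frames even when no call is active.
        public static SocketsHttpHandler Create(TimeSpan pingDelay, TimeSpan pingTimeout) => new()
        {
            KeepAlivePingDelay = pingDelay,       // idle time before a PING is sent
            KeepAlivePingTimeout = pingTimeout,   // how long to wait for the PING ack
            KeepAlivePingPolicy = HttpKeepAlivePingPolicy.Always,
        };

        // Rough worst-case time to notice a silently dead peer with these settings.
        public static TimeSpan WorstCaseDetection(TimeSpan pingDelay, TimeSpan pingTimeout) =>
            pingDelay + pingTimeout;
    }
}
```

With Grpc.Net.Client, such a handler is typically supplied through `GrpcChannelOptions.HttpHandler` when calling `GrpcChannel.ForAddress`. At the documented defaults (15 s delay, 10 s timeout) the worst case is about 25 s, which is where the keepalive row in the dead-client detection table gets its figure.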
### Debug stream session lifecycle
|
||||
|
||||
`DebugStreamService` manages one `DebugStreamBridgeActor` per active debug session. On `StartStreamAsync`, it resolves the instance's site and gRPC addresses, creates the bridge actor, and holds the session in a `ConcurrentDictionary`.
|
||||
|
||||
`DebugStreamBridgeActor` (one per session, short-lived, no persistence):
|
||||
|
||||
1. In `PreStart`, sends `SubscribeDebugViewRequest` to `CentralCommunicationActor` (ClusterClient, for the initial snapshot).
|
||||
2. On receiving `DebugViewSnapshot`, fires `onEvent(snapshot)` and calls `OpenGrpcStream`.
|
||||
3. `OpenGrpcStream` calls `_grpcFactory.GetOrCreate(siteId, endpoint)` and launches `client.SubscribeAsync(...)` as a background task. Domain events are marshalled back to the actor via `Self.Tell` for thread safety.
|
||||
4. On a gRPC error, flips to the other node endpoint and retries (first retry immediate, subsequent retries with `ReconnectDelay` default 5 s). The retry budget (`MaxRetries = 3`) is recovered only after `StabilityWindow` (default 60 s) of uninterrupted connection — a stream that delivers one event then immediately fails does not count as stable.
|
||||
5. On `StopDebugStream`, cancels the gRPC subscription and sends `UnsubscribeDebugViewRequest` to the site via ClusterClient.
|
||||
|
||||
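The retry-budget bookkeeping in step 4 can be sketched as a small state holder. This covers only the budget and stability-window logic, not the endpoint flip or the reconnect delay; the `RetryBudget` class, its method names, and the explicit `now` parameter (used so the logic is testable without a clock) are assumptions, not the bridge actor's actual members.

```csharp
using System;

namespace RetrySketch
{
    public sealed class RetryBudget
    {
        private readonly int _maxRetries;
        private readonly TimeSpan _stabilityWindow;
        private int _used;
        private DateTimeOffset? _connectedAt;

        public RetryBudget(int maxRetries, TimeSpan stabilityWindow)
        {
            _maxRetries = maxRetries;
            _stabilityWindow = stabilityWindow;
        }

        // Call when a stream has been (re)opened.
        public void OnConnected(DateTimeOffset now) => _connectedAt = now;

        // Call on a stream error. Returns true if another attempt is allowed.
        public bool OnError(DateTimeOffset now)
        {
            // The budget is recovered only if the connection stayed up for the whole window;
            // a stream that fails shortly after connecting does not count as stable.
            if (_connectedAt is { } since && now - since >= _stabilityWindow)
                _used = 0;

            _connectedAt = null;

            if (_used >= _maxRetries)
                return false; // budget exhausted: the session should terminate

            _used++;
            return true;
        }
    }
}
```

Measuring stability from the connect time rather than from the last received event is what keeps a flapping stream from resetting its own budget.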
### Proto definition summary
|
||||
|
||||
```proto
// Protos/sitestream.proto
service SiteStreamService {
  rpc SubscribeInstance(InstanceStreamRequest) returns (stream SiteStreamEvent);
  rpc IngestAuditEvents(AuditEventBatch) returns (IngestAck);
  rpc IngestCachedTelemetry(CachedTelemetryBatch) returns (IngestAck);
  rpc PullAuditEvents(PullAuditEventsRequest) returns (PullAuditEventsResponse);
}
```
|
||||
|
||||
`SubscribeInstance` carries the real-time data stream. The other three RPCs (`IngestAuditEvents`, `IngestCachedTelemetry`, `PullAuditEvents`) serve the Audit Log component's gRPC telemetry push and reconciliation pull paths — `SiteStreamGrpcServer` hosts them on the same port because sites already listen there.
|
||||
|
||||
`SiteStreamEvent` uses a `oneof event { AttributeValueUpdate, AlarmStateUpdate }` discriminator. `AlarmStateUpdate` carries the full native alarm condition (fields 8–21) alongside the base computed-alarm fields (1–7), added additively so old clients ignoring unknown fields continue to work.
|
||||
|
||||
## Usage
|
||||
|
||||
Central callers interact through `CommunicationService`, which wraps each command pattern in a typed `Ask` with a per-pattern timeout:
|
||||
|
||||
| Pattern | Method | Timeout |
|---------|--------|---------|
| Instance deployment | `DeployInstanceAsync` | 120 s |
| Instance lifecycle | `DisableInstanceAsync`, `EnableInstanceAsync`, `DeleteInstanceAsync` | 30 s |
| Artifact deployment | `DeployArtifactsAsync` | 60 s |
| Integration routing | `RouteIntegrationCallAsync` | 30 s |
| Debug snapshot | `RequestDebugSnapshotAsync` | 30 s |
| Remote queries | `QueryEventLogsAsync`, `QueryParkedMessagesAsync`, etc. | 30 s |
| OPC UA tag browse | `BrowseNodeAsync` | 30 s |
| Notification outbox (central-local) | `QueryNotificationOutboxAsync`, `RetryNotificationAsync`, etc. | 30 s |
| Site Call Audit (central-local) | `QuerySiteCallsAsync`, `RetrySiteCallAsync`, etc. | 30 s |
|
||||
|
||||
Notification Outbox and Site Call Audit actors are central-local singletons — their `CommunicationService` methods Ask the proxy directly without wrapping in `SiteEnvelope`.
|
||||
|
||||
For real-time streaming, callers use `DebugStreamService.StartStreamAsync`, which creates a `DebugStreamBridgeActor` and returns a session handle. Ongoing events arrive via the `onEvent` callback; session teardown is via `StopStreamAsync`.
|
||||
|
||||
## Configuration
|
||||
|
||||
All options are bound from the `Communication` section via `CommunicationOptions`:
|
||||
|
||||
| Key | Default | Description |
|-----|---------|-------------|
| `DeploymentTimeout` | `00:02:00` | Ask timeout for instance deployment commands. |
| `LifecycleTimeout` | `00:00:30` | Ask timeout for lifecycle commands (disable, enable, delete). |
| `ArtifactDeploymentTimeout` | `00:01:00` | Ask timeout for system-wide artifact deployment. |
| `QueryTimeout` | `00:00:30` | Ask timeout for remote queries and management commands. |
| `IntegrationTimeout` | `00:00:30` | Ask timeout for integration routing and Inbound API routing. |
| `DebugViewTimeout` | `00:00:10` | Ask timeout for debug subscribe/unsubscribe handshake. |
| `NotificationForwardTimeout` | `00:00:30` | Ask timeout for notification submission forwarding. |
| `CentralContactPoints` | `[]` | Site-side: Akka addresses of central nodes, e.g. `akka.tcp://scadabridge@central-a:8081`. |
| `GrpcKeepAlivePingDelay` | `00:00:15` | HTTP/2 keepalive PING interval on `SiteStreamGrpcClient`. |
| `GrpcKeepAlivePingTimeout` | `00:00:10` | HTTP/2 keepalive PING timeout. |
| `GrpcMaxStreamLifetime` | `04:00:00` | Per-stream session timeout; forces reconnect of zombie streams. |
| `GrpcMaxConcurrentStreams` | `100` | Max concurrent `SubscribeInstance` streams per site node. |
| `TransportHeartbeatInterval` | `00:00:05` | `SiteCommunicationActor` heartbeat cadence. |
| `TransportFailureThreshold` | `00:00:15` | Akka remoting failure-detection threshold. |
|
||||
|
||||
Three layers of dead-client detection protect the gRPC stream path:
|
||||
|
||||
| Layer | Detects | Timeline |
|-------|---------|----------|
| TCP RST | Clean process death, connection close | ~1–5 s |
| gRPC keepalive PING | Network partition, silent crash | ~25 s |
| Session timeout (`GrpcMaxStreamLifetime`) | Zombie streams with misconfigured keepalive | 4 h |
|
||||
|
||||
## Dependencies & Interactions
|
||||
|
||||
- [Commons (#16)](./Commons.md) — owns the message contracts used by this component: `DeployInstanceCommand`, `SiteEnvelope`, `HeartbeatMessage`, `SiteHealthReport`, `SiteHealthReportReplica`, `RegisterNotificationOutbox`, `IngestAuditEventsCommand`, `IngestCachedTelemetryCommand`, and the other request/response records. The exception is `RegisterAuditIngest`: it carries an `IActorRef`, and because Commons does not hold an Akka package reference, it lives in this project instead.
|
||||
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — provides `ClusterClientReceptionist` registration and the active/standby leader model that `SiteCommunicationActor`'s `IsActive` check and `CentralCommunicationActor`'s `DistributedPubSub` fanout both depend on.
|
||||
- [Configuration Database (#17)](./ConfigurationDatabase.md) — provides `ISiteRepository.GetAllSitesAsync` for address loading; site records carry `NodeAAddress`, `NodeBAddress`, `GrpcNodeAAddress`, `GrpcNodeBAddress`.
|
||||
- [Deployment Manager (#2)](./DeploymentManager.md) — the primary consumer of the instance deployment, instance lifecycle, and artifact deployment command/control patterns. `CommunicationService` is injected into the Deployment Manager actor to send deployments, lifecycle commands, and artifact deployments to sites.
|
||||
- [Site Runtime (#3)](./SiteRuntime.md) — `SiteCommunicationActor` forwards inbound commands to the `DeploymentManager` singleton proxy. `SiteStreamManager` (in Site Runtime) implements `ISiteStreamSubscriber` so `SiteStreamGrpcServer` can subscribe relay actors to instance event feeds without referencing Site Runtime directly.
|
||||
- [Health Monitoring (#11)](./HealthMonitoring.md) — `CentralCommunicationActor` calls `ICentralHealthAggregator.MarkHeartbeat` and `ProcessReport` for every inbound heartbeat and health report. `DistributedPubSub` fanout keeps both central nodes' aggregators in sync.
|
||||
- [Audit Log (#23)](./AuditLog.md) — `SiteStreamGrpcServer` hosts `IngestAuditEvents`, `IngestCachedTelemetry`, and `PullAuditEvents` RPCs. `CentralCommunicationActor` routes `IngestAuditEventsCommand` / `IngestCachedTelemetryCommand` ClusterClient messages to the `AuditLogIngestActor` proxy.
|
||||
- [Notification Outbox (#21)](./NotificationOutbox.md) — `CentralCommunicationActor` routes `NotificationSubmit` / `NotificationStatusQuery` messages from sites to the `NotificationOutboxActor` proxy. `CommunicationService` Asks the proxy directly for central-UI outbox management calls.
|
||||
- [Site Call Audit (#22)](./SiteCallAudit.md) — `CommunicationService` Asks the `SiteCallAuditActor` proxy directly for query and relay operations. `SiteCallAuditActor` issues `RetryParkedOperation` / `DiscardParkedOperation` relay commands to sites via `SiteEnvelope`; `SiteCommunicationActor` dispatches them to `_parkedMessageHandler`.
|
||||
- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — the site S&F Engine drives `NotificationSubmit` forwarding and cached-call telemetry emission through `SiteCommunicationActor`. Parked-message queries and retry/discard relay commands flow back the other way.
|
||||
- [Management Service (#18)](./ManagementService.md) — `ManagementActor` is registered with `ClusterClientReceptionist` at `/user/management` on central; the CLI connects via its own separate `ClusterClient`. This is a distinct `ClusterClient` usage from the inter-cluster hub-and-spoke connections managed by this component.
|
||||
- Design spec: [Component-Communication.md](../requirements/Component-Communication.md).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### A site's commands fail immediately
|
||||
|
||||
Check that `NodeAAddress` and `NodeBAddress` are populated in the site configuration. If both are empty, `CentralCommunicationActor` logs a warning and skips that site on every refresh, so no `ClusterClient` is created and every command to the site times out. After an address is added, `CommunicationService.RefreshSiteAddresses()` triggers an on-demand refresh.
|
||||
|
||||
### Commands are timing out but the site is reachable
|
||||
|
||||
A single malformed address string for one site can silently prevent `ClusterClient` creation for that site while other sites are unaffected. Check the logs for a `Warning` line from `HandleSiteAddressCacheLoaded` naming the offending site. The actor's parse guard catches the `ActorPath.Parse` exception per site, so the rest of the refresh proceeds.
|
||||
|
||||
A `Warning` from the `Status.Failure` handler in `CentralCommunicationActor` means `LoadSiteAddressesFromDb` itself threw (typically a SQL connection error); the cache is left stale until the next successful refresh.
|
||||
|
||||
### gRPC debug stream drops immediately after opening
|
||||
|
||||
`SiteStreamGrpcServer` rejects `correlation_id` values that contain characters invalid in Akka actor names (`/`, whitespace, etc.) with `StatusCode.InvalidArgument`. Verify that the calling `DebugStreamBridgeActor` generates a safe correlation ID.
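A caller-side guard can catch unsafe IDs before the stream is opened. The sketch below is an assumption, not the check `SiteStreamGrpcServer` performs: `IsSafeCorrelationId` is a hypothetical helper, and its allow-list (ASCII letters, digits, `-`, `_`) is deliberately narrower than what Akka accepts.

```csharp
using System;
using System.Linq;

// Hypothetical helper: accepts only ASCII letters, digits, '-' and '_'.
// This is a conservative subset chosen for illustration, not Akka's full rule set.
static bool IsSafeCorrelationId(string? id) =>
    !string.IsNullOrEmpty(id)
    && id.All(c => c is (>= 'a' and <= 'z') or (>= 'A' and <= 'Z')
                        or (>= '0' and <= '9') or '-' or '_');

// A GUID in "N" format (32 hex digits, no separators) always passes the guard.
var correlationId = Guid.NewGuid().ToString("N");

Console.WriteLine(IsSafeCorrelationId(correlationId));     // True
Console.WriteLine(IsSafeCorrelationId("debug/session 1")); // False
```

Generating the ID as a separator-free GUID sidesteps the problem entirely; the guard is only useful where IDs come from user input or another system.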
|
||||
|
||||
After a site node failover, the `DebugStreamBridgeActor` attempts to reconnect to the other node endpoint (`_useNodeA` flips on each error). If both nodes are unreachable, the actor exhausts its 3-retry budget and calls `onTerminated`. The engineer must restart the debug session.
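The flip-and-retry behaviour can be pictured with a small stand-alone sketch. Everything here is illustrative: `ConnectWithFailover` is a hypothetical function, the endpoints are made up, and the budget is interpreted as one initial attempt plus three retries, which the text above does not confirm.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the reconnect policy: flip endpoints on every
// failure and give up once the retry budget is spent.
static (bool Connected, List<string> Attempts) ConnectWithFailover(
    Func<string, bool> tryConnect, string nodeA, string nodeB, int maxRetries = 3)
{
    var attempts = new List<string>();
    var useNodeA = true;

    // One initial attempt plus up to maxRetries retries (an assumption).
    for (var attempt = 0; attempt <= maxRetries; attempt++)
    {
        var endpoint = useNodeA ? nodeA : nodeB;
        attempts.Add(endpoint);
        if (tryConnect(endpoint))
            return (true, attempts);

        useNodeA = !useNodeA; // flip to the other node after each error
    }

    return (false, attempts); // budget exhausted: the caller would invoke onTerminated
}

// Node A is down, node B answers: the second attempt succeeds.
var result = ConnectWithFailover(ep => ep == "site-b:8083", "site-a:8083", "site-b:8083");
Console.WriteLine($"{result.Connected} after {result.Attempts.Count} attempt(s)");
// prints "True after 2 attempt(s)"
```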
|
||||
|
||||
### Heartbeats arrive but health reports do not
|
||||
|
||||
`SiteCommunicationActor` sends heartbeats and health reports via separate paths. Health reports are sent only when the site's `SiteHealthReportActor` publishes them (every 30 s by default). If heartbeats arrive but reports do not, the health reporting actor on the site may have faulted — check site-side logs for errors in `SiteHealthReportActor`.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Central–Site Communication design specification](../requirements/Component-Communication.md)
|
||||
- [Commons](./Commons.md)
|
||||
- [Cluster Infrastructure](./ClusterInfrastructure.md)
|
||||
- [Configuration Database](./ConfigurationDatabase.md)
|
||||
- [Deployment Manager](./DeploymentManager.md)
|
||||
- [Site Runtime](./SiteRuntime.md)
|
||||
- [Health Monitoring](./HealthMonitoring.md)
|
||||
- [Audit Log](./AuditLog.md)
|
||||
- [Notification Outbox](./NotificationOutbox.md)
|
||||
- [Site Call Audit](./SiteCallAudit.md)
|
||||
- [Store-and-Forward Engine](./StoreAndForward.md)
|
||||
- [Management Service](./ManagementService.md)
|
||||
@@ -0,0 +1,256 @@
|
||||
# Configuration Database
|
||||
|
||||
The Configuration Database component is the exclusive EF Core data-access layer for the central MS SQL configuration store. It owns the `ScadaBridgeDbContext`, every `IEntityTypeConfiguration<T>` Fluent mapping, all repository implementations, the `IAuditService` and `IAuditCorrelationContext` implementations, the `AuditLogPartitionMaintenance` service, and the EF Core migration history. No other component references EF Core or touches the configuration database directly.
|
||||
|
||||
## Overview
|
||||
|
||||
The component lives in `src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase/` and is central-only — site clusters never load it. Its responsibilities break down into four areas:
|
||||
|
||||
- **DbContext + Fluent mappings** — `ScadaBridgeDbContext` maps ~30 Commons POCO entity types to SQL Server using `IEntityTypeConfiguration<T>` classes in `Configurations/`, registered wholesale via `modelBuilder.ApplyConfigurationsFromAssembly(...)`.
|
||||
- **Repository implementations** — eleven scoped repositories implement the interfaces declared in Commons, covering every domain area from template authoring to audit log ingest.
|
||||
- **Config-change audit** — `AuditService` implements `IAuditService`, staging an `AuditLogEntry` into the change tracker so it commits atomically with the entity change; `AuditCorrelationContext` threads a `BundleImportId` through `AsyncLocal<T>` so bundle-import audit rows are correlated without cross-contaminating concurrent import sessions.
|
||||
- **Partition maintenance** — `AuditLogPartitionMaintenance` implements `IPartitionMaintenance`, rolling `pf_AuditLog_Month` forward by issuing `ALTER PARTITION FUNCTION … SPLIT RANGE` for each missing future monthly boundary.
|
||||
|
||||
The single DI entry point is `ServiceCollectionExtensions.AddConfigurationDatabase(string connectionString)`.
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### Persistence-ignorant Commons entities
|
||||
|
||||
POCO entity classes and repository interfaces are declared in Commons and are entirely free of EF Core attributes. All EF knowledge — column types, max-lengths, indexes, value converters, relationships — lives in the `Configurations/` classes here. Consuming components depend on Commons types only; they never reference this project or EF Core directly.
|
||||
|
||||
### Secret-column encryption
|
||||
|
||||
Three columns carry authentication secrets: `SmtpConfiguration.Credentials`, `ExternalSystemDefinition.AuthConfiguration`, and `DatabaseConnectionDefinition.ConnectionString`. Each uses `EncryptedStringConverter`, an EF `ValueConverter<string?, string?>` that wraps ASP.NET Data Protection. The protector is purpose-scoped to `"ZB.MOM.WW.ScadaBridge.ConfigurationDatabase.EncryptedColumn"` and its key ring is persisted to the database itself (via `IDataProtectionKeyContext`), so both central nodes share one key ring and can read each other's writes.
|
||||
|
||||
`ScadaBridgeDbContext` accepts two constructors: the `(DbContextOptions)` single-argument form used by design-time EF tooling, and the `(DbContextOptions, IDataProtectionProvider)` form used at runtime. The runtime form encrypts; the design-time form substitutes a `SchemaOnlyDataProtector` that produces the same column schema but throws `InvalidOperationException` on any actual read or write, preventing silent encryption with a throwaway key. `AddConfigurationDatabase` always registers the runtime overload:
|
||||
|
||||
```csharp
|
||||
// ServiceCollectionExtensions.AddConfigurationDatabase (excerpt)
|
||||
services.AddScoped(serviceProvider =>
|
||||
{
|
||||
var options = serviceProvider.GetRequiredService<DbContextOptions<ScadaBridgeDbContext>>();
|
||||
var protectionProvider = serviceProvider.GetRequiredService<IDataProtectionProvider>();
|
||||
return new ScadaBridgeDbContext(options, protectionProvider);
|
||||
});
|
||||
|
||||
services.AddDataProtection()
|
||||
.PersistKeysToDbContext<ScadaBridgeDbContext>();
|
||||
```
|
||||
|
||||
A `SecretAwareModelCacheKeyFactory` folds `HasSecretEncryptionProvider` into the EF model cache key so a provider-bearing and a schema-only context never share a cached model.
|
||||
|
||||
### Append-only AuditLog and DB-role enforcement
|
||||
|
||||
The central `dbo.AuditLog` table has two dedicated SQL Server roles:
|
||||
|
||||
| Role | Grants |
|
||||
|------|--------|
|
||||
| `scadabridge_audit_writer` | `INSERT`, `SELECT` on `AuditLog` only — no `UPDATE`, no `DELETE` |
|
||||
| `scadabridge_audit_purger` | `ALTER ON SCHEMA::dbo` (required for `SPLIT RANGE` and partition switch-out) |
|
||||
|
||||
Row-level `DELETE` on `AuditLog` is not granted even to the purge role; retention is always a partition switch, never a row delete.
|
||||
|
||||
## Architecture
|
||||
|
||||
### DbContext
|
||||
|
||||
`ScadaBridgeDbContext` exposes one `DbSet<T>` per mapped entity — templates, instances, sites, data connections, external systems, notifications, shared scripts, security mappings, deployment records, API methods, `AuditLogEntry`, `AuditLogRow` (the `dbo.AuditLog` persistence shape), `SiteCall`, and `DataProtectionKey`. `OnModelCreating` delegates all mapping to the `Configurations/` assembly scan, then applies secret-column encryption and strips computed-column SQL for non-SQL-Server providers (so integration tests using SQLite can still call `EnsureCreated`).
|
||||
|
||||
### Fluent API entity configurations
|
||||
|
||||
Each entity has its own `IEntityTypeConfiguration<T>` in `Configurations/`. Representative examples:
|
||||
|
||||
**`AuditLogEntityTypeConfiguration`** maps `AuditLogRow` to `dbo.AuditLog`. The table carries ten writable canonical columns plus six read-only server-side computed columns derived from `DetailsJson` via `JSON_VALUE … PERSISTED`. EF marks the computed columns `ValueGeneratedOnAddOrUpdate()` and never writes them; the repository writes only the ten canonical columns and lets SQL Server derive the rest:
|
||||
|
||||
```csharp
|
||||
// AuditLogEntityTypeConfiguration (excerpt)
|
||||
builder.Property(e => e.Kind)
|
||||
.HasConversion<string>()
|
||||
.HasMaxLength(32)
|
||||
.IsUnicode(false)
|
||||
.HasComputedColumnSql("JSON_VALUE(DetailsJson,'$.kind')", stored: true)
|
||||
.ValueGeneratedOnAddOrUpdate()
|
||||
.IsRequired();
|
||||
|
||||
builder.Property(e => e.ExecutionId)
|
||||
.HasComputedColumnSql(
|
||||
"CAST(JSON_VALUE(DetailsJson,'$.executionId') AS uniqueidentifier)", stored: true)
|
||||
.ValueGeneratedOnAddOrUpdate();
|
||||
|
||||
// Composite PK includes OccurredAtUtc for partition alignment
|
||||
builder.HasKey(e => new { e.EventId, e.OccurredAtUtc });
|
||||
|
||||
builder.HasIndex(e => e.EventId).IsUnique()
|
||||
.HasDatabaseName("UX_AuditLog_EventId");
|
||||
```
|
||||
|
||||
**`TemplateConfiguration`** (representative of the domain-area configs) sets up the self-referencing parent FK, folder FK, cascade-delete relationships to attributes/alarms/scripts/compositions/native alarm sources, and the filtered unique index that enforces name uniqueness only on non-derived (base) templates.
|
||||
|
||||
**`SiteCallEntityTypeConfiguration`** maps `SiteCall` to `dbo.SiteCalls` with a `TrackedOperationId` PK stored as `varchar(36)` (GUID in `"D"` format) so the column shape matches the wire format and the site SQLite store — one consistent format for operational debugging.
|
||||
|
||||
### Repository implementations
|
||||
|
||||
All eleven repositories follow the same shape: they take `ScadaBridgeDbContext` by constructor injection, work with Commons POCO types, and never commit — callers invoke `SaveChangesAsync()` to commit the unit of work.
|
||||
|
||||
**`AuditLogRepository`** is the most specialized. Its `InsertIfNotExistsAsync` bypasses the change tracker and issues raw interpolated SQL because the computed columns must not appear in the INSERT column list:
|
||||
|
||||
```csharp
|
||||
// AuditLogRepository.InsertIfNotExistsAsync (excerpt)
|
||||
await _context.Database.ExecuteSqlInterpolatedAsync(
|
||||
$@"IF NOT EXISTS (SELECT 1 FROM dbo.AuditLog WHERE EventId = {evt.EventId})
|
||||
INSERT INTO dbo.AuditLog
|
||||
(EventId, OccurredAtUtc, Actor, Action, Outcome, Category, Target, SourceNode, CorrelationId, DetailsJson)
|
||||
VALUES
|
||||
({evt.EventId}, {occurred}, {actor}, {evt.Action}, {outcome}, {category}, {evt.Target}, {evt.SourceNode}, {evt.CorrelationId}, {evt.DetailsJson});",
|
||||
ct);
|
||||
```
|
||||
|
||||
`FormattableString` interpolation parameterises every value so there is no injection surface. SQL error numbers `2601` and `2627` (unique-index and unique-constraint violations) are swallowed as no-ops because the `IF NOT EXISTS` check has a race window: an insert that loses the race, or a retried telemetry delivery, is a legitimate duplicate of a row that already exists.
|
||||
|
||||
`QueryAsync` builds LINQ predicates over `AuditLogRow` using `AsNoTracking()`, translating filter dimensions (`Channels`, `Kinds`, `Statuses`, `SourceSiteIds`, `SourceNodes`, `ExecutionId`, `ParentExecutionId`, time range) to server-side SQL IN/equality predicates and using keyset pagination on `(OccurredAtUtc DESC, EventId DESC)`.
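The keyset predicate is easier to see in isolation. The following is a minimal in-memory sketch, not the repository code: `NextPage` is a hypothetical function, rows are reduced to the two sort keys, and `Guid.CompareTo` is used for the tie-break even though SQL Server orders `uniqueidentifier` values differently and EF Core may need a different expression to translate it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for the AuditLog rows; only the two sort keys matter here.
static List<(DateTime OccurredAtUtc, Guid EventId)> NextPage(
    IEnumerable<(DateTime OccurredAtUtc, Guid EventId)> rows,
    (DateTime OccurredAtUtc, Guid EventId)? after,
    int pageSize)
{
    var query = rows;
    if (after.HasValue)
    {
        var cursor = after.Value;
        // Keyset predicate: strictly "older" than the last row of the previous page,
        // with EventId breaking ties between rows that share a timestamp.
        query = query.Where(r =>
            r.OccurredAtUtc < cursor.OccurredAtUtc
            || (r.OccurredAtUtc == cursor.OccurredAtUtc
                && r.EventId.CompareTo(cursor.EventId) < 0));
    }

    return query
        .OrderByDescending(r => r.OccurredAtUtc)
        .ThenByDescending(r => r.EventId)
        .Take(pageSize)
        .ToList();
}

// Five rows, some sharing a timestamp, paged two at a time.
var start = new DateTime(2025, 1, 1, 0, 0, 0, DateTimeKind.Utc);
var rows = Enumerable.Range(1, 5)
    .Select(i => (OccurredAtUtc: start.AddMinutes(i / 2),
                  EventId: new Guid(i, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
    .ToList();

var page1 = NextPage(rows, null, 2);
var page2 = NextPage(rows, page1[^1], 2); // the cursor is the last row of page 1
Console.WriteLine($"{page1.Count} + {page2.Count} rows fetched");
```

Unlike `OFFSET` paging, the cursor keeps each page query cheap regardless of how deep the reader has scrolled, and rows inserted at the head do not shift later pages.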
|
||||
|
||||
`GetExecutionTreeAsync` walks the `ParentExecutionId` graph in two phases: a loop climbs to the root (bounded at 32 levels), then a recursive CTE descends the full tree and LEFT JOINs back to `AuditLog` so stub nodes (purged or row-less executions) still appear with `RowCount = 0`.
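The first phase, the bounded climb, can be sketched against an in-memory parent map. `FindRoot` and the dictionary are hypothetical; what the real method does when the 32-level bound is hit is not stated above, so returning the last node reached is an assumption.

```csharp
using System;
using System.Collections.Generic;

// parentOf maps an execution ID to its parent; a missing or null entry marks the root.
static Guid FindRoot(Guid start, IReadOnlyDictionary<Guid, Guid?> parentOf, int maxLevels = 32)
{
    var current = start;
    for (var level = 0; level < maxLevels; level++)
    {
        if (!parentOf.TryGetValue(current, out var parent) || parent is null)
            return current; // no parent recorded: this is the root

        current = parent.Value;
    }

    // Bound reached: stop climbing. This guards against cycles and runaway depth.
    return current;
}

var leaf = Guid.NewGuid();
var middle = Guid.NewGuid();
var root = Guid.NewGuid();
var parents = new Dictionary<Guid, Guid?> { [leaf] = middle, [middle] = root, [root] = null };

Console.WriteLine(FindRoot(leaf, parents) == root); // True
```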
|
||||
|
||||
`SwitchOutPartitionAsync` executes a drop-and-rebuild dance — dropping `UX_AuditLog_EventId`, creating a byte-identical staging table (including the computed-column definitions), switching the target partition to staging, dropping staging, and rebuilding the unique index — all inside a single `BEGIN TRY / BEGIN CATCH` block that guarantees the index is present whether the switch succeeds or rolls back.
|
||||
|
||||
### IAuditService — config-change audit
|
||||
|
||||
`AuditService` implements `IAuditService`, called by consuming components after each successful entity mutation. It constructs an `AuditLogEntry` with `Timestamp = DateTimeOffset.UtcNow`, serialises `afterState` to JSON tolerating reference cycles and capping depth at 32 to avoid unbounded payloads, stamps `BundleImportId` from the active `IAuditCorrelationContext`, and adds the entry to the change tracker only — the caller's `SaveChangesAsync()` commits the entry and the entity change atomically:
|
||||
|
||||
```csharp
|
||||
// AuditService.LogAsync (excerpt)
|
||||
var entry = new AuditLogEntry(user, action, entityType, entityId, entityName)
|
||||
{
|
||||
Timestamp = DateTimeOffset.UtcNow,
|
||||
AfterStateJson = afterState != null ? SerializeAfterState(afterState) : null,
|
||||
BundleImportId = _correlationContext.BundleImportId
|
||||
};
|
||||
|
||||
await _context.AuditLogEntries.AddAsync(entry, cancellationToken);
|
||||
```
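How `SerializeAfterState` is implemented is not shown here. One way to get cycle tolerance and a depth cap with `System.Text.Json` is sketched below, as an assumption rather than the component's actual code; `Node` is a hypothetical entity shape. Note that in `System.Text.Json` a graph deeper than `MaxDepth` throws `JsonException` rather than being truncated.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

var options = new JsonSerializerOptions
{
    ReferenceHandler = ReferenceHandler.IgnoreCycles, // a back-reference is written as null
    MaxDepth = 32                                     // deeper graphs throw JsonException
};

// Build a two-node cycle: parent -> child -> parent.
var parent = new Node { Name = "parent" };
parent.Child = new Node { Name = "child", Parent = parent };

var json = JsonSerializer.Serialize(parent, options);
Console.WriteLine(json); // the child's Parent property is emitted as null instead of looping

// Hypothetical entity shape used only for this illustration.
class Node
{
    public string Name { get; set; } = "";
    public Node? Parent { get; set; }
    public Node? Child { get; set; }
}
```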
|
||||
|
||||
`AuditCorrelationContext` backs `BundleImportId` with `AsyncLocal<Guid?>` so each logical call chain — each distinct bundle import invocation — carries its own value even when two imports share a DI scope. It is registered as scoped (to participate in the DI graph) but its in-memory state is per-call-context.
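The per-call-chain isolation that `AsyncLocal<T>` provides can be demonstrated without any DI at all. This is a minimal sketch; `RunImportAsync` and the shared `bundleImportId` holder are hypothetical stand-ins for the real context class.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// One shared holder, but the value each async flow sees is its own.
var bundleImportId = new AsyncLocal<Guid?>();

async Task<bool> RunImportAsync(Guid id)
{
    bundleImportId.Value = id;   // visible only to this flow and its children
    await Task.Delay(20);        // let the other import interleave
    return bundleImportId.Value == id;
}

var results = await Task.WhenAll(
    RunImportAsync(Guid.NewGuid()),
    RunImportAsync(Guid.NewGuid()));

Console.WriteLine(results[0] && results[1]);     // True: neither import saw the other's ID
Console.WriteLine(bundleImportId.Value is null); // True: the caller's context is untouched
```

Because a value set inside an async method does not flow back to its caller, the context never needs an explicit reset between imports in this pattern.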
|
||||
|
||||
### Partition maintenance
|
||||
|
||||
`AuditLogPartitionMaintenance` implements `IPartitionMaintenance`. On each tick (driven by the `AuditLogPartitionMaintenanceService` hosted service in the Audit Log component) it reads the current max boundary from `sys.partition_range_values`, then issues `ALTER PARTITION SCHEME … NEXT USED` followed by `ALTER PARTITION FUNCTION … SPLIT RANGE` for each missing month up to the lookahead horizon. The NEXT USED re-issue before every SPLIT is required because SQL Server consumes the flag after the first split. A SPLIT failure propagates (rather than being swallowed) so a failed month blocks subsequent months and the next tick retries from the same boundary — no partition holes.
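The "each missing month up to the lookahead horizon" calculation can be sketched as pure date arithmetic. `MissingMonthlyBoundaries` is a hypothetical helper, and the horizon definition (first day of the current month plus the lookahead) is an assumption; the real service may define it differently.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Returns the month-start boundaries that still need a SPLIT RANGE, oldest first.
static List<DateTime> MissingMonthlyBoundaries(
    DateTime currentMaxBoundary, DateTime nowUtc, int lookaheadMonths)
{
    // Assumed horizon: first day of the current month, pushed out by the lookahead.
    var horizon = new DateTime(nowUtc.Year, nowUtc.Month, 1, 0, 0, 0, DateTimeKind.Utc)
        .AddMonths(lookaheadMonths);

    var boundaries = new List<DateTime>();
    for (var next = currentMaxBoundary.AddMonths(1); next <= horizon; next = next.AddMonths(1))
        boundaries.Add(next);

    return boundaries;
}

var toSplit = MissingMonthlyBoundaries(
    currentMaxBoundary: new DateTime(2025, 3, 1, 0, 0, 0, DateTimeKind.Utc),
    nowUtc: new DateTime(2025, 2, 14, 0, 0, 0, DateTimeKind.Utc),
    lookaheadMonths: 3);

foreach (var boundary in toSplit)
    Console.WriteLine(boundary.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));
// prints 2025-04-01 then 2025-05-01
```

Processing the list oldest first and stopping at the first failure is what keeps the boundaries contiguous: the next tick recomputes the same list from the unchanged maximum boundary.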
|
||||
|
||||
## Usage
|
||||
|
||||
### Registration
|
||||
|
||||
The Host calls `AddConfigurationDatabase` once, passing the `ScadaBridge:Database:ConfigurationDb` connection string:
|
||||
|
||||
```csharp
|
||||
// Host composition root (excerpt)
|
||||
services.AddConfigurationDatabase(
|
||||
configuration["ScadaBridge:Database:ConfigurationDb"]!);
|
||||
```
|
||||
|
||||
This registers `ScadaBridgeDbContext` as scoped (with the runtime encryption overload), all eleven repository interfaces bound to their implementations, `IAuditCorrelationContext` → `AuditCorrelationContext`, `IAuditService` → `AuditService`, `IInstanceLocator` → `InstanceLocator`, `IPartitionMaintenance` → `AuditLogPartitionMaintenance`, and the Data Protection key ring persisted to the database.
|
||||
|
||||
The obsolete zero-argument overload is marked `[Obsolete]` with `error: true`, so any new call site fails to compile, and its body throws `InvalidOperationException` if it is still reached at startup. Either way, a misconfigured host fails fast with a clear message rather than silently producing an empty DI registration.
|
||||
|
||||
### Consuming a repository
|
||||
|
||||
Consuming components resolve the Commons interface through DI and never reference this project:
|
||||
|
||||
```csharp
|
||||
// Example: TemplateEngineRepository usage pattern
|
||||
public class SomeManagementHandler
|
||||
{
|
||||
private readonly ITemplateEngineRepository _repo;
|
||||
private readonly IAuditService _audit;
|
||||
|
||||
public async Task CreateTemplateAsync(Template template, string user, CancellationToken ct)
|
||||
{
|
||||
await _repo.AddTemplateAsync(template, ct);
|
||||
await _audit.LogAsync(user, "Create", "Template",
|
||||
template.Id.ToString(), template.Name, template, ct);
|
||||
await _repo.SaveChangesAsync(ct); // single transaction
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Repository `Add`/`Update`/`Delete` calls only stage changes on the change tracker. `SaveChangesAsync` on the context (exposed via the repository or accessed directly) is the unit-of-work commit.
|
||||
|
||||
### Migration management
|
||||
|
||||
`MigrationHelper.ApplyOrValidateMigrationsAsync` is called at startup after the `ScadaBridgeDbContext` is resolved. It first polls `CanConnectAsync` every 2 seconds for up to 60 seconds (to ride out MSSQL container recovery lag), then:
|
||||
|
||||
- **Development** (`isDevelopment = true`): calls `dbContext.Database.MigrateAsync()` to auto-apply all pending migrations.
|
||||
- **Production** (`isDevelopment = false`): calls `GetPendingMigrationsAsync()` and throws `InvalidOperationException` listing the pending migration names if any are outstanding. The host does not start until the schema is current.
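The connectivity wait that precedes both branches can be sketched as a plain polling loop. `WaitForDatabaseAsync` is a hypothetical helper and the `canConnect` delegate stands in for `dbContext.Database.CanConnectAsync`; the demo uses millisecond timings so it finishes quickly.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Polls until the probe succeeds, or throws once another wait would pass the deadline.
static async Task<int> WaitForDatabaseAsync(
    Func<CancellationToken, Task<bool>> canConnect,
    TimeSpan interval, TimeSpan timeout, CancellationToken ct = default)
{
    var stopwatch = Stopwatch.StartNew();
    var attempts = 0;
    while (true)
    {
        attempts++;
        if (await canConnect(ct))
            return attempts;

        if (stopwatch.Elapsed + interval > timeout)
            throw new InvalidOperationException(
                $"Database not reachable after {stopwatch.Elapsed.TotalSeconds:F0} s " +
                $"and {attempts} attempt(s).");

        await Task.Delay(interval, ct);
    }
}

// Simulated database that becomes reachable on the third probe.
var probes = 0;
var attemptsNeeded = await WaitForDatabaseAsync(
    _ => Task.FromResult(++probes >= 3),
    interval: TimeSpan.FromMilliseconds(10),
    timeout: TimeSpan.FromSeconds(1));

Console.WriteLine(attemptsNeeded); // 3
```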
|
||||
|
||||
Design-time tooling uses `DesignTimeDbContextFactory`, which reads the connection string from `ScadaBridge:Database:ConfigurationDb` in the Host's `appsettings.json` or from the `SCADABRIDGE_DESIGNTIME_CONNECTIONSTRING` environment variable. No hardcoded fallback exists — a missing connection string fails with an actionable message.
|
||||
|
||||
To generate production SQL scripts:
|
||||
|
||||
```bash
|
||||
# All pending migrations as an idempotent script
|
||||
dotnet ef migrations script --idempotent \
|
||||
--project src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase \
|
||||
--output migration.sql
|
||||
|
||||
# From a specific migration to another
|
||||
dotnet ef migrations script FromMigration ToMigration \
|
||||
--project src/ZB.MOM.WW.ScadaBridge.ConfigurationDatabase \
|
||||
--output migration.sql
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
The connection string is the only configuration this component reads directly. It is passed as an argument to `AddConfigurationDatabase` and sourced from the Host options:
|
||||
|
||||
| Key | Notes |
|
||||
|-----|-------|
|
||||
| `ScadaBridge:Database:ConfigurationDb` | SQL Server connection string. Required; startup fails without it. |
|
||||
|
||||
The `SCADABRIDGE_DESIGNTIME_CONNECTIONSTRING` environment variable is an alternative source for `dotnet ef` tooling only.
|
||||
|
||||
## Dependencies & Interactions
|
||||
|
||||
- [Commons (#16)](./Commons.md) — all POCO entity classes (`Templates`, `Instances`, `Sites`, `AuditLogEntry`, `SiteCall`, …) and all repository interfaces (`ITemplateEngineRepository`, `IDeploymentManagerRepository`, `ISecurityRepository`, `IInboundApiRepository`, `IExternalSystemRepository`, `INotificationRepository`, `INotificationOutboxRepository`, `ISiteCallAuditRepository`, `IAuditLogRepository`, `ICentralUiRepository`) live there. Commons also declares `IAuditService`, `IAuditCorrelationContext`, `IPartitionMaintenance`, and `IInstanceLocator` — all implemented here.
|
||||
- [Audit Log (#23)](./AuditLog.md) — `IAuditLogRepository` (implemented by `AuditLogRepository`) is the sole central write path for `dbo.AuditLog`. `AuditLogIngestActor`, `CentralAuditWriter`, and `SiteAuditReconciliationActor` all resolve it from a fresh per-message DI scope; the Audit Log component hosts the `AuditLogPartitionMaintenanceService` and `AuditLogPurgeActor` that drive the `IPartitionMaintenance` implementation registered here.
|
||||
- [Template Engine (#1)](./Commons.md) — consumes `ITemplateEngineRepository` for all template, attribute, alarm, native alarm source, script, composition, instance, override, connection binding, and area operations.
|
||||
- [Deployment Manager (#2)](./DeploymentManager.md) — consumes `IDeploymentManagerRepository` for deployment records and configuration snapshots.
|
||||
- [Security & Auth (#10)](./Commons.md) — consumes `ISecurityRepository` for LDAP group mappings and site scoping rules.
|
||||
- [Inbound API (#14)](./Commons.md) — consumes `IInboundApiRepository` for API method definitions.
|
||||
- [External System Gateway (#7)](./Commons.md) — consumes `IExternalSystemRepository` for external system and database connection definitions.
|
||||
- [Notification Service (#8)](./Commons.md) — consumes `INotificationRepository` for notification lists, recipients, and SMTP configuration.
|
||||
- [Notification Outbox (#21)](./NotificationOutbox.md) — consumes `INotificationOutboxRepository` for `dbo.Notifications` ingest, dispatcher polling, status transitions, KPI queries, and bulk purge of terminal rows.
|
||||
- [Site Call Audit (#22)](./SiteCallAudit.md) — consumes `ISiteCallAuditRepository` for `dbo.SiteCalls` ingest, KPI queries, and bulk purge of terminal rows.
|
||||
- [Central UI (#9)](./Commons.md) — consumes `ICentralUiRepository` for read-oriented cross-domain queries and the configuration audit log viewer.
|
||||
- [Host (#15)](./Commons.md) — provides the connection string, calls `AddConfigurationDatabase`, and invokes `MigrationHelper.ApplyOrValidateMigrationsAsync` at startup.
|
||||
- All central components that modify configuration state — call `IAuditService.LogAsync()` and then `SaveChangesAsync()` so audit entries commit atomically with entity changes.
|
||||
- Design spec: [Component-ConfigurationDatabase.md](../requirements/Component-ConfigurationDatabase.md)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Startup fails with "Database schema is out of date"
|
||||
|
||||
The host is running in production mode and `GetPendingMigrationsAsync` found unapplied migrations. Generate the idempotent SQL script (`dotnet ef migrations script --idempotent`) and apply it via SSMS before restarting the host.
|
||||
|
||||
### Startup stalls waiting for the database
|
||||
|
||||
`MigrationHelper` polls `CanConnectAsync` every 2 seconds for up to 60 seconds. If the 60-second deadline elapses the host throws `InvalidOperationException` naming the elapsed time and attempt count. Common causes: SQL Server container still in recovery, wrong connection string, database not yet attached.
|
||||
|
||||
### "Failed to decrypt an encrypted configuration column"
|
||||
|
||||
`EncryptedStringConverter.Unprotect` caught a `CryptographicException`. The Data Protection key ring is unavailable (keys deleted or the database was restored from a backup without the key rows) or the row was written by a different key ring. Restore the `DataProtectionKeys` table rows from a backup or re-provision the key ring and re-encrypt the affected column values.
|
||||
|
||||
### AuditLog partition switch fails mid-operation
|
||||
|
||||
`SwitchOutPartitionAsync` wraps the drop-and-rebuild dance in `BEGIN TRY / BEGIN CATCH`. On failure the CATCH block drops the staging table if it exists and rebuilds `UX_AuditLog_EventId` if it was dropped before the failure. The original exception is re-thrown so the Audit Log purge actor logs it and retries on the next daily tick. Verify that the `scadabridge_audit_purger` role still holds `ALTER ON SCHEMA::dbo` if the operation fails with a permissions error.
|
||||
|
||||
### Design-time `dotnet ef` tooling cannot find a connection string
|
||||
|
||||
Set `ScadaBridge:Database:ConfigurationDb` in the Host's `appsettings.json` (the factory looks for `../ZB.MOM.WW.ScadaBridge.Host` relative to the project directory) or export `SCADABRIDGE_DESIGNTIME_CONNECTIONSTRING`.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Configuration Database design specification](../requirements/Component-ConfigurationDatabase.md)
|
||||
- [Audit Log](./AuditLog.md)
|
||||
- [Notification Outbox](./NotificationOutbox.md)
|
||||
- [Site Call Audit](./SiteCallAudit.md)
|
||||
- [Commons](./Commons.md)
|
||||
@@ -0,0 +1,255 @@
|
||||
# Host
|
||||
|
||||
The Host is the single deployable binary for ScadaBridge. The same executable runs on every node — central and site alike — and selects its component set entirely from configuration, with no separate build targets or conditional compilation.
|
||||
|
||||
## Overview
|
||||
|
||||
Host (#15) is the composition root: it reads `ScadaBridge:Node:Role` from `appsettings.json` (layered with a role-specific override file selected by the `SCADABRIDGE_CONFIG` environment variable), runs pre-DI startup validation, wires every applicable component into the DI container and Akka.NET actor system, and then hands off to ASP.NET Core's `WebApplication` host.
|
||||
|
||||
The component code lives in `src/ZB.MOM.WW.ScadaBridge.Host/`, split across:
|
||||
|
||||
- `Program.cs` — the entry point: configuration loading, `StartupValidator`, role-branched DI registration, Kestrel setup, middleware pipeline, and endpoint mapping.
|
||||
- `Actors/AkkaHostedService.cs` — owns the `ActorSystem` lifetime; builds HOCON from bound options; registers role-specific actors as cluster singletons or plain `ActorOf` calls.
|
||||
- `Actors/DeadLetterMonitorActor.cs` — subscribes to the `DeadLetter` event stream and increments the health metric.
|
||||
- `Health/ActiveNodeGate.cs` — production `IActiveNodeGate` backed by Akka cluster leadership; used by the Inbound API endpoint filter to gate traffic on standby nodes.
|
||||
- `Health/AkkaClusterNodeProvider.cs` — feeds `IClusterNodeProvider` from live Akka cluster membership for health reporting.
|
||||
- `SiteServiceRegistration.cs` — extracted site-role DI registrations reused by both `Program.cs` and integration test harnesses.
|
||||
- `StartupValidator.cs` — pre-DI configuration preflight that fails fast before any actor system is created.
|
||||
- `StartupRetry.cs` — bounded exponential-backoff helper for startup preconditions (database migrations).
|
||||
- `LoggerConfigurationFactory.cs` — builds the Serilog `LoggerConfiguration` with node-identity enrichment.
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### Role selection via `SCADABRIDGE_CONFIG`
|
||||
|
||||
The configuration builder layers `appsettings.json`, then `appsettings.{SCADABRIDGE_CONFIG}.json`. The `SCADABRIDGE_CONFIG` environment variable selects the role-specific file (`Central` or `Site`); when it is absent, the builder falls back to `DOTNET_ENVIRONMENT`, and then to `Production`. Because role selection has its own variable, `DOTNET_ENVIRONMENT`/`ASPNETCORE_ENVIRONMENT` can stay `Development` for dev tooling (static assets, EF migrations) regardless of which role is active.
|
||||
|
||||
```csharp
|
||||
var scadabridgeConfig = Environment.GetEnvironmentVariable("SCADABRIDGE_CONFIG")
|
||||
?? Environment.GetEnvironmentVariable("DOTNET_ENVIRONMENT")
|
||||
?? "Production";
|
||||
|
||||
var configuration = new ConfigurationBuilder()
|
||||
.AddJsonFile("appsettings.json", optional: false)
|
||||
.AddJsonFile($"appsettings.{scadabridgeConfig}.json", optional: true)
|
||||
.AddEnvironmentVariables()
|
||||
.AddCommandLine(args)
|
||||
.Build();
|
||||
```
|
||||
|
||||
The resolved `ScadaBridge:Node:Role` value then branches the entire DI and Akka bootstrap.
|
||||
|
||||
### Pre-DI startup validation
|
||||
|
||||
`StartupValidator.Validate` runs before any DI or actor system setup. It collects every error, then throws a single `InvalidOperationException` listing all of them. This avoids the confusing partial-startup failures that occur when validation is deferred until a service is first resolved. Site nodes additionally validate that `GrpcPort`, `MetricsPort`, and `RemotingPort` are all distinct and that no seed-node entry points at the gRPC port.
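The collect-then-throw pattern, applied to the site port rules, might look like the sketch below. `ValidateSitePorts` is a hypothetical reduction of the preflight, and the error wording is invented; only the rules themselves come from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Collects every problem first, then throws once with the full list.
static void ValidateSitePorts(
    int grpcPort, int metricsPort, int remotingPort, IEnumerable<string> seedNodes)
{
    var errors = new List<string>();

    var ports = new[] { grpcPort, metricsPort, remotingPort };
    if (ports.Distinct().Count() != ports.Length)
        errors.Add("GrpcPort, MetricsPort, and RemotingPort must all be distinct.");

    foreach (var seed in seedNodes)
    {
        // Seed nodes look like akka.tcp://scadabridge@host:port
        if (Uri.TryCreate(seed, UriKind.Absolute, out var uri) && uri.Port == grpcPort)
            errors.Add($"Seed node '{seed}' points at the gRPC port {grpcPort}.");
    }

    if (errors.Count > 0)
        throw new InvalidOperationException(
            "Startup validation failed:" + Environment.NewLine
            + string.Join(Environment.NewLine, errors));
}

try
{
    // Two mistakes at once: a duplicated port and a seed node aimed at gRPC.
    ValidateSitePorts(8083, 8083, 8081, new[] { "akka.tcp://scadabridge@site-a:8083" });
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message); // both problems are reported in one message
}
```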
|
||||
|
||||
### Akka HOCON construction
|
||||
|
||||
`AkkaHostedService.BuildHocon` assembles the HOCON configuration document from strongly-typed options rather than inline strings. Every interpolated value passes through `QuoteHocon` (escapes backslashes and double-quotes) to prevent a hostname, seed-node URI, or split-brain strategy value from corrupting the document. Durations are rendered in milliseconds (`DurationHocon`) so sub-second timing values (e.g. a 750 ms heartbeat) are preserved exactly.
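The two helpers are small enough to sketch. These are guesses at their shape based only on the description above: the real signatures, and whether `QuoteHocon` itself adds the surrounding quotes, are assumptions.

```csharp
using System;
using System.Globalization;

// Escape backslashes first, then double quotes, and wrap the result in quotes.
static string QuoteHocon(string value) =>
    "\"" + value.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";

// Render a duration as whole milliseconds so sub-second values survive intact.
static string DurationHocon(TimeSpan duration) =>
    ((long)duration.TotalMilliseconds).ToString(CultureInfo.InvariantCulture) + "ms";

Console.WriteLine(QuoteHocon("node-\"a\""));                      // "node-\"a\""
Console.WriteLine(DurationHocon(TimeSpan.FromMilliseconds(750))); // 750ms
Console.WriteLine(DurationHocon(TimeSpan.FromSeconds(2)));        // 2000ms
```

Escaping backslashes before quotes matters: in the reverse order, the backslash added for each quote would itself be doubled and the quote would end the string early.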
|
||||
|
||||
The actor system name is always `scadabridge`. Site nodes carry two cluster roles: the generic `"Site"` role and a per-site role (`"site-{SiteId}"`) used to scope cluster singletons to a specific site.
|
||||
|
||||
### `/health/ready` — readiness gating
|
||||
|
||||
Central nodes register `DatabaseHealthCheck<ScadaBridgeDbContext>` (tagged `Ready`) and `AkkaClusterHealthCheck` (tagged `Ready`). The `/health/ready` endpoint returns 200 only when both pass. Readiness is explicitly not tied to cluster leadership: a fully operational standby central node still reports ready because `ActiveNodeHealthCheck` carries only the `Active` tag, not `Ready`.
|
||||
|
||||
Load balancers and orchestrators should poll `/health/ready` to determine when a freshly started or failed-over node can receive traffic.
|
||||
|
||||
### `/health/active` — active-node routing for Traefik
|
||||
|
||||
`ActiveNodeHealthCheck` carries the `Active` tag and is served at `/health/active`. It returns 200 only on the cluster leader. Traefik polls this endpoint and routes inbound traffic — Central UI, Inbound API, Management API — exclusively to the node that answers 200. See [TraefikProxy](./TraefikProxy.md) for the upstream routing rules.
|
||||
|
||||
The same leadership check backs `ActiveNodeGate`, the `IActiveNodeGate` implementation the Inbound API endpoint filter consults before executing a method script. A standby node therefore refuses inbound API calls even if traffic somehow reaches it directly.
|
||||
|
||||
```csharp
|
||||
public bool IsActiveNode
|
||||
{
|
||||
get
|
||||
{
|
||||
var system = _akkaService.ActorSystem;
|
||||
if (system == null)
|
||||
return false;
|
||||
|
||||
var cluster = Cluster.Get(system);
|
||||
var self = cluster.SelfMember;
|
||||
if (self.Status != MemberStatus.Up)
|
||||
return false;
|
||||
|
||||
var leader = cluster.State.Leader;
|
||||
return leader != null && leader == self.Address;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Architecture
|
||||
|
||||
### Central composition root
|
||||
|
||||
`Program.cs` (Central branch) calls `WebApplication.CreateBuilder`, registers shared and central-only components, builds the `WebApplication`, applies or validates database migrations (with retry), and mounts the middleware pipeline and endpoints. The middleware order is intentional: `UseAuthentication` and `UseAuthorization` run before `UseAuditWriteMiddleware` so `HttpContext.User` is populated when the audit row is written.
|
||||
|
||||
`AkkaHostedService.RegisterCentralActors` creates:
|
||||
- `CentralCommunicationActor` — registered with `ClusterClientReceptionist` so site `ClusterClient`s can reach it.
|
||||
- `ManagementActor` — also registered with `ClusterClientReceptionist`; the CLI connects via `ClusterClient` without joining the cluster.
|
||||
- `NotificationOutboxActor` — cluster singleton (no role scope); a proxy is handed to `CentralCommunicationActor` so forwarded `NotificationSubmit` messages from sites are routed to it.
|
||||
- `AuditLogIngestActor` — cluster singleton; proxy registered with both `CentralCommunicationActor` and (if present) the `SiteStreamGrpcServer`.
|
||||
- `SiteCallAuditActor` — cluster singleton; a graceful-stop task is added to the `cluster-leave` coordinated-shutdown phase with a 10-second drain window.
|
||||
- `DeadLetterMonitorActor` — plain `ActorOf`; subscribes to the `DeadLetter` event stream on `PreStart`.
|
||||
|
||||
### Site composition root
|
||||
|
||||
`Program.cs` (Site branch) calls `WebApplication.CreateBuilder` with a Kestrel configuration that binds two listeners: HTTP/2 only on `GrpcPort` (default 8083) for the gRPC server, and HTTP/1.1 plus HTTP/2 on `MetricsPort` (default 8084) for the Prometheus `/metrics` scrape endpoint. The separation exists because a standard HTTP/1.1 Prometheus scraper cannot negotiate HTTP/2, while the gRPC listener must stay pure HTTP/2.
|
||||
|
||||
`SiteServiceRegistration.Configure` registers the site-only components. `AkkaHostedService.RegisterSiteActorsAsync` creates:
|
||||
- `DeploymentManagerActor` — cluster singleton scoped to `"site-{SiteId}"`.
|
||||
- `SiteCommunicationActor` — registered with `ClusterClientReceptionist`; creates a `ClusterClient` to configured central contact points.
|
||||
- `SiteReplicationActor` — one per node (not a singleton); handles best-effort S&F replication to the standby.
|
||||
- `EventLogHandlerActor` — cluster singleton scoped to `"site-{SiteId}"`.
|
||||
- `ParkedMessageHandlerActor` — bridges Akka to `StoreAndForwardService`.
|
||||
- `SiteAuditTelemetryActor` — created on a dedicated `audit-telemetry-dispatcher` (2-thread `ForkJoinDispatcher`) so SQLite reads and gRPC pushes never contend with hot-path actors.
|
||||
- `DataConnectionManagerActor` — if `IDataConnectionFactory` is registered.
|
||||
|
||||
Shutdown ordering for the site role is explicit: `IHostApplicationLifetime.ApplicationStopping` fires before `IHostedService.StopAsync`, so `SiteStreamGrpcServer.CancelAllStreams` is called first (clients observe a clean cancellation and reconnect), then `AkkaHostedService` runs `CoordinatedShutdown` and tears down actors.
|
||||
|
||||
```csharp
siteLifetime.ApplicationStopping.Register(() => siteGrpcServer.CancelAllStreams());
```
|
||||
|
||||
### Database migration retry
|
||||
|
||||
On central nodes, `StartupRetry.ExecuteWithRetryAsync` wraps the migration step in up to 8 attempts, with exponential backoff that starts at 2 seconds and is capped at 30 seconds. Only connection-class faults (`SocketException`, `SqlException`, `DbException`, `TimeoutException`) are retried; a schema-version mismatch surfaces as an `InvalidOperationException` and fails immediately. The `ApplicationStopping` token is threaded into both the migration call and the inter-attempt `Task.Delay`, so a SIGTERM during the retry window tears down cleanly.
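
The retry policy described above can be sketched as follows. This is a minimal illustration, not the actual `StartupRetry` source: the class name `StartupRetrySketch`, the method signatures, and the assumption that the delay doubles on each retry are all hypothetical. `SqlException` is not listed separately because it derives from `DbException`.

```csharp
using System;
using System.Data.Common;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for StartupRetry; names and signatures are assumptions.
public static class StartupRetrySketch
{
    // Delay before the Nth retry (1-based), assuming doubling: 2s, 4s, 8s, 16s, then the cap.
    public static TimeSpan DelayFor(int retry, TimeSpan initial, TimeSpan cap)
    {
        var seconds = initial.TotalSeconds * Math.Pow(2, retry - 1);
        return TimeSpan.FromSeconds(Math.Min(seconds, cap.TotalSeconds));
    }

    // Connection-class faults are retried; anything else propagates immediately.
    // SqlException is covered by its base class DbException.
    public static bool IsTransient(Exception ex) =>
        ex is SocketException or DbException or TimeoutException;

    public static async Task ExecuteWithRetryAsync(
        Func<CancellationToken, Task> action,
        int maxAttempts,
        TimeSpan initial,
        TimeSpan cap,
        CancellationToken stopping)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await action(stopping);
                return;
            }
            catch (Exception ex) when (IsTransient(ex) && attempt < maxAttempts)
            {
                // The stopping token also cancels the wait, so shutdown is not delayed.
                await Task.Delay(DelayFor(attempt, initial, cap), stopping);
            }
        }
    }
}
```

Under these assumptions, 8 attempts involve at most 7 waits: 2 + 4 + 8 + 16 + 30 + 30 + 30 = 120 seconds of accumulated backoff, not counting the time each attempt itself takes.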
|
||||
|
||||
## Usage
|
||||
|
||||
The Host is not consumed as a library; it is the executable entry point. Other components expose themselves to the Host via the extension-method convention:
|
||||
|
||||
- `IServiceCollection.AddXxx()` — registers DI services.
|
||||
- `AkkaHostedService.RegisterXxxActors()` / inline `ActorOf` calls in `AkkaHostedService` — registers actors.
|
||||
- `WebApplication.MapXxx()` — maps web endpoints (Central UI, Inbound API, Management API, Audit API).
|
||||
|
||||
`Program.cs` calls these methods; the component libraries own the registration logic. This keeps the Host thin and each component self-contained.
|
||||
|
||||
### Component registration by role
|
||||
|
||||
| Component | Central | Site | `AddXxx` | Actors | `MapXxx` |
|---|:---:|:---:|:---:|:---:|:---:|
| ClusterInfrastructure | Yes | Yes | Yes | Yes | — |
| Communication | Yes | Yes | Yes | Yes | — |
| HealthMonitoring | Yes | Yes | Yes | Yes | — |
| ExternalSystemGateway | Yes | Yes | Yes | Yes | — |
| AuditLog | Yes | Yes | Yes | Yes | — |
| NotificationService | Yes | No | Yes | — | — |
| NotificationOutbox | Yes | No | Yes | Yes (singleton) | — |
| SiteCallAudit | Yes | No | Yes | Yes (singleton) | — |
| TemplateEngine | Yes | No | Yes | Yes | — |
| DeploymentManager | Yes | No | Yes | Yes | — |
| Security | Yes | No | Yes | — | — |
| CentralUI | Yes | No | Yes | — | Yes |
| InboundAPI | Yes | No | Yes | — | Yes |
| ManagementService | Yes | No | Yes | Yes | Yes |
| Transport | Yes | No | Yes | — | — |
| ConfigurationDatabase | Yes | No | Yes | — | — |
| SiteRuntime | No | Yes | Yes | Yes (singleton) | — |
| DataConnectionLayer | No | Yes | Yes | Yes | — |
| StoreAndForward | No | Yes | Yes | Yes | — |
| SiteEventLogging | No | Yes | Yes | Yes (singleton) | — |
|
||||
|
||||
`AuditLog` calls `AddAuditLog` on both roles; central additionally calls `AddAuditLogCentralMaintenance`. Site calls `AddAuditLogHealthMetricsBridge` to bridge write failures into the site health report.
|
||||
|
||||
## Configuration
|
||||
|
||||
Options are bound via the .NET Options pattern (`IOptions<T>`). Each component owns its options class; the Host binds each section and passes the `IConfiguration` to component extension methods only where the component's own validator needs it at startup.
|
||||
|
||||
### `ScadaBridge:Node` → `NodeOptions`
|
||||
|
||||
| Key | Default | Description |
|-----|---------|-------------|
| `Role` | — | `"Central"` or `"Site"`. Validated by `StartupValidator`. |
| `NodeHostname` | — | Hostname or IP advertised to the Akka cluster and enriched on log entries. |
| `NodeName` | — | Free-form semantic name stamped as `SourceNode` on audit rows (e.g. `"central-a"`, `"node-b"`). Empty normalises to `null`. |
| `SiteId` | — | Site identifier; required for Site nodes; used to scope cluster singletons and enrich telemetry. |
| `RemotingPort` | `8081` | Akka.NET remoting TCP port. Must be in range 1–65535. |
| `GrpcPort` | `8083` | Kestrel HTTP/2 port for the site gRPC stream server (Site nodes only). Must differ from `RemotingPort`. |
| `MetricsPort` | `8084` | Kestrel HTTP/1+2 port for the Prometheus `/metrics` scrape endpoint (Site nodes only). Must differ from both `RemotingPort` and `GrpcPort`. |
|
||||
|
||||
### `ScadaBridge:Cluster` → `ClusterOptions`
|
||||
|
||||
| Key | Default | Description |
|-----|---------|-------------|
| `SeedNodes` | — | List of Akka seed-node URIs (`akka.tcp://scadabridge@host:port`). At least 2 required. Must reference remoting ports, not gRPC ports. |
| `SplitBrainResolverStrategy` | — | Active strategy name (e.g. `"keep-oldest"`). |
| `StableAfter` | `"00:00:15"` | Duration the cluster must be stable before the resolver acts. |
| `HeartbeatInterval` | `"00:00:02"` | Akka failure-detector heartbeat cadence. |
| `FailureDetectionThreshold` | `"00:00:10"` | Acceptable heartbeat pause before a node is considered unreachable. |
| `MinNrOfMembers` | `1` | Minimum number of members that must have joined before the leader moves `Joining` members to `Up`. |
| `DownIfAlone` | `true` | When using `keep-oldest`, whether the oldest node is downed if it ends up partitioned from all other nodes. |
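
For orientation, a site node's `appsettings.json` fragment combining the two sections above might look like the following. The hostnames, `SiteId`, and `NodeName` values are purely illustrative assumptions; the ports and timings are the documented defaults.

```json
{
  "ScadaBridge": {
    "Node": {
      "Role": "Site",
      "NodeHostname": "site1-a.example.internal",
      "NodeName": "site1-a",
      "SiteId": "site1",
      "RemotingPort": 8081,
      "GrpcPort": 8083,
      "MetricsPort": 8084
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@site1-a.example.internal:8081",
        "akka.tcp://scadabridge@site1-b.example.internal:8081"
      ],
      "SplitBrainResolverStrategy": "keep-oldest",
      "StableAfter": "00:00:15",
      "HeartbeatInterval": "00:00:02",
      "FailureDetectionThreshold": "00:00:10",
      "MinNrOfMembers": 1,
      "DownIfAlone": true
    }
  }
}
```

Note that both seed-node URIs use the remoting port (8081), not the gRPC port.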
|
||||
|
||||
### `ScadaBridge:Database` → `DatabaseOptions`
|
||||
|
||||
| Key | Role | Description |
|-----|------|-------------|
| `ConfigurationDb` | Central | MS SQL connection string for the central `ScadaBridgeDbContext`. Required; validated by `StartupValidator`. |
| `SiteDbPath` | Site | Filesystem path to the site-local SQLite database. Required for Site nodes. |
|
||||
|
||||
### `ScadaBridge:Logging` → `LoggingOptions`
|
||||
|
||||
| Key | Default | Description |
|-----|---------|-------------|
| `MinimumLevel` | `"Information"` | Serilog minimum log level. Overrides any `Serilog:MinimumLevel` entry — a one-shot warning is emitted to `stderr` if both are present. Parsed case-insensitively; unrecognised values fall back to `Information` with a warning. |
|
||||
|
||||
Serilog sinks (console output template, file path, rolling interval) are configured under the standard `Serilog` JSON section and applied via `ReadFrom.Configuration`. Every log entry is enriched with `SiteId`, `NodeHostname`, and `NodeRole` properties from the resolved node configuration.
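
The case-insensitive parsing and fallback described for `MinimumLevel` could be sketched roughly as below. The enum `LogLevelSketch` is a local stand-in that mirrors Serilog's six level names so the example has no package dependency; the class, method, and warning text are assumptions rather than the Host's actual code.

```csharp
using System;

// Local stand-in mirroring Serilog's level names; not the real LogEventLevel type.
public enum LogLevelSketch { Verbose, Debug, Information, Warning, Error, Fatal }

public static class MinimumLevelSketch
{
    // Parses the configured level case-insensitively; falls back to Information with a warning.
    public static LogLevelSketch Parse(string configured, Action<string> warn)
    {
        if (string.IsNullOrWhiteSpace(configured))
            return LogLevelSketch.Information;

        // Enum.TryParse also accepts numeric strings, so reject undefined values explicitly.
        if (Enum.TryParse<LogLevelSketch>(configured, ignoreCase: true, out var level)
            && Enum.IsDefined(typeof(LogLevelSketch), level))
        {
            return level;
        }

        warn($"Unrecognised MinimumLevel '{configured}'; falling back to Information.");
        return LogLevelSketch.Information;
    }
}
```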
|
||||
|
||||
### `ScadaBridge:InboundApi:ApiKeyStore`
|
||||
|
||||
| Key | Default | Description |
|-----|---------|-------------|
| `SqlitePath` | `data/inbound-api-keys.sqlite` under content root | Path to the SQLite store for inbound API keys. |
| `TokenPrefix` | `"sbk"` | Prefix for issued API key tokens. Fixed; injected by the Host as in-memory config. |
| `PepperSecretName` | `"ScadaBridge:InboundApi:ApiKeyPepper"` | Configuration key holding the peppered-HMAC secret. The pepper itself must be ≥ 16 characters; validated by `StartupValidator`. |
| `RunMigrationsOnStartup` | `true` | Whether the hosted service creates the SQLite schema on first run. |
|
||||
|
||||
All other per-component configuration sections (`ScadaBridge:Communication`, `ScadaBridge:HealthMonitoring`, `ScadaBridge:Security`, `ScadaBridge:InboundApi`, `ScadaBridge:NotificationOutbox`, `ScadaBridge:Transport`, `ScadaBridge:DataConnection`, `ScadaBridge:StoreAndForward`, `ScadaBridge:SiteEventLog`, `ScadaBridge:SiteRuntime`, `ScadaBridge:Notification`) are bound by their respective component extension methods. The Host binds them at the shared `BindSharedOptions` call or at the role-specific `Configure<T>` sites in `Program.cs` and `SiteServiceRegistration.Configure`.
|
||||
|
||||
## Dependencies & Interactions
|
||||
|
||||
- **All 19 component libraries** — the Host project-references every component to call its extension methods. The Host is the only project with this fan-out; component libraries do not reference each other except where documented.
|
||||
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — the Host configures the underlying Akka.NET cluster (`AkkaHostedService.BuildHocon`); ClusterInfrastructure manages it at runtime.
|
||||
- [Configuration Database (#17)](./ConfigurationDatabase.md) — the Host registers `ScadaBridgeDbContext` and calls `AddConfigurationDatabase` (Central only); the `StartupRetry`-wrapped migration step runs before traffic is accepted.
|
||||
- [Central–Site Communication (#5)](./Communication.md) — the Host creates `CentralCommunicationActor` and `SiteCommunicationActor`, registers them with `ClusterClientReceptionist`, and wires the `ClusterClient` for site→central messaging; the gRPC server is mapped at `app.MapGrpcService<SiteStreamGrpcServer>()`.
|
||||
- [Health Monitoring (#11)](./HealthMonitoring.md) — the Host registers health checks (`DatabaseHealthCheck`, `AkkaClusterHealthCheck`, `ActiveNodeHealthCheck`) and mounts them via `app.MapZbHealth()` on central; site nodes register `AddSiteHealthMonitoring` and `AkkaHealthReportTransport`.
|
||||
- [Audit Log (#23)](./AuditLog.md) — the Host calls `AddAuditLog` on both roles, `AddAuditLogCentralMaintenance` on central, and `AddAuditLogHealthMetricsBridge` on site; it creates the `AuditLogIngestActor` singleton and registers `SiteAuditTelemetryActor` on the dedicated dispatcher.
|
||||
- [Notification Outbox (#21)](./NotificationOutbox.md) — the Host creates the `NotificationOutboxActor` cluster singleton and hands its proxy to `CentralCommunicationActor`.
|
||||
- [Site Call Audit (#22)](./SiteCallAudit.md) — the Host creates the `SiteCallAuditActor` cluster singleton with a graceful-stop drain task registered in the `cluster-leave` coordinated-shutdown phase.
|
||||
- [Management Service (#18)](./ManagementService.md) — the Host creates `ManagementActor` and registers it with `ClusterClientReceptionist`; maps the Management and Audit HTTP APIs.
|
||||
- [Traefik Proxy (#20)](./TraefikProxy.md) — Traefik polls `/health/active` to determine which central node to route traffic to; the Host implements the `ActiveNodeHealthCheck` and `ActiveNodeGate` that back this endpoint.
|
||||
- Design spec: [Component-Host.md](../requirements/Component-Host.md).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Node fails to start with validation errors
|
||||
|
||||
`StartupValidator` throws before any DI or actor system setup. The exception message lists all failing keys and their expected constraints. Common causes: missing `ScadaBridge:Node:Role`, a `GrpcPort`/`RemotingPort` collision on a site node, a seed-node URI that accidentally points at the gRPC port rather than the remoting port, or a missing `ConfigurationDb` connection string on a central node.
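
To make the port-related constraints concrete, here is a hedged sketch of the kind of checks involved. It is not the real `StartupValidator`: the class name, method shape, and error wording are assumptions, and detecting a mis-pointed seed node by comparing its port to `GrpcPort` is only one plausible approach.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of the documented port rules; not the actual StartupValidator.
public static class NodePortValidationSketch
{
    public static IReadOnlyList<string> Validate(
        int remotingPort, int grpcPort, int metricsPort, IEnumerable<string> seedNodes)
    {
        var errors = new List<string>();

        foreach (var (key, port) in new[]
                 {
                     ("RemotingPort", remotingPort),
                     ("GrpcPort", grpcPort),
                     ("MetricsPort", metricsPort),
                 })
        {
            if (port < 1 || port > 65535)
                errors.Add($"ScadaBridge:Node:{key} must be in range 1-65535 (was {port}).");
        }

        if (grpcPort == remotingPort)
            errors.Add("ScadaBridge:Node:GrpcPort must differ from RemotingPort.");

        if (metricsPort == remotingPort || metricsPort == grpcPort)
            errors.Add("ScadaBridge:Node:MetricsPort must differ from RemotingPort and GrpcPort.");

        foreach (var seed in seedNodes)
        {
            // Seed nodes must target Akka remoting, so a seed on the gRPC port is suspicious.
            if (Uri.TryCreate(seed, UriKind.Absolute, out var uri) && uri.Port == grpcPort)
                errors.Add($"Seed node '{seed}' points at GrpcPort {grpcPort}; use the remoting port.");
        }

        // All failures are collected so one exception can list every offending key.
        return errors;
    }
}
```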
|
||||
|
||||
### Central node loops on database migration
|
||||
|
||||
`StartupRetry` makes up to 8 attempts when it hits connection-class faults (roughly 2 minutes of accumulated backoff in the worst case). If every attempt fails, the process exits with a `Fatal` log entry. Permanent errors (such as a schema-version mismatch detected by `MigrationHelper`) are not retried and exit on the first attempt. Check the `SqlException` details in the log to distinguish a connectivity failure from a schema fault.
|
||||
|
||||
### Dead letters appearing at startup
|
||||
|
||||
A burst of dead letters during startup is normal: actors send messages before their targets finish `PreStart`. `DeadLetterMonitorActor` logs each at `Warning` and increments the health counter — these are observable on the site health report. Sustained dead letters after the cluster stabilises indicate a stale actor reference or a lifecycle race.
|
||||
|
||||
### Standby central node receives traffic
|
||||
|
||||
If Traefik is not yet polling `/health/active` or its health-check interval has not elapsed after a failover, traffic may briefly reach the standby. `ActiveNodeGate` returns `false` on the standby, causing the Inbound API endpoint filter to respond `503 Service Unavailable`. The response header `X-ScadaBridge-Active: false` is present so the condition is identifiable in access logs. No operator action is needed; Traefik will reroute on its next health-check cycle.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Host design specification](../requirements/Component-Host.md)
|
||||
- [Cluster Infrastructure](./ClusterInfrastructure.md)
|
||||
- [Central–Site Communication](./Communication.md)
|
||||
- [Configuration Database](./ConfigurationDatabase.md)
|
||||
- [Health Monitoring](./HealthMonitoring.md)
|
||||
- [Audit Log](./AuditLog.md)
|
||||
- [Notification Outbox](./NotificationOutbox.md)
|
||||
- [Site Call Audit](./SiteCallAudit.md)
|
||||
- [Management Service](./ManagementService.md)
|
||||
- [Traefik Proxy](./TraefikProxy.md)
|
||||
@@ -0,0 +1,230 @@
|
||||
# Security & Auth
|
||||
|
||||
The Security & Auth component handles user authentication against an LDAP/Active Directory server and enforces role-based authorization across all central cluster operations. It owns the cookie+JWT hybrid session model, the LDAP-group-to-role mapping pipeline, site-scoped deployment permissions, and the inbound API key management seam.
|
||||
|
||||
## Overview
|
||||
|
||||
Security & Auth (#10) runs exclusively on the central cluster — sites have no user-facing interface and perform no independent authentication. The component code lives in `src/ZB.MOM.WW.ScadaBridge.Security/`, which is a component library (it accepts no `IConfiguration` directly; wiring is Options-pattern only). The Host composition root calls `AddZbLdapAuth` with the `ScadaBridge:Security:Ldap` section before calling `AddSecurity`, because the shared LDAP service is config-coupled and the component library is not allowed to bind configuration itself.
|
||||
|
||||
The component registers:
|
||||
|
||||
- `JwtTokenService` — token generation, validation, idle-timeout enforcement, and sliding refresh logic.
|
||||
- `RoleMapper` — DB-backed LDAP-group-to-role resolution with site-scope union semantics.
|
||||
- `ScadaBridgeGroupRoleMapper` — adapter exposing `RoleMapper` on the shared `IGroupRoleMapper<string>` seam.
|
||||
- `HttpAuditActorAccessor` — resolves the authenticated username from the ambient HTTP context for audit `Actor` stamping.
|
||||
- `LibraryInboundApiKeyAdmin` — implements `IInboundApiKeyAdmin` over the shared `ApiKeyAdminCommands` facade.
|
||||
- ASP.NET Core cookie authentication (sliding idle window, HttpOnly/Secure defaults via `ZbCookieDefaults.Apply`).
|
||||
- Authorization policies (`RequireAdmin`, `RequireDesign`, `RequireDeployment`, `OperationalAudit`, `AuditExport`).
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### Direct LDAP bind
|
||||
|
||||
Authentication uses a direct username/password bind against the LDAP/AD server via the shared `ILdapAuthService` (`ZB.MOM.WW.Auth.Ldap`). The flow is: service-account bind → search for the user entry by username → user-credential bind → group-membership query. The app never caches credentials locally. LDAPS (port 636) or StartTLS is required for production; the `AllowInsecure` flag in `LdapOptions` gates unencrypted use to explicitly opted-in deployments (local dev only). No Kerberos/NTLM path exists.
|
||||
|
||||
LDAP failure behavior is fail-closed at the login boundary and fail-open at the session boundary: a new login fails immediately if the directory is unreachable; an active session (valid cookie+JWT) continues with its current claims until the JWT expires. This avoids disrupting engineers mid-work during a brief directory outage. When the directory recovers, the next token refresh re-queries groups and issues a fresh token.
|
||||
|
||||
### Cookie+JWT hybrid session
|
||||
|
||||
On successful login the server mints a JWT via `JwtTokenService.GenerateToken` and writes it into an HttpOnly/Secure authentication cookie. The cookie is the transport — not a bearer header — because Blazor Server's persistent SignalR circuits do not carry `Authorization` headers on reconnect. The browser sends the cookie on every HTTP and SignalR request automatically.
|
||||
|
||||
The JWT is signed with HMAC-SHA256 using a shared symmetric key (`JwtSigningKey`). Both central nodes share the same key, so either node can issue and validate tokens without a shared session store; the load balancer needs no sticky-session configuration. `ClockSkew` is set to `TimeSpan.Zero` to close the standard five-minute tolerance window.
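
The reason a shared symmetric key removes the need for a session store can be shown with plain HMAC-SHA256 from the .NET base class library. This is a simplified sketch, not `JwtTokenService`: it signs an arbitrary payload string rather than a real JWT, and the class and method names are hypothetical. The 32-byte minimum mirrors the documented `JwtSigningKey` requirement.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Simplified illustration: any node holding the same key can verify another node's signature.
public static class SharedKeySigningSketch
{
    public static byte[] Sign(byte[] key, string payload)
    {
        // Mirrors the documented minimum of 32 bytes (256 bits) for the signing key.
        if (key.Length < 32)
            throw new ArgumentException("Signing key must be at least 32 bytes.", nameof(key));

        using var hmac = new HMACSHA256(key);
        return hmac.ComputeHash(Encoding.UTF8.GetBytes(payload));
    }

    public static bool Verify(byte[] key, string payload, byte[] signature) =>
        // Constant-time comparison avoids leaking how many bytes matched.
        CryptographicOperations.FixedTimeEquals(Sign(key, payload), signature);
}
```

A signature produced on one central node verifies on the other as long as both are configured with the same key bytes; no state has to be exchanged between them.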
|
||||
|
||||
Claims embedded in every token:
|
||||
|
||||
| Claim type | Constant | Value |
|---|---|---|
| `ZbClaimTypes.DisplayName` | `JwtTokenService.DisplayNameClaimType` | Human-readable display name |
| `ZbClaimTypes.Username` | `JwtTokenService.UsernameClaimType` | Authenticated username |
| `ClaimTypes.Role` (URI) | `JwtTokenService.RoleClaimType` | One claim per role |
| `ZbClaimTypes.ScopeId` | `JwtTokenService.SiteIdClaimType` | One claim per permitted site (Deployer only) |
| `"LastActivity"` | `JwtTokenService.LastActivityClaimType` | ISO 8601 idle-timeout anchor |
|
||||
|
||||
`MapInboundClaims = false` and `MapOutboundClaims` cleared on both mint and validate paths prevent `JwtSecurityTokenHandler`'s default claim-type rewrite maps from transforming the canonical role URI or `zb:` claim types, keeping the type strings byte-for-byte identical in the token and in every policy check.
|
||||
|
||||
### Token lifecycle and idle timeout
|
||||
|
||||
The JWT lifetime (`JwtExpiryMinutes`, default 15 minutes) and the cookie idle window (`IdleTimeoutMinutes`, default 30 minutes) are separate layers. ASP.NET cookie auth's `SlidingExpiration = true` with `ExpireTimeSpan = IdleTimeout` models the idle window: the middleware re-issues the cookie once the session passes its halfway mark, keeping active users signed in. The JWT within that cookie has its own 15-minute expiry.
|
||||
|
||||
`JwtTokenService.ShouldRefresh` checks whether remaining JWT lifetime is below `JwtRefreshThresholdMinutes` (default 5 minutes). `RefreshToken` issues a fresh JWT while **preserving** the existing `LastActivity` anchor — a background refresh is not treated as user activity. `RecordActivity` advances the anchor to now. `IsIdleTimedOut` checks whether the elapsed time since `LastActivity` exceeds `IdleTimeoutMinutes`; `RefreshToken` enforces the idle check internally so an idle-expired session cannot be kept alive by background polling regardless of caller discipline (Security-014).
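
The interplay of the three timings can be sketched as pure functions over explicit timestamps. This is an illustration only: the real `JwtTokenService` methods operate on tokens and claims, whereas the names and signatures below are assumptions chosen to isolate the time arithmetic.

```csharp
using System;

// Hypothetical time arithmetic behind ShouldRefresh / IsIdleTimedOut / RefreshToken.
public static class SessionTimingSketch
{
    // True when the JWT has less than the refresh threshold of lifetime remaining.
    public static bool ShouldRefresh(DateTimeOffset jwtExpiresAt, DateTimeOffset now, TimeSpan refreshThreshold) =>
        jwtExpiresAt - now < refreshThreshold;

    // True when more than the idle timeout has elapsed since the last genuine activity.
    public static bool IsIdleTimedOut(DateTimeOffset lastActivity, DateTimeOffset now, TimeSpan idleTimeout) =>
        now - lastActivity > idleTimeout;

    // Returns the new JWT expiry, or null when the session is idle-expired.
    // lastActivity is deliberately not advanced: a refresh is not user activity.
    public static DateTimeOffset? TryRefresh(
        DateTimeOffset lastActivity, DateTimeOffset now, TimeSpan idleTimeout, TimeSpan jwtLifetime)
    {
        if (IsIdleTimedOut(lastActivity, now, idleTimeout))
            return null;

        return now + jwtLifetime;
    }
}
```

With the defaults (15-minute JWT, 5-minute threshold, 30-minute idle window), a token minted at 08:00 becomes refresh-eligible just after 08:10, and a session whose last activity was at 08:00 can no longer be refreshed after 08:30, however often a background poll asks.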
|
||||
|
||||
## Architecture
|
||||
|
||||
### Registration split between Host and component
|
||||
|
||||
`AddSecurity` (component library) registers everything except the LDAP service itself:
|
||||
|
||||
```csharp
public static IServiceCollection AddSecurity(this IServiceCollection services)
{
    services.AddScoped<JwtTokenService>();
    services.AddScoped<RoleMapper>();
    services.AddHttpContextAccessor();
    services.AddSingleton<IAuditActorAccessor, HttpAuditActorAccessor>();
    services.AddScoped<IGroupRoleMapper<string>, ScadaBridgeGroupRoleMapper>();

    services.AddAuthentication(CookieAuthenticationDefaults.AuthenticationScheme)
        .AddCookie(options =>
        {
            options.LoginPath = "/login";
            options.LogoutPath = "/auth/logout";
        });

    services.AddOptions<CookieAuthenticationOptions>(CookieAuthenticationDefaults.AuthenticationScheme)
        .Configure<IOptions<SecurityOptions>, ILoggerFactory>((cookieOptions, securityOptions, loggerFactory) =>
        {
            ZbCookieDefaults.Apply(
                cookieOptions,
                requireHttps: securityOptions.Value.RequireHttpsCookie,
                idleTimeout: TimeSpan.FromMinutes(securityOptions.Value.IdleTimeoutMinutes));

            var cookieName = securityOptions.Value.CookieName;
            cookieOptions.Cookie.Name = string.IsNullOrWhiteSpace(cookieName)
                ? SecurityOptions.DefaultCookieName
                : cookieName;
        });

    services.AddScadaBridgeAuthorization();
    return services;
}
```
|
||||
|
||||
The Host composition root calls `AddZbLdapAuth(configuration, LdapSectionPath)` before `AddSecurity()`. `AddZbLdapAuth` registers `ILdapAuthService` as a singleton, binds `LdapOptions` from `ScadaBridge:Security:Ldap`, and registers `IValidateOptions<LdapOptions>` with `ValidateOnStart` so a misconfigured directory fails at boot rather than at first login.
|
||||
|
||||
### Role mapping and site scoping
|
||||
|
||||
`RoleMapper.MapGroupsToRolesAsync` loads all `LdapGroupMapping` rows from the database, matches the supplied LDAP group names (case-insensitive), and accumulates roles. For the `Deployer` role it also loads associated `SiteScopeRule` rows — each row carries a `SiteId` limiting that mapping to one site. Union semantics (Security-016): if any matched Deployer mapping has no scope rules, the result is system-wide and all accumulated site IDs are discarded:
|
||||
|
||||
```csharp
var isSystemWide = hasUnscopedDeploymentMapping
    || (hasDeploymentRole && !hasScopedDeploymentMapping);

if (isSystemWide)
{
    permittedSiteIds.Clear();
}

return new RoleMappingResult(
    matchedRoles.ToList(),
    permittedSiteIds.ToList(),
    isSystemWide);
```
|
||||
|
||||
`ScadaBridgeGroupRoleMapper` adapts `RoleMappingResult` onto the shared `IGroupRoleMapper<string>` seam, carrying the full `RoleMappingResult` (including `PermittedSiteIds` and `IsSystemWideDeployment`) as the opaque `Scope` field so no site-scope information is lost at the seam boundary.
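
The union semantics can be exercised end to end with an in-memory sketch. Everything here is illustrative: `MappingSketch`, `ResultSketch`, and `RoleMappingSketch` are hypothetical types that flatten the `LdapGroupMapping` and `SiteScopeRule` rows into one record, and the real `RoleMapper` reads from the database asynchronously.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flattened view of one LDAP-group mapping plus its site-scope rules.
public sealed record MappingSketch(string LdapGroup, string Role, IReadOnlyList<string> SiteIds);

public sealed record ResultSketch(
    IReadOnlyList<string> Roles, IReadOnlyList<string> PermittedSiteIds, bool IsSystemWide);

public static class RoleMappingSketch
{
    public static ResultSketch Map(IEnumerable<string> userGroups, IEnumerable<MappingSketch> mappings)
    {
        // Group names are matched case-insensitively.
        var groups = new HashSet<string>(userGroups, StringComparer.OrdinalIgnoreCase);
        var roles = new HashSet<string>(StringComparer.Ordinal);
        var siteIds = new HashSet<string>(StringComparer.Ordinal);
        var hasUnscopedDeployer = false;

        foreach (var mapping in mappings.Where(m => groups.Contains(m.LdapGroup)))
        {
            roles.Add(mapping.Role);
            if (mapping.Role != "Deployer")
                continue;

            if (mapping.SiteIds.Count == 0)
                hasUnscopedDeployer = true;      // one unscoped mapping wins over all scoped ones
            else
                siteIds.UnionWith(mapping.SiteIds);
        }

        // System-wide access makes the accumulated site list meaningless, so discard it.
        if (hasUnscopedDeployer)
            siteIds.Clear();

        return new ResultSketch(roles.ToList(), siteIds.ToList(), hasUnscopedDeployer);
    }
}
```

A user in two scoped Deployer groups receives the union of both site lists; adding membership in any unscoped Deployer group flips the result to system-wide with an empty site list.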
|
||||
|
||||
### Authorization policies
|
||||
|
||||
Five named policies are registered by `AuthorizationPolicies.AddScadaBridgeAuthorization`. Every policy uses `RequireClaim(JwtTokenService.RoleClaimType, ...)` — no custom requirement handlers — making the policy check a direct look-up into the JWT's role claims.
|
||||
|
||||
| Policy | Constant | Roles satisfied |
|---|---|---|
| `RequireAdmin` | `AuthorizationPolicies.RequireAdmin` | `Administrator` |
| `RequireDesign` | `AuthorizationPolicies.RequireDesign` | `Designer` |
| `RequireDeployment` | `AuthorizationPolicies.RequireDeployment` | `Deployer` |
| `OperationalAudit` | `AuthorizationPolicies.OperationalAudit` | `Administrator`, `Viewer` |
| `AuditExport` | `AuthorizationPolicies.AuditExport` | `Administrator` |
|
||||
|
||||
Role names are declared in `Roles` (the single source of truth). The four active roles (`Administrator`, `Designer`, `Deployer`, `Viewer`) are the canonical subset of the shared six-role vocabulary; `Operator` and `Engineer` exist upstream but are not used. The `OperationalAudit` and `AuditExport` role arrays are public (`AuthorizationPolicies.OperationalAuditRoles`, `AuthorizationPolicies.AuditExportRoles`) so the ManagementService HTTP API can reuse the exact same sets when gating `/api/audit/*` routes through its own Basic-Auth + LDAP role check.
|
||||
|
||||
### LDAP failure messages
|
||||
|
||||
`LdapAuthFailureMessages.ToMessage` maps the structured `LdapAuthFailure` enum from the shared library to user-facing strings. `BadCredentials` and `UserNotFound` both return the generic "Invalid username or password." — intentionally identical to prevent username enumeration. `AmbiguousUser` and `ServiceAccountBindFailed` (which also covers a directory that is unreachable at connect/search time) return a misconfiguration message. `GroupLookupFailed` (post-bind directory outage, or a successful-but-empty group set) returns a transient-outage message.
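
The mapping described above amounts to a small switch. The sketch below uses a local enum rather than the shared library's `LdapAuthFailure`, and only the "Invalid username or password." string is quoted from this document; the exact wording of the other two messages is an assumption based on the troubleshooting headings.

```csharp
using System;

// Local stand-in for the shared library's LdapAuthFailure enum.
public enum LdapAuthFailureSketch
{
    BadCredentials,
    UserNotFound,
    AmbiguousUser,
    ServiceAccountBindFailed,
    GroupLookupFailed,
}

public static class FailureMessagesSketch
{
    public static string ToMessage(LdapAuthFailureSketch failure) => failure switch
    {
        // Identical text for both cases so a caller cannot probe which usernames exist.
        LdapAuthFailureSketch.BadCredentials or LdapAuthFailureSketch.UserNotFound
            => "Invalid username or password.",

        // Assumed wording: configuration or data problems on the server side.
        LdapAuthFailureSketch.AmbiguousUser or LdapAuthFailureSketch.ServiceAccountBindFailed
            => "Authentication service is misconfigured.",

        // Assumed wording: the bind worked but group lookup did not.
        LdapAuthFailureSketch.GroupLookupFailed
            => "The directory is temporarily unavailable.",

        _ => throw new ArgumentOutOfRangeException(nameof(failure)),
    };
}
```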
|
||||
|
||||
### Inbound API key management
|
||||
|
||||
`LibraryInboundApiKeyAdmin` implements the Commons `IInboundApiKeyAdmin` seam over the shared `ApiKeyAdminCommands` facade. Keys use the `sbk_<keyId>_<secret>` token format (prefix `sbk`), with the key ID as a 32-hex-character GUID (`"N"` format, no hyphens, because hyphens are not valid in the delimiter-separated token). The library stores keys in a SQLite file (`data/inbound-api-keys.sqlite` by default). Scopes in the library map 1:1 to method names in ScadaBridge. Delete is implemented as revoke-then-delete because the library only permits deleting an already-revoked key.
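
The token layout can be illustrated with a small formatter and parser. This is a sketch under assumptions: the class and method names are hypothetical, the secret's alphabet and length are not specified here, and the shared library's real verifier may parse tokens differently.

```csharp
using System;

// Hypothetical helper showing the sbk_<keyId>_<secret> layout.
public static class ApiKeyTokenSketch
{
    public const string Prefix = "sbk";

    // The "N" format yields 32 hex characters with no hyphens.
    public static string Format(Guid keyId, string secret) => $"{Prefix}_{keyId:N}_{secret}";

    public static bool TryParse(string token, out Guid keyId, out string secret)
    {
        keyId = Guid.Empty;
        secret = string.Empty;

        // Split into at most three parts so the secret may itself contain underscores (assumption).
        var parts = token.Split('_', 3);
        if (parts.Length != 3 || parts[0] != Prefix || parts[2].Length == 0)
            return false;

        if (!Guid.TryParseExact(parts[1], "N", out keyId))
            return false;

        secret = parts[2];
        return true;
    }
}
```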
|
||||
|
||||
### Data Protection key sharing
|
||||
|
||||
`AddConfigurationDatabase` calls `AddDataProtection().PersistKeysToDbContext<ScadaBridgeDbContext>()`. Both central nodes therefore read and write Data Protection keys from the same MS SQL database, which means either node can protect and unprotect the same data (including the cookie payload) regardless of which node issued it — a prerequisite for load-balancer failover transparency.
|
||||
|
||||
## Usage
|
||||
|
||||
Login flow (Central UI `/auth/login` and `/auth/token`):
|
||||
|
||||
1. Call `ILdapAuthService.AuthenticateAsync(username, password)` (registered by Host via `AddZbLdapAuth`).
|
||||
2. On success, call `RoleMapper.MapGroupsToRolesAsync(ldapGroups)` to resolve roles and site scope.
|
||||
3. Call `JwtTokenService.GenerateToken(displayName, username, roles, permittedSiteIds)` to mint a signed JWT.
|
||||
4. Write the JWT into the HttpOnly cookie via the ASP.NET cookie auth `SignInAsync`.
|
||||
|
||||
On each subsequent request, middleware reads the cookie, validates the embedded JWT with `JwtTokenService.ValidateToken`, and checks `IsIdleTimedOut`. If the token is near expiry (`ShouldRefresh`), fresh claims are re-queried from LDAP and `RefreshToken` issues a replacement. Genuine user interactions call `RecordActivity` to advance the last-activity anchor.
|
||||
|
||||
Authorization gates use the named policies:
|
||||
|
||||
```csharp
// Razor page or controller — declarative
[Authorize(Policy = AuthorizationPolicies.RequireDesign)]

// ManagementActor — imperative, reusing the same role arrays
if (!AuthorizationPolicies.OperationalAuditRoles.Contains(userRole))
    return Unauthorized();
```
|
||||
|
||||
## Configuration
|
||||
|
||||
`SecurityOptions` is bound from the `ScadaBridge:Security` section. LDAP connection settings are bound separately from `ScadaBridge:Security:Ldap` (the `LdapSectionPath` constant) into the shared `LdapOptions` by `AddZbLdapAuth`.
|
||||
|
||||
### `ScadaBridge:Security` — `SecurityOptions`
|
||||
|
||||
| Key | Default | Description |
|---|---|---|
| `JwtSigningKey` | *(required)* | Symmetric HMAC-SHA256 signing key. Must be at least 32 bytes (256 bits); validated at `JwtTokenService` construction — startup fails if too short. |
| `JwtExpiryMinutes` | `15` | JWT lifetime in minutes before the embedded token expires and must be refreshed. |
| `JwtRefreshThresholdMinutes` | `5` | Minutes before JWT expiry at which `ShouldRefresh` triggers a re-issue. |
| `IdleTimeoutMinutes` | `30` | Session idle timeout in minutes. Cookie `ExpireTimeSpan` is set to this value; `IsIdleTimedOut` enforces it from the `LastActivity` claim. |
| `RequireHttpsCookie` | `true` | When `true`, the cookie is `Secure`-only (HTTPS required). Set `false` for HTTP-only dev deployments; a warning is logged at startup. |
| `CookieName` | `ZB.MOM.WW.ScadaBridge.Auth` | Authentication cookie name. Override per deployment when two ScadaBridge stacks share a hostname — browsers scope cookies by host+path but not by port. |
|
||||
|
||||
### `ScadaBridge:Security:Ldap` — `LdapOptions` (shared library)
|
||||
|
||||
| Key | Description |
|---|---|
| `Server` | LDAP/AD server hostname or IP. |
| `Port` | LDAP port. Use 636 for LDAPS or 389 for StartTLS. |
| `Transport` | `Ldaps`, `StartTls`, or `None` (dev only — requires `AllowInsecure = true`). |
| `AllowInsecure` | Must be `true` to permit `Transport = None`. Default `false`. |
| `SearchBase` | LDAP search base DN (e.g. `dc=corp,dc=example,dc=com`). |
| `ServiceAccountDn` | Service-account distinguished name used for the initial bind and group search. |
| `ServiceAccountPassword` | Service-account password. |
|
||||
|
||||
`LdapOptionsValidator` (registered with `ValidateOnStart` by `AddZbLdapAuth`) enforces that `Server`, `SearchBase`, `ServiceAccountDn`, and a secure transport are configured before the first request is served.
|
||||
|
||||
## Dependencies & Interactions
|
||||
|
||||
- [Commons (#16)](./Commons.md) — defines `ISecurityRepository` (LDAP mapping + scope rule read/write), `IInboundApiKeyAdmin` (key admin seam), `IAuditActorAccessor` (audit actor resolution), `LdapGroupMapping`, and `SiteScopeRule` entities, plus `ManagementEnvelope` (carries username/roles into every Management command).
|
||||
- [Configuration Database (#17)](./ConfigurationDatabase.md) — provides the scoped `ISecurityRepository` implementation (`SecurityRepository`) backed by `LdapGroupMappings` and `SiteScopeRules` tables in MS SQL, and hosts the Data Protection key ring via `PersistKeysToDbContext<ScadaBridgeDbContext>()`.
|
||||
- [Central UI (#9)](./CentralUI.md) — every Blazor Server page and Razor component passes through cookie authentication and named policy authorization. The login page drives the LDAP bind → role map → token mint flow. The Admin → LDAP Mappings page is gated by `RequireAdmin` and calls `ISecurityRepository` directly.
|
||||
- [Management Service (#18)](./ManagementService.md) — the `ManagementActor` enforces role and site-scope rules on every incoming command using identity carried in the `ManagementEnvelope`. The CLI authenticates users via the same LDAP bind and passes identity in every request.
|
||||
- [Inbound API (#14)](./InboundAPI.md) — inbound API requests authenticate via `X-API-Key` (library verifier, `sbk_*` token format) rather than the cookie/JWT path. `HttpAuditActorAccessor` resolves the authenticated username for audit `Actor` stamping on the interactive path; the inbound API path keeps its own actor/fallback.
|
||||
- [Audit Log (#23)](./AuditLog.md) — `IAuditActorAccessor` is a seam this component implements; the Inbound API audit path calls `CurrentActor` to record the authenticated user as the event actor.
|
||||
- [Transport (#24)](./Transport.md) — export gates on `RequireDesign`; import gates on `RequireAdmin`, enforced at both the Razor page layer and inside the Transport service entrypoints.
|
||||
- Design spec: [Component-Security.md](../requirements/Component-Security.md).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Login fails: "Authentication service is misconfigured"
|
||||
|
||||
This message maps from `LdapAuthFailure.ServiceAccountBindFailed` or `LdapAuthFailure.AmbiguousUser`. The service-account DN or password in `ScadaBridge:Security:Ldap` is wrong, the LDAP server is unreachable at connect time, or the search for the username returned more than one entry. Check `ServiceAccountDn`, `ServiceAccountPassword`, and `Server` in configuration. `LdapOptionsValidator` enforces these keys at startup, so a complete absence fails fast — this error at login time indicates a runtime connectivity or data problem.
|
||||
|
||||
### Login fails: "The directory is temporarily unavailable"
|
||||
|
||||
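This message maps from `LdapAuthFailure.GroupLookupFailed`: the user-credential bind succeeded, but the subsequent group-membership query failed or returned an empty group set. The directory is therefore at least partially reachable, so focus on the group search (service-account permissions, `SearchBase`, or a post-bind outage) rather than on the user's credentials. Existing sessions with valid JWTs continue to operate unaffected.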
|
||||
|
||||
### Session expires unexpectedly
|
||||
|
||||
Check `IdleTimeoutMinutes` and `JwtExpiryMinutes` in `SecurityOptions`. A background refresh that fires while the user is idle preserves the `LastActivity` anchor (`RefreshToken` does not advance it); `IsIdleTimedOut` enforces the window from the last genuine user interaction. If the idle timeout fires before the expected window, confirm that `RecordActivity` is being called on genuine user requests.
|
||||
|
||||
### Two ScadaBridge environments on the same host clobber each other's session
|
||||
|
||||
Set a distinct `CookieName` in `ScadaBridge:Security` for each deployment. Browsers scope cookies by host+path, not by port, so two stacks on `localhost:9000` and `localhost:9100` share cookie namespace under the default name.
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Security & Auth design specification](../requirements/Component-Security.md)
|
||||
- [Configuration Database](./ConfigurationDatabase.md)
|
||||
- [Commons](./Commons.md)
|
||||
- [Central UI](./CentralUI.md)
|
||||
- [Management Service](./ManagementService.md)
|
||||
- [Inbound API](./InboundAPI.md)
|
||||
- [Audit Log](./AuditLog.md)
|
||||
- [Transport](./Transport.md)
|
||||