Persist the Galaxy browse dataset to disk for offline startup

The gateway can lose connectivity to the Galaxy database, and the
database is often unreachable exactly when the gateway restarts. The
hierarchy cache was purely in-memory, so a cold start with no database
left clients with an Unavailable browse surface until SQL came back.

Add a JSON snapshot store: each successful heavy refresh writes the raw
hierarchy and attribute rowsets to disk atomically (temp file + rename),
and the first refresh after startup restores that snapshot before any
SQL runs. Restored data is served as Stale until a live query confirms
it; a live query that observes the same time_of_last_deploy promotes it
to Healthy with no heavy re-query.

Persistence is on by default (MxGateway:Galaxy:PersistSnapshot) and
writes to C:\ProgramData\MxGateway\galaxy-snapshot.json unless
MxGateway:Galaxy:SnapshotCachePath overrides the path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-22 02:03:00 -04:00
parent aba228f443
commit fa491c752b
11 changed files with 739 additions and 32 deletions
@@ -89,6 +89,36 @@ load to complete before returning. If the first load fails or times out,
the client gets `Unavailable` with a short reason. Once any load completes
(success or failure), this wait is skipped on subsequent calls.

### On-disk snapshot

The gateway may lose connectivity to the Galaxy database — and the database is
often unreachable right when the gateway itself restarts. To keep browse
working across that gap, the cache persists its dataset to disk:
- After every successful **heavy** refresh (a deploy change), the raw
hierarchy and attribute rowsets are written to
`MxGateway:Galaxy:SnapshotCachePath`
(default `C:\ProgramData\MxGateway\galaxy-snapshot.json`). The write is
atomic — a temp file plus rename — so a crash mid-write cannot corrupt the
snapshot. Cheap no-change ticks write nothing; the file is already current.
- On the **first** refresh after startup, before any SQL runs, the cache
reloads that file. The restored data is served with `Stale` status —
it is last-known data, not live — so clients can browse immediately even
when the Galaxy database is unreachable.
- The first live query then reconciles: if it observes the **same**
`time_of_last_deploy` the snapshot was saved at, the entry is promoted to
`Healthy` with no heavy re-query (the snapshot is provably current); if it
observes a newer deploy, the heavy queries run and replace the snapshot; if
the database is still unreachable, the entry stays `Stale`.
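
The three reconciliation outcomes above can be sketched as a small decision function. This is an illustrative Python sketch, not the gateway's code: `Status`, `Entry`, and `reconcile` are hypothetical names, the deploy timestamp is modeled as an opaque string, and any differing value is treated as a newer deploy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Status(Enum):
    UNAVAILABLE = "Unavailable"
    STALE = "Stale"
    HEALTHY = "Healthy"


@dataclass
class Entry:
    # Hypothetical cache entry: its status plus the time_of_last_deploy
    # value the data was captured at.
    status: Status
    deploy_time: Optional[str]


def reconcile(entry: Entry, live_deploy_time: Optional[str]) -> Tuple[Entry, bool]:
    """Return (entry_after_check, run_heavy_queries) for the first live query.

    live_deploy_time is None when the database is still unreachable.
    """
    if live_deploy_time is None:
        # No live answer: keep serving the restored data unchanged.
        return entry, False
    if entry.deploy_time == live_deploy_time:
        # Same deploy as the snapshot: promote without a heavy re-query.
        return Entry(Status.HEALTHY, live_deploy_time), False
    # Assumption: any different value means a newer deploy, so the heavy
    # queries must run and their result replaces the snapshot.
    return entry, True
```

With a restored entry `Entry(Status.STALE, "t1")`, a live value of `"t1"` promotes it to `Healthy` and skips the heavy queries, a different value requests them, and `None` leaves the entry `Stale`.
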

`is_alarm` / `is_historized` filters, paging, and the dashboard summary all
work against a restored snapshot exactly as against a live pull — the restore
path runs the same materialization. Persistence is disabled by setting
`MxGateway:Galaxy:PersistSnapshot` to `false`; the snapshot file is then
neither written nor read, and a cold start with an unreachable database comes
up `Unavailable` as before. The on-disk file is a cache, not a system of
record: deleting it only forces the next cold start to wait for live SQL.
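
The write and restore behaviour described in this section (temp file plus rename, no file access when persistence is off, and a missing file simply meaning "nothing to restore") can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the gateway's implementation: `save_snapshot`, `load_snapshot`, and the `persist` flag are hypothetical names, and treating an unparseable file like a missing one is an assumption the text above does not confirm.

```python
import json
import os
import tempfile
from typing import Optional


def save_snapshot(path: str, snapshot: dict, persist: bool = True) -> None:
    """Write the snapshot via temp file + rename; no-op when persistence is off."""
    if not persist:
        return
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Create the temp file in the target directory so the rename never
    # crosses a volume boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        # Swap the finished file into place; readers see old or new, never half.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the previous snapshot untouched and clean up the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path: str, persist: bool = True) -> Optional[dict]:
    """Return the saved snapshot, or None if disabled, missing, or unreadable."""
    if not persist:
        return None
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # Assumption: a damaged file is treated like a missing one, so the
        # caller falls back to waiting for live SQL.
        return None
```

Because the loader returns `None` for a missing file, deleting the snapshot costs nothing beyond a cold start that waits for the database, which matches the "cache, not a system of record" framing.
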

## Deploy Notifications

`WatchDeployEvents` is a server-streaming RPC backed by
@@ -291,6 +321,8 @@ Bound to `MxGateway:Galaxy` via `GalaxyRepositoryOptions`.
|--------|---------|-------------|
| `MxGateway:Galaxy:ConnectionString` | `Server=localhost;Database=ZB;Integrated Security=True;TrustServerCertificate=True;Encrypt=False;` | SQL Server connection string for the Galaxy Repository. Integrated Security against `localhost` is the dev default; production deployments should override this through the standard double-underscore environment variable form, e.g. `MxGateway__Galaxy__ConnectionString`. |
| `MxGateway:Galaxy:CommandTimeoutSeconds` | `60` | Per-command SQL timeout. Applies to all three RPCs. |
| `MxGateway:Galaxy:PersistSnapshot` | `true` | Persists the browse dataset to disk after each successful heavy refresh and reloads it on the first refresh after startup. See [On-disk snapshot](#on-disk-snapshot). |
| `MxGateway:Galaxy:SnapshotCachePath` | `C:\ProgramData\MxGateway\galaxy-snapshot.json` | File path for the persisted browse snapshot. Ignored when `PersistSnapshot` is `false`. |
The connection string is not treated as a secret in dev (`Integrated
Security`), but production deployments that use SQL authentication should set