Open Knowledge
Internals

Server lifecycle

Startup initialization, graceful shutdown phases, and the data-loss guarantees of createServer().destroy().

createServer() in packages/server/src/standalone.ts returns a ServerInstance with two lifecycle handles: a ready promise for startup completion and a destroy() function for graceful shutdown. Understanding the lifecycle matters for CLI signal handlers, Electron window lifecycle events, and integration tests.

This page covers the collab-server lifecycle owned by ok start (holds server.lock). The ok ui sibling has a simpler lifecycle — it acquires ui.lock, serves the React bundle + /api/config, and either exits on SIGTERM from ok start's idle-shutdown or fires its own 12h D-025 safety-net timer. Both locks are minted by the shared acquireProcessLock factory in packages/server/src/process-lock.ts; server-lock.ts and ui-lock.ts are thin adapters that pin lockName: 'server' / 'ui'.

Startup

createServer() returns synchronously. The very first side effect is acquiring the server lock; heavy initialization then runs in the background and resolves through the ready promise.

createServer(options)          ← synchronous, returns ServerInstance
  ├─ acquireServerLock         ← <contentDir>/.open-knowledge/server.lock
  │                              (thin adapter over acquireProcessLock;
  │                               throws ServerLockCollisionError — a subclass
  │                               of ProcessLockCollisionError — if a live
  │                               same-host PID already holds it)
  ├─ Hocuspocus instance       ← immediate (in-memory)
  ├─ ContentFilter             ← immediate (compiled patterns)
  ├─ AgentSessionManager       ← immediate
  ├─ Persistence extension     ← immediate (hooks registered)
  ├─ API extension             ← immediate (HTTP routes registered)
  └─ ready = initAsync()       ← async background init
       ├─ Shadow repo init     ← git plumbing (~50-200ms)
       ├─ HEAD-drift check     ← compare last-known-head vs current HEAD;
       │                         commitUpstreamImport() if diverged
       ├─ File watcher start   ← @parcel/watcher subscribe
       ├─ HEAD watcher start   ← @parcel/watcher subscribe on .git/
       └─ SyncEngine start     ← remote detection → auto-sync timers (if remote exists)
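The sync-return / async-ready shape above can be sketched as follows. This is a toy stub, not the real wiring: only the `ready`/`destroy` handle shape comes from the text, and everything inside the functions stands in for the actual lock, Hocuspocus, and extension setup.

```typescript
interface ServerInstance {
  ready: Promise<void>;
  destroy: () => Promise<void>;
}

// Returns synchronously; heavy initialization runs in the background and
// resolves through the `ready` promise.
function createServer(): ServerInstance {
  let initDone = false;
  const ready = (async () => {
    await Promise.resolve(); // stands in for shadow-repo init, watchers, sync engine
    initDone = true;
  })();
  return {
    ready,
    destroy: async () => {
      await ready; // destroy always waits for init, preventing leaked subscriptions
      initDone = false; // stands in for the six teardown phases
    },
  };
}
```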

The lock is written with port: 0 as a "starting, not yet bound" sentinel. The CLI (packages/cli/src/commands/start.ts) and the Vite dev plugin (packages/app/src/server/hocuspocus-plugin.ts) both call updateServerLockPort(lockDir, realPort) after http.listen() resolves so downstream tools (MCP, external clients) can read the actual port from the lock file.

ServerInstance.lockDir is exposed on the returned instance for this purpose. Mutations are ownership-guarded — a process whose pid does not match the lock's refuses to rewrite it.
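A minimal sketch of that port-sentinel dance, assuming an on-disk shape of `{ pid, port }`; the real helper's signature and lock-file schema may differ:

```typescript
import { readFileSync, writeFileSync } from 'node:fs';

interface LockFile { pid: number; port: number }

// Rewrites the lock's port after http.listen() resolves. Returns false when
// the ownership guard refuses: only the process that wrote the lock mutates it.
function updateLockPort(lockPath: string, realPort: number): boolean {
  const lock: LockFile = JSON.parse(readFileSync(lockPath, 'utf8'));
  if (lock.pid !== process.pid) return false; // not our lock
  writeFileSync(lockPath, JSON.stringify({ ...lock, port: realPort }));
  return true;
}
```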

Callers can await serverInstance.ready before accepting traffic, but it is not required: Hocuspocus accepts WebSocket connections immediately and queues document loads until initialization completes.

Graceful shutdown

destroy() tears down the server in six ordered phases (plus an interposed Phase 4.5 for the sync engine). Each phase has independent error handling, so a failure in one phase does not skip subsequent phases.

destroy()
  ├─ await ready               ← wait for init to finish (prevents leaked subscriptions)
  ├─ Phase 1: Stop watchers    ← file watcher + HEAD watcher unsubscribe
  ├─ Phase 2: Drain sessions   ← AgentSessionManager.closeAll()
  ├─ Phase 3: L1 flush         ← flushAllStoresAndWait() — Y.Doc → markdown → disk
  ├─ Phase 4: L2 flush         ← flushPendingGitCommit() + waitForPendingCommits() — disk → git
  ├─ Phase 4.5: SyncEngine     ← syncEngine.destroy() — persists state to sync-state.json
  ├─ Phase 5: Shadow release   ← write last-known-head + destroyShadowRepo() — always runs
  └─ Phase 6: Server lock      ← releaseServerLock() — LAST, in finally block

Phase ordering rationale

  • Phase 1 before Phase 3: Watchers must stop before L1 writes hit disk, otherwise the file watcher detects the persistence write as an "external change" and triggers a reconciliation loop.
  • Phase 2 before Phase 3: Agent sessions hold DirectConnections to Y.Docs. Closing sessions first ensures no new writes arrive during the drain.
  • Phase 3 before Phase 4: L1 (onStoreDocument) writes markdown to disk and schedules the L2 git-commit timer. Running L2 before L1 would drain an empty commit queue, causing silent data loss.
  • Phase 5 before Phase 6: The shadow writer lock protects git plumbing; release it before the process-level server lock so a restart sees a clean shadow root even if Phase 6 throws.
  • Phase 6 in finally: The server lock is the final release and is wrapped in a try/finally around all prior phases. A mid-shutdown throw in Phase 1–5 must still release server.lock, otherwise the next start would see a stale lock from a process that cleanly exited. Ownership-guarded — only a process whose pid matches the lock's removes it.

L1 drain: flushAllStoresAndWait()

Hocuspocus's built-in flushPendingStores() is fire-and-forget (its return type is void). To get an awaitable drain, flushAllStoresAndWait() mirrors the pattern used by @hocuspocus/server's internal Server.destroy():

  1. Push a one-shot afterUnloadDocument extension hook onto hocuspocus.configuration.extensions
  2. Close all connections via hocuspocus.closeConnections()
  3. Call hocuspocus.flushPendingStores() to trigger immediate store execution
  4. Await a promise that resolves when hocuspocus.getDocumentsCount() === 0

If no documents are loaded (hocuspocus.documents.size === 0), the helper returns immediately.

A Promise.race with a configurable timeout (destroyTimeoutMs, default 10 seconds) prevents infinite hangs from misbehaving onStoreDocument hooks. If the timeout fires, Phase 3 records the error in phaseErrors and continues to Phase 4.

Rescue buffer on flush timeout

When flushAllStoresAndWait() hits its destroyTimeoutMs ceiling with documents still loaded, each pending document's in-memory Y.Doc state is dumped to <shadow-gitDir>/rescue/<docName>.md before the timeout error propagates. The rationale: a timeout means onStoreDocument did not complete, so the in-memory Y.Doc is the only remaining record of the user's recent edits. Unlike the reconciliation-path rescue (which dumps only when the in-memory state diverges from the last reconciled base), the destroy-timeout rescue writes unconditionally — every still-loaded document gets a rescue file.

Rescue writes are best-effort per document: a serialization or writeFile failure for one document is logged via getLogger('server') under the [rescue] category and does not prevent the others from being rescued. The Phase 3 timeout error message names which documents were rescued versus lost, so operators can correlate the warn-level shutdown log with the on-disk rescue files:

flushAllStoresAndWait timeout after 10000ms — 2/3 docs did not unload: [my-notes, drafts]
  — rescued [my-notes], lost [drafts]

If the shadow repo failed to initialize (the shadowRef.current closure is nullish), rescue writes are skipped entirely and all still-loaded documents are reported as lost.

Users can recover rescued content via the existing GET /api/rescue (list) and GET /api/rescue/:docName (retrieve) endpoints. Rescue files older than 24 hours are cleaned up automatically.
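For orientation, the rescue-file location described above (<shadow-gitDir>/rescue/<docName>.md) can be constructed like this. The separator flattening is an assumption for illustration, not the real sanitization rule:

```typescript
import { join } from 'node:path';

// Maps a document name to its rescue dump path under the shadow git dir.
function rescuePath(shadowGitDir: string, docName: string): string {
  const safe = docName.replace(/[/\\]/g, '_'); // keep the dump inside rescue/
  return join(shadowGitDir, 'rescue', `${safe}.md`);
}
```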

Concurrent destroy: idempotency guard

A cached-promise guard (inflightDestroy) ensures that concurrent destroy() calls (common when both SIGINT and SIGTERM arrive) share a single teardown:

let inflightDestroy: Promise<void> | null = null;

async function destroy(): Promise<void> {
  if (inflightDestroy) return inflightDestroy;
  inflightDestroy = (async () => { /* ... phases ... */ })();
  return inflightDestroy;
}

The second caller awaits the same promise as the first. No phases run twice.

Shutdown log

Every destroy() call emits a structured log entry via getLogger('server') after all phases complete:

  • Clean exit: log.info with payload { documentCount, durationMs } and message "[server] shutdown flushed N documents in Mms".
  • Partial failure: log.warn with additional phaseErrors array, where each entry has { phase, error } identifying which phase failed and why.

Example clean output (structured JSON via pino):

{"level":"info","msg":"[server] shutdown flushed 3 documents in 247ms","documentCount":3,"durationMs":247}

destroyTimeoutMs option

ServerOptions.destroyTimeoutMs controls how long Phase 3 waits for all pending stores to drain before timing out. Default is 10_000 (10 seconds).

  • Tests typically pass 500 to reclaim CI wall-time.
  • Slow-disk or NFS environments may need a higher value if legitimate L1 flushes (markdown serialization + atomic file write per document) exceed 10 seconds.

This option is on the programmatic ServerOptions interface, not the YAML config schema. It is set by the caller of createServer() (CLI, Electron main process, test harness).
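A toy sketch of how the default resolves; the real ServerOptions interface has many more fields and this helper name is illustrative:

```typescript
interface ServerOptions { destroyTimeoutMs?: number }

const DEFAULT_DESTROY_TIMEOUT_MS = 10_000; // 10 seconds, per the text

function resolveDestroyTimeout(options: ServerOptions = {}): number {
  return options.destroyTimeoutMs ?? DEFAULT_DESTROY_TIMEOUT_MS;
}
```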

Data-loss guarantee boundary

After a successful destroy(), all writes that reached the server-side Y.Doc are flushed to disk (L1) and committed to the shadow git repo (L2). The guarantee boundary is:

  Layer                                       Guaranteed by destroy()
  Server Y.Doc → markdown on disk (L1)        Yes
  Disk → shadow git commit (L2)               Yes
  Client WebSocket buffer → server Y.Doc      No (out of scope)

The client-side transmission buffer (~16ms between a keystroke and its WebSocket flush to the server) is a separate, smaller concern. For CLI users, this client-side residual window is negligible compared to the server-side guarantees above. For Electron, a client-side drain barrier (e.g., flushing all active providers before closing the window) would need to be implemented separately; this is not yet built.

Idle-shutdown

ok start wires attachIdleShutdown (packages/server/src/idle-shutdown.ts) over the Hocuspocus HTTP listener. It counts WebSocket upgrade events on paths starting with /collab and fires onShutdown once the connected-client count has stayed at zero for the configured threshold (default 30 min, with a WARN log at 25 min).

"Idle" is WebSocket-only by design. DirectConnections opened by server-internal subsystems (CC1 broadcaster pre-materializing __system__, AgentSessionManager per-agent sessions) are invisible to the counter. The rationale (SPEC §9 D-017): DirectConnections are plentiful and long-lived; counting them would keep the server alive forever on empty worktrees. The cost is that an agent session left dangling at midnight doesn't keep the server alive — considered acceptable per NG10 in the Zero-Ceremony Resume SPEC.

On fire, onShutdown looks up ui.lock, SIGTERMs the ok ui sibling if alive, and then awaits destroy() — which runs the six-phase shutdown above, releasing server.lock as its final step. A try/finally guarantees the lock is released even if any phase throws.

D-025 safety-net on ok ui

ok ui independently arms a 12-hour timer (DEFAULT_UI_SAFETY_NET_MS in packages/cli/src/commands/ui.ts) that self-terminates the UI if ok start ever crashes hard enough to skip the idle-shutdown SIGTERM. It is not a substitute for idle-shutdown — it's a backstop for silent crashes. handle.release() cancels the timer (idempotent); handle.detachSafetyNet() cancels only the timer without releasing the lock.
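A minimal sketch of that backstop timer and its idempotent cancellation; the constant value follows the text, but the handle shape and internals are assumptions:

```typescript
const DEFAULT_UI_SAFETY_NET_MS = 12 * 60 * 60 * 1000; // 12 hours

function armSafetyNet(onExpire: () => void, ms = DEFAULT_UI_SAFETY_NET_MS) {
  let timer: ReturnType<typeof setTimeout> | null = setTimeout(onExpire, ms);
  const cancel = () => {
    // idempotent: a second call is a no-op
    if (timer) { clearTimeout(timer); timer = null; }
  };
  return { detachSafetyNet: cancel, release: cancel };
}
```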