2026 OpenClaw MEMORY.md and Session Context Governance: Auditable Runbook for Mac Cloud 7×24
Once the gateway is green, teams still hit a wall: replies slow down, invoices climb, and the bot keeps re-asking decisions you thought were settled. That pattern usually points to unbounded session context and a MEMORY file that became an append-only junk drawer, not a weak model. This article names who gets burned, what you gain from disciplined layers, and delivers a symptom matrix, at least five operational steps aligned with Gateway logs, quotable thresholds, and FAQ hooks. Pair it with our OpenClaw observability and JSONL guide—that piece owns probes and ladders; this one owns memory and context economics.
1. Summary: silent context inflation
In 2026 a healthy openclaw doctor and an open port prove orchestration, not that every prompt stays lean. Each turn still concatenates chat history, tool payloads, and any injected long-term notes. When MEMORY.md grows without structure, retrieval noise beats real facts and latency tracks conversation depth more than provider status pages. Governance here is closer to product hygiene than classic uptime monitoring: you need ownership rules for what becomes long-term truth, how often facts merge, and which telemetry stays in JSONL instead of being pasted back into memory. The sections below separate common false positives, give a printable matrix for on-call, and finish with a weekly checklist that references the same time windows you already use for Gateway JSONL reviews.
Operators who skip this plane often oscillate between two failure modes: they either starve the agent of useful memory and get brittle answers, or they dump entire chat logs into MEMORY and wonder why every call feels expensive. A steady middle path—short durable facts, long structured headings, and aggressive trimming of ephemeral chatter—is what makes 7×24 assistants trustworthy for business workflows.
2. Pain points: four misreads
These stories repeat whenever an agent runs seven days a week on a Mac mini in the cloud or a small VPS:
- Blaming the model first: When the tenth reply in a thread is slow but the first was snappy, sample the approximate injected context size from logs or your own counters before swapping endpoints.
- Treating repetition as low IQ: If policies live inside a thousand-line MEMORY blob without headings, the model may never stably surface them; restructure before tuning temperature.
- Never compacting weekly notes: Append-only MEMORY turns into archaeology. The failure is procedural, not a missing feature flag.
- Confusing OOM with context debt: Exit 137 and cgroup restarts point to memory limits; pure context bloat usually keeps the process alive while per-request latency balloons. Starting on the wrong plane burns hours.
3. Matrix: memory vs resources vs gateway
Hang this beside the probe table from the observability article so shifts argue with data, not vibes.
| Symptom | Primary plane | Fast evidence | Usually not the root cause |
|---|---|---|---|
| Slower each turn, fresh thread is fine | Session context | Compare first vs tenth turn latency; look for huge tool JSON echoed verbatim | Random provider slowdown |
| Cost up, answers short | Hidden long context / duplicate attachments | Correlate billing lines with log fields per request | Vendor price conspiracy |
| Breaks last week rules | MEMORY structure drift | Line count, heading integrity, stale sections | Model family regression |
| Process vanishes, container restarts | Resources | Exit codes, cgroup events, disk free space | Prompt edits |
| Channel silent, probe fails | Gateway and plugins | gateway status, channel probes, ladder from observability guide | MEMORY cleanup |
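For on-call scripts, the matrix above can be mirrored as a small lookup. The symptom keys and plane names below are illustrative shorthand for the table rows, not an OpenClaw API:

```python
# Sketch of a triage helper mirroring the matrix above.
# Symptom keys and plane names are illustrative, not an OpenClaw API.
TRIAGE = {
    "slower_each_turn_fresh_thread_ok": "session context",
    "cost_up_answers_short": "hidden long context / duplicate attachments",
    "breaks_last_week_rules": "MEMORY structure drift",
    "process_vanishes_restarts": "resources",
    "channel_silent_probe_fails": "gateway and plugins",
}

def primary_plane(symptom: str) -> str:
    """Return the plane to investigate first; unknown symptoms start at resources."""
    # Defaulting to "resources" matches the escalation order in section 5.
    return TRIAGE.get(symptom, "resources")

print(primary_plane("breaks_last_week_rules"))  # MEMORY structure drift
```

Keeping the mapping in one place lets shifts paste the same plane name into tickets instead of re-arguing the table.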
Layering baseline
Keep at least two layers: durable facts that change rarely and deserve audit trails, and session preferences that can be discarded each sprint. Durable content needs stable headings; never let one paragraph hold fifty decisions. Session data should not auto-promote into durable memory without a human or scripted merge review. Cadence-wise, plan a fixed weekly merge window for durable notes and trigger session trims on iteration boundaries or size thresholds.
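A minimal sketch of that two-layer discipline, assuming a hypothetical MEMORY.md layout with `## Durable facts` and `## Session preferences` headings, where the promote step refuses to run without an explicit review flag:

```python
from datetime import date

def promote(memory_text: str, fact: str, reviewed: bool) -> str:
    """Append a session note to the durable section, but only after a merge review."""
    if not reviewed:
        # Session data must not auto-promote into durable memory.
        raise ValueError("session notes require a human or scripted merge review")
    # Split on the (hypothetical) session heading so the fact lands in the durable block.
    head, sep, tail = memory_text.partition("## Session preferences")
    entry = f"- {date.today().isoformat()}: {fact}\n"
    return head + entry + sep + tail

memory = (
    "## Durable facts\n- Invoices go out on Fridays.\n"
    "## Session preferences\n- Prefer terse replies.\n"
)
memory = promote(memory, "Refund window is 14 days.", reviewed=True)
```

The dated entry doubles as the audit trail the durable layer deserves; an unreviewed call fails loudly instead of silently polluting long-term truth.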
4. Five steps: weekly rhythm and log alignment
Walk through manually before you automate with launchd or cron on the Mac host:
- Freeze baseline: Record MEMORY.md line count, last modified time, and any config flags that affect context length. Drop the numbers into a ticket.
- Weekly merge: Fold new facts into the right sections, delete contradictions, and ban untitled dumps.
- Drift audit prompt: Ask the agent to list three hard rules still in effect and compare the answer with MEMORY; mismatches mark drift.
- Align Gateway JSONL: For the same window, tail structured logs using the order from the observability article. If rate limits and spawn anomalies are quiet yet latency is high, return to context sizing.
- Backup before rewrite: Snapshot MEMORY and critical workspace files to a dated folder; rollback is then file restore plus gateway reload.
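The JSONL alignment step can be sketched as a simple time-window filter. The `ts` and `latency_ms` field names here are assumptions for illustration, not guaranteed Gateway log fields; substitute whatever your logs actually emit:

```python
import json
from datetime import datetime, timezone

def records_in_window(jsonl_lines, start, end, ts_field="ts"):
    """Yield parsed JSONL records whose timestamp falls in [start, end)."""
    for line in jsonl_lines:
        rec = json.loads(line)
        ts = datetime.fromisoformat(rec[ts_field])
        if start <= ts < end:
            yield rec

# Two sample log lines; field names are assumptions, not real Gateway output.
logs = [
    '{"ts": "2026-01-05T10:00:00+00:00", "latency_ms": 900}',
    '{"ts": "2026-01-05T11:30:00+00:00", "latency_ms": 4200}',
]
start = datetime(2026, 1, 5, 11, 0, tzinfo=timezone.utc)
end = datetime(2026, 1, 5, 12, 0, tzinfo=timezone.utc)
window = list(records_in_window(logs, start, end))
```

Using the same `start`/`end` for the MEMORY audit and the JSONL tail is what keeps the two reviews arguing about the same hour.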
Capture the baseline numbers in the ticket before automating anything.
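A minimal sketch of that capture, demonstrated against a throwaway file so it runs anywhere; point `capture_baseline` at your real MEMORY.md path in practice:

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def capture_baseline(memory_path: str) -> dict:
    """Record line count and last-modified time of a MEMORY file for the weekly ticket."""
    with open(memory_path, encoding="utf-8") as f:
        line_count = sum(1 for _ in f)
    mtime = datetime.fromtimestamp(os.path.getmtime(memory_path), tz=timezone.utc)
    return {
        "path": memory_path,
        "line_count": line_count,
        "last_modified": mtime.isoformat(),
    }

# Demo against a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "MEMORY.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write("## Durable facts\n- Invoices go out on Fridays.\n")
    baseline = capture_baseline(path)
print(json.dumps(baseline, indent=2))
```

Pasting the JSON into the ticket gives next week's merge a concrete before/after to diff against.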
5. Metrics you can quote
Use these in design reviews or incidents, then tune for your scale. Also log the median and p95 size of tool responses you allow back into chat; teams that cap or summarize tool JSON often cut latency more than any model swap. When multiple operators edit MEMORY by hand, keep a short changelog header at the top of each weekly merge so you know which human last promoted a session note into durable facts.
- Line count guardrail: Beyond roughly eight hundred to twelve hundred unstructured lines, humans stop finding anything; split chapters or move to an external knowledge base.
- Calendar time: Budget thirty to forty-five minutes every week for MEMORY hygiene instead of quarterly panic days.
- Latency ratio: Under the same model and channel, if turn-ten p95 exceeds turn-one p95 by about two to three times, inspect duplicated tool payloads before blaming the network.
- Disk headroom: JSONL, backups, and MEMORY archives sharing a volume still want roughly ten to fifteen gigabytes free on Mac cloud nodes to avoid jitter while logging.
- Exit 137 signal: Treat it as cgroup memory pressure until disproven; context-only issues rarely end with 137.
- Escalation order: Resources, then gateway probes, then memory governance; reversing the order creates circular debugging.
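The latency-ratio guardrail can be checked mechanically. This nearest-rank p95 sketch assumes you already collect per-turn latencies in milliseconds; the sample numbers are invented for illustration:

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def context_debt_suspect(turn1_ms, turn10_ms, ratio=2.5):
    """Flag when deep-turn p95 exceeds first-turn p95 by roughly two to three times."""
    return p95(turn10_ms) > ratio * p95(turn1_ms)

# Invented sample latencies: fresh-thread turns vs tenth turns in the same thread.
turn1 = [800, 900, 850, 950, 1000]
turn10 = [2600, 2900, 3100, 2700, 2500]
print(context_debt_suspect(turn1, turn10))  # True
```

When the flag fires, the matrix says to inspect duplicated tool payloads first, not the network or the provider.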
6. Why Mac cloud fits the memory plane
Noisy-neighbor VPS disks can mimic context storms because occasional read latency spikes feel like huge prompts. Windows remote desktops and consumer laptops add session sleep and graphics stacks that fight unattended agents. Docker adds another abstraction layer where volume mounts and uid mapping quietly desync the MEMORY path you think you edited. A dedicated Mac cloud machine behaves like a disciplined SSH server: predictable paths for logs, launchd jobs, and nightly archives, co-located with the Apple toolchain articles you already rely on for OpenClaw. Containers and generic VPS are fine for experiments, but when memory governance becomes production work, you want IO and ownership you can reason about—exactly what a leased Mac node from VPSMAC is meant to provide before you spend another week tuning prompts on shaky infrastructure.
Finally, treat MEMORY governance as part of cost governance: the same weekly review that trims files can include a five-minute glance at token dashboards so finance and engineering share one narrative. When both sides agree which metrics matter, you stop oscillating between unlimited context and emergency hard resets that confuse users mid-conversation.