Can a 96GB MacBook Pro really run ds4 plus DeepSeek V4 Flash?

Yes, but with caveats. The q2 quant weights take about 81GB; once the OS and Metal buffers are accounted for, less than 15GB is left for KV cache. The full 1M-token KV cache needs around 26GB, so on a 96GB machine the practical context window is roughly 100k tokens. The author recommends 128GB as the comfortable floor and a 512GB Mac Studio Ultra to unlock the full 1M-token context.

How does ds4 relate to llama.cpp, LM Studio or Ollama?

ds4 is a Metal inference engine dedicated to DeepSeek V4 Flash, not a generic GGUF runner. As of May 2026 neither llama.cpp nor LM Studio supports the V4 architecture, so on a Mac ds4 is essentially the only option for V4. Ollama works for DeepSeek R1 and older models but does not handle V4.

Why not just rent a Linux GPU cloud for DeepSeek V4 instead?

You can, but V4 Flash weights are 160GB and V4-Pro reaches 865GB, so on Linux you need high-memory H100, H200 or B200 nodes whose monthly cost typically exceeds an equivalent unified-memory Mac. Linux GPU clouds also lack Apple's UMA, KV-on-disk and native iOS toolchain, making a Mac VPS plus GPU cloud split a better long-term ROI than a pure GPU setup.

2026 antirez ds4 Runs DeepSeek V4 on Mac: 96/128/512GB Threshold, Metal Benchmarks and Mac VPS Decision

In May 2026 Redis creator antirez open-sourced ds4 (DwarfStar 4) and made DeepSeek V4 Flash run on a Mac at usable speeds for the first time, racking up 11K GitHub stars in days. But the 96GB entry, 128GB comfortable and 512GB Pro-capable memory tiers translate into a $4,500 to $15,000 Mac price tag, which lands hard on every independent developer. This guide is for the developers and small teams who are pulled in by ds4 yet refuse to ship code or private data to a third-party API: eight sections covering hardware thresholds, the Metal benchmark matrix, a three-way decision table, a reproducible Runbook and a FAQ, plus a Mac VPS plus DeepSeek V4 plus ds4 elastic compute combination.

1. What ds4 is

In May 2026 Redis creator antirez released ds4 (DwarfStar 4), a pure-C native inference engine purpose-built for DeepSeek V4 Flash, with the main path targeting Metal on macOS and CUDA on Linux. The author wrote it in a single week of fourteen-hour days, wiring V4 prompt rendering, KV state handling, OpenAI-style tool calling and an integrated coding agent into one self-contained binary. It crossed 11K GitHub stars in its first days. The "one model at a time" design is a deliberate bet: as of writing neither llama.cpp nor LM Studio supports V4, so on a Mac ds4 is the only realistic way to run it.

2. DeepSeek V4 Flash and V4-Pro specs in one table

DeepSeek shipped both V4 variants on 2026-04-24 under MIT, with a 1M-token context window:

Spec	V4 Flash	V4-Pro
Total parameters	284B (MoE)	1.6T (MoE)
Active per token	13B	49B
Context window	1,000,000 tokens	1,000,000 tokens
Max output	384,000 tokens	384,000 tokens
Weights on disk	~160 GB (FP4 + FP8 mixed)	~865 GB (FP4 + FP8 mixed)
License	MIT	MIT
Local viability	High-end consumer Mac	Only 512GB Mac Studio or multi-GPU server

Unlike V3 which split thinking and non-thinking models into separate IDs, V4 turns reasoning effort into a request parameter (non-thinking, thinking, max-thinking). The inference engine therefore loads one set of weights and reuses the KV cache across modes. Flash's 13B activated parameters are the key reason it runs on a Mac at all: after MoE routing each token costs roughly the same as a dense 13B model rather than a dense 30B.

3. Hardware threshold reality: 96/128/256/512GB

Many posts simply say "ds4 needs 96GB" and forget that KV cache also competes for memory. The real picture, combining the ds4 README and community testing:

Memory tier	Model	Quant	Context ceiling	Typical hardware	Reference price
96 GB	V4 Flash	q2	~100k tokens	MacBook Pro M3/M4 Max	$4,500+
128 GB	V4 Flash	q2 recommended	~250-300k tokens	MacBook Pro / Mac Studio Max	$5,500+
256 GB	V4 Flash	q4 high quality	500k+ tokens	Mac Studio M3/M4 Ultra	$8,500+
512 GB	V4 Flash + V4-Pro q2	q4 / q2-Pro	Near 1M tokens	Mac Studio M3 Ultra top spec	$15,000+

The q2 weights alone are 81GB, plus system RSS and Metal buffers leave less than 15GB on a 96GB box for KV. A full 1M-token KV cache needs about 26GB, so a 96GB machine realistically caps at roughly 100k context, with longer sessions paging or OOMing. 128GB is the no-brainer floor, and 512GB is the only configuration where V4 truly becomes production inference infrastructure.

4. Metal benchmark matrix

Official figures from the ds4 repository, covering short prompts and ~11K-12K-token long prompts:

Machine	Quant	Prompt length	Prefill	Generation
MacBook Pro M3 Max, 128GB	q2	short	58.52 t/s	26.68 t/s
MacBook Pro M3 Max, 128GB	q2	11,709 tokens	250.11 t/s	21.47 t/s
Mac Studio M3 Ultra, 512GB	q2	short	84.43 t/s	36.86 t/s
Mac Studio M3 Ultra, 512GB	q2	11,709 tokens	468.03 t/s	27.39 t/s
Mac Studio M3 Ultra, 512GB	q4	short	78.95 t/s	35.50 t/s
Mac Studio M3 Ultra, 512GB	q4	12,018 tokens	448.82 t/s	26.62 t/s
NVIDIA DGX Spark GB10, 128GB	q2	7,047 tokens	343.81 t/s	13.75 t/s

Three takeaways: long-prompt prefill on the Mac Studio M3 Ultra is nearly twice as fast as the MBP M3 Max, in line with UMA bandwidth; q2 and q4 generation on the Ultra are essentially tied (36.86 vs 35.50 t/s), so q4 buys you quality almost for free if memory permits; and the DGX Spark posts a strong prefill but only 13.75 t/s generation, half of the Ultra, suggesting the CUDA path is still maturing and Apple Silicon unexpectedly owns the consumer-grade V4 sweet spot in H1 2026.

5. Decision matrix: buy a top-spec Mac, rent a Mac VPS or rent GPU cloud

The one table that drives the decision:

Dimension	Buy top-spec Mac	Rent a Mac VPS	Linux GPU cloud (H100/H200)
Upfront cost	$4,500-$15,000	$0, monthly	$0, hourly
Monthly cost (128GB equivalent)	~$200-$350 depreciation	$200-$550 by tier	$2,000-$4,000 per H100
Run V4 Flash q2	Native Metal	Native Metal	CUDA branch needed
Run V4-Pro	Only on $15K 512GB top spec	Switch to a 512GB instance	Multi-GPU H200 / B200
Privacy boundary	Strongest, on-device	Strong, dedicated instance	Weaker, shared physical host
Elastic scaling	None, hardware locked	Up and down on demand	Extremely elastic hourly
iOS / macOS toolchain	Native	Native	Not supported
Retirement risk	50%+ depreciation in two years	None	None

The reading: if you run inference one or two hours a day, renting a Mac VPS is cheaper than buying outright. If you also need training or long fine-tuning runs, keep the Mac VPS as the control plane and push training onto a GPU cloud. The worst trap is the middle tier, paying $8K for a 256GB Mac Studio and discovering a year later that DeepSeek V5 or new quant standards have already shifted the optimal configuration faster than the hardware depreciates.

6. Why a Mac: UMA, Metal and the on-disk KV cache

Three reasons. First, Apple Silicon's unified memory architecture (UMA) lets the GPU directly address all 512GB on a Mac Studio without PCIe round-trips, a physical advantage no discrete GPU rig replicates: an RTX 5090's 32GB VRAM cannot hold V4 Flash's 160GB weights, even four 5090s cannot hold V4-Pro at q4, while a single Mac Studio M3 Ultra loads V4-Pro Q4 at 160-180W TDP. Second, the macOS NVMe SSD plus ds4's on-disk KV cache persists session context across runs, eliminating minutes of re-prefill that ephemeral GPU containers can rarely achieve without extra block storage and custom protocols. Third, the current macOS CPU path has a virtual-memory kernel bug that panics the host when running ds4 on CPU, which means only a Metal-capable high-memory Mac is a usable target.

7. Minimum reproducible Runbook: ds4 in five steps on a Mac VPS

End-to-end from zero to Cursor on a 128GB VPSMAC Mac VPS:

Step 1: clone and build Metal binaries. SSH into the Mac VPS, install Xcode Command Line Tools, then:

git clone https://github.com/antirez/ds4.git
cd ds4 && make    # produces ./ds4 and ./ds4-server

Step 2: download a V4 Flash q2 GGUF. Recommended community quants such as IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8 weigh in around 81GB; use aria2c -x 16 or huggingface-cli download in the background to avoid hogging the SSH session. Step 3: start ds4-server and verify on-disk KV:

./ds4-server -m ./ds4flash.gguf --ctx 128000 \
             --kv-disk ./kv-cache --port 8080
curl -s http://127.0.0.1:8080/v1/models

Step 4: connect Cursor, opencode or your own agent. ds4-server exposes an OpenAI-compatible /v1/chat/completions endpoint with Tool Calling; set the OpenAI API base in Cursor to http://your-mac-vps:8080/v1 and use ssh -L 8080:127.0.0.1:8080 user@mac-vps to keep the port on loopback rather than the public internet. Step 5: keep ds4-server alive with launchd. Drop a launchd plist into ~/Library/LaunchAgents/ with KeepAlive and stdout/stderr log paths, load it with launchctl load, and stream macOS logs to catch panics, ideally wired into your existing OpenClaw alerting.

8. Mac VPS plus ds4: the elastic local-inference combo

A common question is whether to skip Mac entirely and use Linux GPU cloud, Docker containers or a Windows AI PC for V4. Each of these has hard issues: Linux GPU clouds lack UMA, so V4 Flash needs H100 or H200 nodes whose monthly cost dwarfs the equivalent Mac Studio; Docker on macOS adds Apple Virtualization and IO abstraction overhead and noticeably reduces throughput; Windows with a 32GB RTX 5090 simply cannot host V4 Flash at all; and buying a Mac outright locks you to specific hardware and a steep two-year depreciation curve. When you want one SSH habit to manage ds4 inference, the iOS toolchain, an OpenClaw gateway, launchd daemons and remote GPU orchestration in a single place, renting an Apple Silicon Mac VPS from VPSMAC is usually the better answer: run ds4 on a dedicated 128/256/512GB instance, switch memory tiers on demand, and when you eventually need multi-GPU training delegate to CoreWeave, Lambda or RunPod (see the in-site CoreWeave decision matrix) with the Mac VPS still acting as the control plane. The combined TCO beats stacking everything onto a single GPU node.

9. FAQ

Can ds4 coexist with OpenClaw? Yes. ds4-server defaults to port 8080 and OpenClaw Gateway listens on 18789; they do not collide. Point OpenClaw's provider at ds4's OpenAI-compatible endpoint and your agent can call V4 locally without paying any third-party API. See the in-site OpenClaw v2026.5.20 upgrade Runbook.

Are the ROCm and CUDA branches usable today? The CUDA main branch supports DGX Spark (GB10) and generic CUDA GPUs via make cuda-spark or make cuda-generic; ROCm lives in a community-maintained branch that the author rebases without direct AMD hardware, so production users should prefer Metal or CUDA. When will llama.cpp or LM Studio support V4? Neither has merged V4 support as of May 2026; V4 uses custom DeepSeek ops and reasoning scheduling that take significant porting effort, likely several more months. Until then ds4 is essentially the only V4 engine on Mac. How do you avoid forgetting a rented instance running idle? Combine launchd with a small "alert if no active request for X hours" script, or configure ds4-server to exit after an idle timeout and pair it with the VPSMAC console hourly billing to auto-stop the instance.

10. Conclusion

antirez's ds4 turned "run DeepSeek V4 locally" from a theory into a workable engineering project, and the engineering boundary is the hardware threshold: 96GB is the entry ticket, 128GB the comfortable floor, 512GB the only no-compromise local inference target. Buying a top-spec Mac is both a five-figure check and a hidden depreciation bill two years later. Renting a Mac VPS flattens the curve, lets you spin up 128, 256 or 512GB instances on demand, upgrades V4 Flash to V4-Pro without swapping hardware, and pairs naturally with a GPU cloud for training, which is the most realistic 2026 path for ds4, local V4 and the Apple toolchain together.

2026 antirez ds4 Runs DeepSeek V4 on Mac: 96/128/512GB Memory Thresholds, Metal Benchmarks and a Buy vs Rent Mac VPS Decision Matrix

Table of contents