$ df -h ./hardware

Pick the rig that runs the work

Bandwidth, not capacity, sets your tokens/sec. Match the machine to the model size you must load, then accept the speed its bandwidth allows.

T2 is the enrollment floor — it runs the full curriculum end to end.

~/mindpool/hardware
T1Entry
Apple
Mac mini M4 16–24GB
120 GB/s · $799–$999
Mini-PC
DIY NVIDIA
RTX 5060 Ti 16GB
448 GB/s · ~$1,400
T2Floor
Apple
MBP 14" M5 Pro 64GB
307 GB/s · ~$3,100
Mini-PC
Strix Halo 128GB (Beelink GTR9 Pro / GMKtec EVO-X2)
~180 GB/s eff · ~$1,985–$1,999
DIY NVIDIA
used RTX 4090 24GB
1,008 GB/s · ~$2,700
T3Pro
Apple
Mac Studio M3 Ultra 96GB
819 GB/s · $3,999
Mini-PC
ASUS Ascent GX10 (GB10) 128GB
273 GB/s · ~$3,099
DIY NVIDIA
RTX 5090 32GB
1,792 GB/s · ~$4,600
T4Research
Apple
MBP M5 Max 128GB
614 GB/s · ~$7,349
Mini-PC
2× GB10 clustered → 256GB
273 GB/s · ~$6,200+
DIY NVIDIA
dual RTX 3090 = 48GB
936 GB/s/card · ~$3,200
$ ls ./models --by-tier

What your box can actually run

Each model at its featured quant, against the four tiers: does it fit, and how fast does it decode? Speed is the bandwidth truth made literal — and it reads active params, which is why MoE models punch far above their size.

~/mindpool/models
# ── Gemma 4 ──
Gemma 4 E2B
2B · QAT UD-Q4_K_XL · 3 GB · multimodal · Apache-2.0 · source ↗
T1
~120–448 tok/s
T2
~180–1008 tok/s
T3
~273–1792 tok/s
T4
~273–936 tok/s
Gemma 4 E4B
4B · QAT UD-Q4_K_XL · 5 GB · multimodal · Apache-2.0 · source ↗
T1
~60–224 tok/s
T2
~90–504 tok/s
T3
~137–896 tok/s
T4
~137–468 tok/s
Gemma 4 12B
12B · QAT UD-Q4_K_XL · 7 GB · multimodal · Apache-2.0 · source ↗
T1
~20–75 tok/s
T2
~30–168 tok/s
T3
~46–299 tok/s
T4
~46–156 tok/s
Gemma 4 26B-A4B
26B (4B active) · QAT UD-Q4_K_XL · 15 GB · multimodal · Apache-2.0 · source ↗
T1
~60–224 tok/s
T2
~90–504 tok/s
T3
~137–896 tok/s
T4
~137–468 tok/s
Gemma 4 31B
31B · QAT UD-Q4_K_XL · 18 GB · multimodal · Apache-2.0 · source ↗
T1
— too big
T2
~12–65 tok/s
T3
~18–116 tok/s
T4
~18–60 tok/s
# ── Qwen3 ──
Qwen3 8B
8B · Q4_K_M · 6 GB · text · Apache-2.0 · source ↗
T1
~30–112 tok/s
T2
~45–252 tok/s
T3
~68–448 tok/s
T4
~68–234 tok/s
Qwen3 30B-A3B
30B (3B active) · Q4_K_M · 18 GB · text · Apache-2.0 · source ↗
T1
— too big
T2
~120–672 tok/s
T3
~182–1195 tok/s
T4
~182–624 tok/s
# ── Llama ──
Llama 8B
8B · Q4_K_M · 6 GB · text · Llama Community · source ↗
T1
~30–112 tok/s
T2
~45–252 tok/s
T3
~68–448 tok/s
T4
~68–234 tok/s
Llama 70B
70B · Q4_K_M · 42 GB · text · Llama Community · source ↗
T1
— too big
T2
~5–9 tok/s · 2/3 rigs
T3
~8–23 tok/s · 2/3 rigs
T4
~8–27 tok/s
Llama 405B
405B · Q4_K_M · 230 GB · text · Llama Community · source ↗
T1
— too big
T2
— too big
T3
— too big
T4
~1 tok/s · 1/3 rigs
# ── DeepSeek ──
DeepSeek-V2-Lite
16B (2.4B active) · Q4_K_M · 10 GB · text · DeepSeek · source ↗
T1
~100–373 tok/s
T2
~150–840 tok/s
T3
~228–1493 tok/s
T4
~228–780 tok/s
# ── OLMo 2 ──
OLMo 2 13B
13B · Q4_K_M · 8 GB · text · Apache-2.0 · source ↗
T1
~18–69 tok/s
T2
~28–155 tok/s
T3
~42–276 tok/s
T4
~42–144 tok/s
formula tok/s ≈ bandwidth ÷ (active-params × bytes/param)
Estimated decode ceiling — real-world is lower (attention, KV, overhead) and this ignores prefill. Uses active params, so MoE reads fast-and-small.
as of 2026-06 — curated, not exhaustive; models churn, re-verify each footprint at its source.

Why activeparams, not total, set speed — unpacked in the quantization & MoE depth-spike (coming with the curriculum).

$ cat ./if-you-are-X-buy-Y

The recommendation matrix

Tightest budget, want 70B in memory, OK with Linux + ROCm
Beelink GTR9 Pro (Strix Halo, 128 GB) · ~$1,985

Cheapest 128 GB box in volume; 70B Q4 ~5–8 tok/s.

Want a turnkey Windows 128 GB box, no building
GMKtec EVO-X2 (Strix Halo) · ~$1,999

Windows-native, OCuLink eGPU escape hatch.

Smoothest experience, value macOS + portability, models ≤34B
MacBook Pro 14" M5 Pro, 64 GB · ~$3,100

MLX just works for LoRA/QLoRA; 307 GB/s; ships now.

Fine-tuning-heavy, want fast tokens + GPU upgrade path
DIY used RTX 4090 24 GB · ~$2,700

1,008 GB/s; cleanest CUDA path; best $/capability.

Want NVIDIA's stack + 70B QLoRA in one mini box
ASUS Ascent GX10 (GB10, 128 GB) · ~$3,099

Cheapest GB10/DGX-class box; full CUDA, Linux-only.

Max memory + fast tokens, single vendor desk box
Mac Studio M3 Ultra, 96 GB · $3,999

819 GB/s — highest bandwidth shipping; 70B ~20–30 tok/s.

$ man bandwidth

The bandwidth truth

Generating each token reads the active weights out of memory, so your tok/s ceiling ≈ bandwidth ÷ active-weight-bytes. Capacity decides whether a model fits; bandwidth decides how fast it runs once it fits — and cloud spot (A100 ~$4–8/hr) is the relief valve for the rare 70B full fine-tune.

Mid-2026 street prices; the 2026 RAM shortage adds volatility — re-verify at purchase.

Got the rig? Get the toolchain.

One command installs and version-pins your local AI stack. Then enroll on-chain and start module 01.

// get the toolchain
$ curl -fsSL mindpool.io/install | sh
macOS · Linux · Windows (WSL2)
// or, on macOS (Apple Silicon)
$ brew install --cask mindpool-labs/tap/mpl