Pick the rig that runs the work
Bandwidth, not capacity, sets your tokens/sec. Match the machine to the model size you must load, then accept the speed its bandwidth allows.
▸ T2 is the enrollment floor — it runs the full curriculum end to end.
| tier | Apple | Mini-PC | DIY NVIDIA |
|---|---|---|---|
| T1 Entry | Mac mini M4 16–24GB 120 GB/s · $799–$999 | — | RTX 5060 Ti 16GB 448 GB/s · ~$1,400 |
| T2 Floor | MBP 14" M5 Pro 64GB 307 GB/s · ~$3,100 | Strix Halo 128GB (Beelink GTR9 Pro / GMKtec EVO-X2) ~180 GB/s eff · ~$1,985–$1,999 | used RTX 4090 24GB 1,008 GB/s · ~$2,700 |
| T3 Pro | Mac Studio M3 Ultra 96GB 819 GB/s · $3,999 | ASUS Ascent GX10 (GB10) 128GB 273 GB/s · ~$3,099 | RTX 5090 32GB 1,792 GB/s · ~$4,600 |
| T4 Research | MBP M5 Max 128GB 614 GB/s · ~$7,349 | 2× GB10 clustered → 256GB 273 GB/s · ~$6,200+ | dual RTX 3090 = 48GB 936 GB/s/card · ~$3,200 |
- Apple
- Mac mini M4 16–24GB120 GB/s · $799–$999
- Mini-PC
- —
- DIY NVIDIA
- RTX 5060 Ti 16GB448 GB/s · ~$1,400
- Apple
- MBP 14" M5 Pro 64GB307 GB/s · ~$3,100
- Mini-PC
- Strix Halo 128GB (Beelink GTR9 Pro / GMKtec EVO-X2)~180 GB/s eff · ~$1,985–$1,999
- DIY NVIDIA
- used RTX 4090 24GB1,008 GB/s · ~$2,700
- Apple
- Mac Studio M3 Ultra 96GB819 GB/s · $3,999
- Mini-PC
- ASUS Ascent GX10 (GB10) 128GB273 GB/s · ~$3,099
- DIY NVIDIA
- RTX 5090 32GB1,792 GB/s · ~$4,600
- Apple
- MBP M5 Max 128GB614 GB/s · ~$7,349
- Mini-PC
- 2× GB10 clustered → 256GB273 GB/s · ~$6,200+
- DIY NVIDIA
- dual RTX 3090 = 48GB936 GB/s/card · ~$3,200
What your box can actually run
Each model at its featured quant, against the four tiers: does it fit, and how fast does it decode? Speed is the bandwidth truth made literal — and it reads active params, which is why MoE models punch far above their size.
| model | T1 | T2 | T3 | T4 |
|---|---|---|---|---|
| # ── Gemma 4 ── | ||||
Gemma 4 E2B 2B · QAT UD-Q4_K_XL · 3 GB · multimodal · Apache-2.0 · source ↗ | ~120–448 tok/s | ~180–1008 tok/s | ~273–1792 tok/s | ~273–936 tok/s |
Gemma 4 E4B 4B · QAT UD-Q4_K_XL · 5 GB · multimodal · Apache-2.0 · source ↗ | ~60–224 tok/s | ~90–504 tok/s | ~137–896 tok/s | ~137–468 tok/s |
Gemma 4 12B 12B · QAT UD-Q4_K_XL · 7 GB · multimodal · Apache-2.0 · source ↗ | ~20–75 tok/s | ~30–168 tok/s | ~46–299 tok/s | ~46–156 tok/s |
Gemma 4 26B-A4B 26B (4B active) · QAT UD-Q4_K_XL · 15 GB · multimodal · Apache-2.0 · source ↗ | ~60–224 tok/s | ~90–504 tok/s | ~137–896 tok/s | ~137–468 tok/s |
Gemma 4 31B 31B · QAT UD-Q4_K_XL · 18 GB · multimodal · Apache-2.0 · source ↗ | — too big | ~12–65 tok/s | ~18–116 tok/s | ~18–60 tok/s |
| # ── Qwen3 ── | ||||
Qwen3 8B 8B · Q4_K_M · 6 GB · text · Apache-2.0 · source ↗ | ~30–112 tok/s | ~45–252 tok/s | ~68–448 tok/s | ~68–234 tok/s |
Qwen3 30B-A3B 30B (3B active) · Q4_K_M · 18 GB · text · Apache-2.0 · source ↗ | — too big | ~120–672 tok/s | ~182–1195 tok/s | ~182–624 tok/s |
| # ── Llama ── | ||||
Llama 8B 8B · Q4_K_M · 6 GB · text · Llama Community · source ↗ | ~30–112 tok/s | ~45–252 tok/s | ~68–448 tok/s | ~68–234 tok/s |
Llama 70B 70B · Q4_K_M · 42 GB · text · Llama Community · source ↗ | — too big | ~5–9 tok/s · 2/3 rigs | ~8–23 tok/s · 2/3 rigs | ~8–27 tok/s |
Llama 405B 405B · Q4_K_M · 230 GB · text · Llama Community · source ↗ | — too big | — too big | — too big | ~1 tok/s · 1/3 rigs |
| # ── DeepSeek ── | ||||
DeepSeek-V2-Lite 16B (2.4B active) · Q4_K_M · 10 GB · text · DeepSeek · source ↗ | ~100–373 tok/s | ~150–840 tok/s | ~228–1493 tok/s | ~228–780 tok/s |
| # ── OLMo 2 ── | ||||
OLMo 2 13B 13B · Q4_K_M · 8 GB · text · Apache-2.0 · source ↗ | ~18–69 tok/s | ~28–155 tok/s | ~42–276 tok/s | ~42–144 tok/s |
- T1
- ~120–448 tok/s
- T2
- ~180–1008 tok/s
- T3
- ~273–1792 tok/s
- T4
- ~273–936 tok/s
- T1
- ~60–224 tok/s
- T2
- ~90–504 tok/s
- T3
- ~137–896 tok/s
- T4
- ~137–468 tok/s
- T1
- ~20–75 tok/s
- T2
- ~30–168 tok/s
- T3
- ~46–299 tok/s
- T4
- ~46–156 tok/s
- T1
- ~60–224 tok/s
- T2
- ~90–504 tok/s
- T3
- ~137–896 tok/s
- T4
- ~137–468 tok/s
- T1
- — too big
- T2
- ~12–65 tok/s
- T3
- ~18–116 tok/s
- T4
- ~18–60 tok/s
- T1
- ~30–112 tok/s
- T2
- ~45–252 tok/s
- T3
- ~68–448 tok/s
- T4
- ~68–234 tok/s
- T1
- — too big
- T2
- ~120–672 tok/s
- T3
- ~182–1195 tok/s
- T4
- ~182–624 tok/s
- T1
- ~30–112 tok/s
- T2
- ~45–252 tok/s
- T3
- ~68–448 tok/s
- T4
- ~68–234 tok/s
- T1
- — too big
- T2
- ~5–9 tok/s · 2/3 rigs
- T3
- ~8–23 tok/s · 2/3 rigs
- T4
- ~8–27 tok/s
- T1
- — too big
- T2
- — too big
- T3
- — too big
- T4
- ~1 tok/s · 1/3 rigs
- T1
- ~100–373 tok/s
- T2
- ~150–840 tok/s
- T3
- ~228–1493 tok/s
- T4
- ~228–780 tok/s
- T1
- ~18–69 tok/s
- T2
- ~28–155 tok/s
- T3
- ~42–276 tok/s
- T4
- ~42–144 tok/s
▸ Why activeparams, not total, set speed — unpacked in the quantization & MoE depth-spike (coming with the curriculum).
The recommendation matrix
Cheapest 128 GB box in volume; 70B Q4 ~5–8 tok/s.
Windows-native, OCuLink eGPU escape hatch.
MLX just works for LoRA/QLoRA; 307 GB/s; ships now.
1,008 GB/s; cleanest CUDA path; best $/capability.
Cheapest GB10/DGX-class box; full CUDA, Linux-only.
819 GB/s — highest bandwidth shipping; 70B ~20–30 tok/s.
The bandwidth truth
Generating each token reads the active weights out of memory, so your tok/s ceiling ≈ bandwidth ÷ active-weight-bytes. Capacity decides whether a model fits; bandwidth decides how fast it runs once it fits — and cloud spot (A100 ~$4–8/hr) is the relief valve for the rare 70B full fine-tune.
Mid-2026 street prices; the 2026 RAM shortage adds volatility — re-verify at purchase.
Got the rig? Get the toolchain.
One command installs and version-pins your local AI stack. Then enroll on-chain and start module 01.
$ curl -fsSL mindpool.io/install | sh$ brew install --cask mindpool-labs/tap/mpl