VLM Latency on Apple Silicon: A Field Report on Tiers, Tokens, and One Sketchy Binary
The Problem Statement
We process iPhone photos through a vision language model (VLM) on a Mac Studio M3 Ultra (96GB RAM). The goal: extract structured JSON — people, places, objects, relationships, wardrobe, mood — for each photo in a user’s library. Entirely local. Zero cloud.
The latency was 180 seconds per photo.
This post documents what we tried, what the token math actually looks like, which prefix caching servers work on Apple Silicon (spoiler: none of the new ones), and how a free inference server nearly made it into our stack before a binary audit caught something worth flagging.
Token Math First
Before reaching for a faster model, we profiled where time was actually going. A single iPhone photo request breaks into two phases:
Prefill phase (image encoding + prompt tokenization):
| Component | Tokens | Notes |
|---|---|---|
| iPhone photo (4032×3024, max_pixels=12,845,056) | ~15,500 | mlx_vlm preprocessor, nearly every pixel |
| System prompt (extraction instructions) | ~2,500 | structured JSON schema + wardrobe examples |
| User prompt | ~180 | task framing |
| Total prefill | ~18,180 | |
Generation phase:
| Model | Generation speed | 800-token JSON response |
|---|---|---|
| Qwen2.5-VL-72B-4bit | ~10 tok/s | ~80s |
| Qwen2.5-VL-32B-4bit | ~14 tok/s | ~57s |
| Qwen2.5-VL-3B-4bit | ~120 tok/s | ~7s (fast prompt, ~300 tokens) |
The 72B model was spending ~180s total: roughly 100s on prefill (~18K tokens at ~180 tok/s effective prefill throughput on the M3 Ultra) and ~80s on generation. Image size was not the dominant issue — generation speed on a 72B model was.
The system prompt is 2,500 tokens and gets re-encoded on every single request. Prefix caching would eliminate this cost entirely across repeated requests. At 18K total prefill tokens, a 2,500-token cached prefix saves ~14% of prefill time. Small now, meaningful at scale.
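The arithmetic above is simple enough to encode. A minimal sketch of the latency model (the ~180 tok/s prefill figure is back-derived from the measured ~100s prefill, not separately benchmarked):

```python
def request_latency(prefill_tokens: int, prefill_tps: float,
                    output_tokens: int, gen_tps: float,
                    cached_prefix_tokens: int = 0) -> float:
    """Estimate end-to-end latency for one VLM request.

    cached_prefix_tokens models a prefix cache hit: those tokens
    skip the prefill phase entirely.
    """
    prefill_s = (prefill_tokens - cached_prefix_tokens) / prefill_tps
    gen_s = output_tokens / gen_tps
    return prefill_s + gen_s

# 72B, full prefill: ~18,180 prefill tokens, 800-token JSON output
baseline = request_latency(18_180, 180, 800, 10)

# Same request with the 2,500-token system prompt served from cache
cached = request_latency(18_180, 180, 800, 10, cached_prefix_tokens=2_500)

print(f"baseline: {baseline:.0f}s, with cached prefix: {cached:.0f}s")
# baseline: 181s, with cached prefix: 167s
```

The ~14s saved by a cached system prompt is the "~14% of prefill time" quoted above; it matters once the fleet is processing thousands of photos, not for any single request.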
Experiment A: Model Swap (72B → 32B)
Hypothesis: Qwen2.5-VL-32B-4bit scores comparably to 72B on our specific extraction task while being meaningfully faster.
Background research finding (before running our own eval): On public benchmarks, 32B actually outperforms 72B on MMMU (70.0 vs 64.2) and is close on DocVQA (93.7 vs 96.4). We were paying for 72B quality we weren’t getting.
Test setup: 5 representative photos, same prompt (v7), same scoring rubric (LLM judge with 5 dimensions), gateway timeout 240s.
| Model | Avg Score | Avg Latency | RAM Required | Winner (per photo) |
|---|---|---|---|---|
| Qwen2.5-VL-72B-4bit | 75.5/100 | 86s | ~46GB | 3/5 photos |
| Qwen2.5-VL-32B-4bit | 74.4/100 | 60s | ~24GB | 2/5 photos |
| Delta | −1.1 pts (−1.5%) | −26s (−30%) | −22GB | — |
Decision (D035): 32B is the default model for L2 extraction. Do not revert to 72B without new eval data showing >3 point regression.
Experiment B: 3B Fast Tier (L1.5)
The bigger insight from the token math: you don’t need 800-token structured JSON for the user’s first impression of a photo. You need something useful, fast.
Hypothesis: A 3B model with a lightweight prompt (~500 tokens vs ~2,500) can deliver entity + context + relationship extraction in ~5 seconds — enough to populate the review deck while 32B runs async in the background.
Fast prompt design:
- Removed wardrobe extraction (complex, high-token output)
- Removed inferences and mood fields
- Removed multi-person wardrobe examples from the prompt
- Output target: `entities`, `context`, `relationships` only (~300-token expected output vs ~800 for full extraction)
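One way to enforce the reduced output contract is a strict allowlist check on the parsed JSON. A sketch — the field names follow the tier description above, but the validator itself is ours, not part of mlx_vlm:

```python
import json

FAST_TIER_FIELDS = {"entities", "context", "relationships"}

def validate_fast_extraction(raw: str) -> dict:
    """Parse the 3B model's response and reject anything outside
    the fast-tier schema (no wardrobe, mood, or inference fields)."""
    data = json.loads(raw)  # raises ValueError on invalid JSON
    extra = set(data) - FAST_TIER_FIELDS
    if extra:
        raise ValueError(f"unexpected fields in fast-tier output: {extra}")
    missing = FAST_TIER_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fast-tier fields: {missing}")
    return data

sample = ('{"entities": ["7 people", "dinner table"],'
          ' "context": "indoor, evening", "relationships": []}')
print(validate_fast_extraction(sample)["context"])  # indoor, evening
```

Rejecting extra fields is deliberate: if the 3B model starts emitting wardrobe data, it is ignoring the fast prompt and burning tokens it shouldn't.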
Results (Qwen2.5-VL-3B-4bit, :8086, warm model):
| Metric | Value |
|---|---|
| Avg entities per photo | 4.2 |
| Avg latency (warm) | 1.0s |
| JSON valid | 100% (5/5 test photos) |
| Extraction completeness vs 32B | ~55% (expected — limited field set) |
Time to first useful result: 1.0 second.
The 4.2 entities per photo is enough. “7 people at a dinner table, evening, indoor” appears in ~1 second. The full wardrobe + relationship + inference extraction follows async over 60 seconds. Users see progressive enrichment, not a loading spinner.
The tiered architecture that resulted:
```
L0: Import (0s)
  → metadata only: timestamp, GPS, file size
  → immediately available in UI
L1: On-device Vision (0.5s)
  → Apple Vision Framework: faces, scenes, text, GPS refinement
  → review deck shows immediately
L1.5: 3B VLM — Qwen2.5-VL-3B-4bit on :8086 (1.0s)
  → entities, context, basic relationships
  → review deck enriches in real-time
L2: 32B VLM — Qwen2.5-VL-32B-4bit on :8080 (60s)
  → wardrobe, deep relationships, inferences
  → writes to DB async, UI updates silently
```
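The tier ordering can be sketched as an async pipeline. The tier functions below are stubs with stand-in delays, not our actual implementation; the point is the control flow: slow tiers start immediately but never block fast tiers from reaching the UI.

```python
import asyncio

# Stub tiers; real versions call Apple Vision and the mlx_vlm servers.
async def l0_import(pid):
    return {"tier": "L0", "fields": "timestamp, GPS, file size"}

async def l1_vision(pid):
    return {"tier": "L1", "fields": "faces, scenes, text"}

async def l1_5_fast(pid):
    await asyncio.sleep(0.01)  # stands in for ~1s on the 3B model
    return {"tier": "L1.5", "fields": "entities, context, relationships"}

async def l2_deep(pid):
    await asyncio.sleep(0.05)  # stands in for ~60s on the 32B model
    return {"tier": "L2", "fields": "wardrobe, deep relationships, inferences"}

async def run_pipeline(pid, updates):
    updates.append(await l0_import(pid))       # visible immediately
    updates.append(await l1_vision(pid))       # review deck opens here
    fast = asyncio.create_task(l1_5_fast(pid))
    deep = asyncio.create_task(l2_deep(pid))   # starts now, lands later
    updates.append(await fast)                 # useful tags at ~1s
    updates.append(await deep)                 # full extraction, async

updates = []
asyncio.run(run_pipeline("IMG_0001", updates))
print([u["tier"] for u in updates])  # ['L0', 'L1', 'L1.5', 'L2']
```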
Before: 180 seconds to see anything. After: 1 second to see something useful, 60 seconds to see everything.
Prefix Caching: Three Experiments, Zero Winners
The 2,500-token system prompt is a constant. Every photo request re-encodes it from scratch. Prefix caching — where the KV cache for a shared prompt prefix is computed once and reused — should eliminate this cost.
We evaluated three servers that claim prefix caching support for VLMs on Apple Silicon.
vllm-mlx v0.2.6
Expected behavior: Drop-in replacement for mlx_vlm.server with prefix caching for repeated system prompts.
Actual behavior:
```
IndexError: index 1 is out of bounds for dimension 0 with size 1
  File "vllm_mlx/engine/scheduler.py", line 847, in _schedule_mllm_requests
```
Both 32B and 3B fail. The MLLM scheduler (multimodal LLM scheduler) has a known bug in v0.2.6 specific to Qwen2.5-VL’s vision token layout: the preprocessor yields a size-1 tensor along a dimension the scheduler then indexes at position 1, which is exactly the IndexError above. Not a configuration issue — a framework bug.
Filed as a known issue in the vllm-mlx tracker. The fix is in the MLLM scheduler’s dimension handling for multi-image inputs. Monitoring the repo for the Qwen2.5-VL-specific patch — no timeline commitment from maintainers yet.
Result: FAIL. Do not use.
oMLX v0.2.19
Expected behavior: Alternative MLX serving framework with prefix caching and vision model support.
Actual behavior: Works initially, then degrades catastrophically under cache pressure.
| Request | Latency | Notes |
|---|---|---|
| Photo 1 (portrait) — warm | 7s | Promising |
| Photo 2 (same prompt, different image) | 12s | Cache miss, recomputing |
| Photo 3 (landscape, different prompt structure) | 72s | Cache pressure regression |
The cache hit rate drops to near-zero when consecutive requests have different image dimensions or prompt structures — which is every real user workflow. Different photos have different vision token counts. The cache fills, evicts, and the overhead of cache management makes it slower than stock mlx_vlm.server.
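The miss pattern follows from how Qwen2.5-VL accounts for vision tokens: roughly one token per 28×28 pixel block (14px patches merged 2×2), so any change in image dimensions changes the token sequence after the image, and the cached prefix stops matching past the system prompt. A back-of-envelope sketch; the ceil approximation ignores the preprocessor's exact resizing, so treat counts as approximate:

```python
import math

def vision_tokens(width: int, height: int, block: int = 28) -> int:
    """Approximate Qwen2.5-VL vision token count: one token per
    28x28 pixel block (14px patches merged 2x2)."""
    return math.ceil(width / block) * math.ceil(height / block)

for name, (w, h) in {
    "iPhone landscape": (4032, 3024),
    "iPhone portrait":  (3024, 4032),
    "screenshot":       (1170, 2532),
}.items():
    print(f"{name}: {vision_tokens(w, h):,} tokens")
```

Full-resolution iPhone photos land at ~15.5K tokens, matching the prefill table above; a screenshot lands somewhere else entirely. Every dimension change is a new token budget and a new cache entry, which is why the cache churns under any real photo library.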
Result: FAIL. Worse than baseline under real workload.
Bodega Sensors Pro
This one has a longer story.
Bodega is a GUI app store for macOS that also bundles inference server functionality. It appeared in our research as a potential prefix caching server. The agent installed it.
We ran a static binary analysis before enabling the inference engine:
```shell
strings /Applications/Bodega\ Sensors\ Pro.app/Contents/MacOS/Bodega\ Sensors\ Pro \
  | grep -iE "telemetry|analytics|tracking|sentry|amplitude|mixpanel|ipapi|segment"
```
Results:
| Match | Keyword | Context |
|---|---|---|
| 1 | ipapi.co | IP geolocation API — external network call |
| 2-7 | telemetry, analytics, tracking (×3), sentry, amplitude | Could be dependency variable names or actual telemetry |
The ipapi.co reference is a live URL for an IP geolocation service. An inference server calling an external IP geolocation API is a hard no for our threat model. We process user photos locally specifically to avoid this class of data exposure.
Decision: Inference engine remains disabled. Bodega is classified as “pending security audit” in our tool evaluation framework. The static analysis findings go into RRL/SECURITY_FRAMEWORK.md alongside the 3-phase tool evaluation checklist we built the same afternoon this came up.
The correct audit procedure: run security-audit-inference.sh on Bodega during a controlled test inference — monitor all outbound network connections for 60 seconds while sending one synthetic request. If ipapi.co or any external host receives traffic, the tool is disqualified. If traffic is zero and the 7 keyword matches are confirmed as dependency artifacts (variable names, not live telemetry calls), reconsider.
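The monitoring half of that procedure doesn't need much tooling. A sketch built on `lsof` sampling — illustrative only, not the actual security-audit-inference.sh, and the host classification here is deliberately crude (loopback and RFC1918 ranges only):

```python
import re
import subprocess
import time

# Loopback and private ranges are fine; anything else is an external call.
PRIVATE = re.compile(r"->(127\.|10\.|192\.168\.|172\.(1[6-9]|2\d|3[01])\.|\[?::1)")

def external_hosts(lsof_lines):
    """Return remote endpoints that are neither loopback nor RFC1918."""
    hosts = set()
    for line in lsof_lines:
        m = re.search(r"->(\S+)", line)
        if m and not PRIVATE.search(line):
            hosts.add(m.group(1))
    return hosts

def audit(process_name: str, seconds: int = 60):
    """Sample open TCP connections for `seconds` while a synthetic
    request runs; any external endpoint disqualifies the tool."""
    seen = set()
    deadline = time.time() + seconds
    while time.time() < deadline:
        out = subprocess.run(
            ["lsof", "-i", "TCP", "-n", "-P", "-c", process_name],
            capture_output=True, text=True).stdout
        seen |= external_hosts(out.splitlines())
        time.sleep(1)
    return seen  # empty set == pass
```

If `audit("Bodega")` returns anything — ipapi.co or otherwise — the tool is out, regardless of what the traffic contains.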
This wasn’t the story we expected to write about prefix caching. But it’s the more important one.
Result: UNRESOLVED. Audit required before any production use.
What the Prefix Caching Ecosystem Looks Like Right Now
For anyone else running Qwen2.5-VL on Apple Silicon and wondering if prefix caching is worth investigating:
| Server | Qwen2.5-VL Support | Status | Notes |
|---|---|---|---|
| mlx_vlm.server | ✅ Full | Stable | No prefix caching. This is your baseline. |
| vllm-mlx v0.2.6 | ❌ Broken | MLLM scheduler bug | index 1 is out of bounds |
| oMLX v0.2.19 | ⚠️ Partial | Unstable | Works warm, degrades under cache pressure |
| Bodega Sensors Pro | ❓ Unknown | Security audit pending | Static analysis flagged external network references |
The path forward is monitoring vllm-mlx for the Qwen2.5-VL MLLM scheduler fix. Based on open PR activity as of March 2026, a patch looks plausible in the near term — but that’s reading GitHub tea leaves, not a commitment.
For production VLM inference on Apple Silicon today: stick with mlx_vlm.server. The alternatives are either broken or introduce stability regressions. Check back in April.
What Actually Shipped
Beyond the model swap and L1.5 tier, two additional latency improvements shipped in the same session:
1. enqueueForReview moved before L2 extraction
Previously, photos entered the review deck only after full 32B extraction completed (~60s). The operator noticed this meant users stared at a loading state for a minute before seeing anything.
Fix: enqueueForReview() now fires after L1 (Apple Vision) completes, before L2 starts. Review deck shows within ~1s of photo import. L2 tags appear silently as they arrive.
2. resize_shape parameter added to DeepExtractor (unvalidated)
Added resize_shape=(1024, 1024) to the mlx_vlm.generate call for L2. This would cap image encoding to ~2,600 vision tokens instead of ~15,500 for full-resolution iPhone photos — but our eval corpus is 800px photos, below the resize threshold, so we have no measured result. The code is in, the claim isn’t. See Open Questions.
Latency Before and After
| Scenario | Before | After | Method |
|---|---|---|---|
| Time to first result (any tags) | 180s | 0.5s | L1 on-device (unchanged) |
| Time to useful semantic tags | 180s | 1.0s | L1.5 tier (3B model) |
| Time to full extraction | 180s | 60s | 32B replacing 72B |
| RAM headroom after model loads | ~10GB | ~32GB | 22GB freed by model swap |
| Cold-start false DOWN detection | ~40% false alarm rate | 0% | Process-aware health checks |
Learnings
L1: Token math before model shopping. We almost spent time testing a dozen different models before profiling where the time was going. Generation speed scales inversely with model size: a 3B model at 120 tok/s will always beat a 72B at 10 tok/s on long-output tasks, regardless of what the benchmarks say about quality.
L2: The user needs “useful” in seconds, not “complete.” This reframing unlocked the tiered architecture. A 32B model producing 800 tokens of wardrobe and mood data in 60 seconds is exactly right — but only if the user sees something in 1 second. The tier structure isn’t a performance optimization. It’s a UX insight encoded as infrastructure.
L3: Prefix caching for VLMs on Apple Silicon isn’t ready. vllm-mlx has a framework bug specific to multimodal models. oMLX has a cache pressure regression under real workloads. Do not use either in production today.
L4: Static binary analysis before running third-party inference servers. An inference server that phones home — even for something as benign as IP geolocation — runs adjacent to user photos. Our threat model doesn’t allow that, even if the data being sent isn’t the photos themselves. The strings pipe takes 10 seconds. Run it before launchctl load.
L5: resize_shape is unmeasured debt. We added the parameter but can’t validate it without real phone-resolution photos in the eval corpus. The 20-photo corpus is all 800px — below the 1024px resize threshold. Every optimization claim about image resizing is a hypothesis, not a result, until the corpus has real phone photos.
The metadata enrichment pipeline is the next high-leverage latency improvement that doesn’t depend on model speed. If 47 photos show the same shirt, the VLM doesn’t need to identify “Nike” from each one — cluster by CLIP embedding, crop the logo region, anonymize (strip EXIF, blur faces), and do one web lookup that propagates to all 47. This changes the question from “can we make the VLM faster?” to “can we call the VLM less?”
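The clustering half of that idea fits in a few lines. A sketch assuming unit-scale CLIP embeddings are already computed; the greedy threshold clustering below is a placeholder for whatever method actually ships, and the point is one lookup per cluster, not 47:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_clusters(embeddings, threshold=0.9):
    """Assign each photo to the first cluster whose representative
    is within the similarity threshold; one cluster = one web lookup."""
    reps, clusters = [], []
    for i, emb in enumerate(embeddings):
        for j, rep in enumerate(reps):
            if cosine(emb, rep) >= threshold:
                clusters[j].append(i)
                break
        else:
            reps.append(emb)
            clusters.append([i])
    return clusters

# 47 near-identical "same shirt" crops plus 1 outlier -> 2 lookups, not 48
same = [[1.0, 0.01 * k, 0.0] for k in range(47)]
other = [[0.0, 0.0, 1.0]]
print(len(greedy_clusters(same + other)))  # 2
```

Each cluster then gets one anonymized lookup (strip EXIF, blur faces, crop the logo region), and the result propagates to every member.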
Open Questions
- Does `resize_shape=(1024, 1024)` reduce 32B prefill latency from ~100s to ~20s on real phone photos? We have the code but not the data.
- Can the 3B model be fine-tuned on 32B extraction output to close the quality gap? (LoRA + MLX’s native fine-tuning infrastructure; backend/training/ already exists.)
- When does vllm-mlx ship the Qwen2.5-VL MLLM scheduler fix?
- What is Bodega actually calling ipapi.co for, and when?