VLM Latency on Apple Silicon: A Field Report on Tiers, Tokens, and One Sketchy Binary

mlx · vlm · latency · apple-silicon · on-device-ai · security · local-inference

Session transcript · Annotated

The Problem Statement

We process iPhone photos through a vision language model (VLM) on a Mac Studio M3 Ultra (96GB RAM). The goal: extract structured JSON — people, places, objects, relationships, wardrobe, mood — for each photo in a user’s library. Entirely local. Zero cloud.

The latency was 180 seconds per photo.

This post documents what we tried, what the token math actually looks like, which prefix caching servers work on Apple Silicon (spoiler: none of the new ones), and how a free inference server nearly made it onto our inference stack before a binary audit caught something worth flagging.


Token Math First

Before reaching for a faster model, we profiled where time was actually going. A single iPhone photo request breaks into two phases:

Prefill phase (image encoding + prompt tokenization):

| Component | Tokens | Notes |
|---|---|---|
| iPhone photo (4032×3024, max_pixels=12,845,056) | ~15,500 | mlx_vlm preprocessor, nearly every pixel |
| System prompt (extraction instructions) | ~2,500 | structured JSON schema + wardrobe examples |
| User prompt | ~180 | task framing |
| **Total prefill** | **~18,180** | |

Generation phase:

| Model | Generation speed | 800-token JSON response |
|---|---|---|
| Qwen2.5-VL-72B-4bit | ~10 tok/s | ~80s |
| Qwen2.5-VL-32B-4bit | ~14 tok/s | ~57s |
| Qwen2.5-VL-3B-4bit | ~120 tok/s | ~7s (fast prompt, ~300 tokens) |

The 72B model was spending ~180s total: roughly 100s on prefill (18K tokens at ~180 tok/s prefill speed on the M3 Ultra) and ~80s on generation. Image size was not the dominant issue — generation speed on a 72B model was.
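That breakdown is worth making explicit as arithmetic. A back-of-envelope latency model, using only the token counts and throughputs measured above (the ~180 tok/s prefill figure is implied by 18K tokens taking ~100s):

```python
# Back-of-envelope latency model for one 72B request.
# Token counts and throughputs are the measured numbers from this post.

PREFILL_TOKENS = 15_500 + 2_500 + 180   # image + system prompt + user prompt

def request_latency(prefill_tps: float, gen_tps: float,
                    output_tokens: int, cached_prefix: int = 0) -> float:
    """Seconds for one request; cached_prefix tokens skip the prefill pass."""
    prefill = (PREFILL_TOKENS - cached_prefix) / prefill_tps
    generation = output_tokens / gen_tps
    return prefill + generation

# 72B: ~180 tok/s prefill, ~10 tok/s generation, 800-token JSON output
total_72b = request_latency(180, 10, 800)                        # ~181s, matching the ~180s observed

# Same request with the 2,500-token system prompt served from a prefix cache
cached_72b = request_latency(180, 10, 800, cached_prefix=2_500)  # saves ~14s of prefill
```

The model also makes the model-swap logic obvious: generation dominates for the 72B, so cutting `gen_tps` losses (smaller model) pays off before any prefill optimization does.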

**Annotation (claude, high confidence):**

The system prompt is 2,500 tokens and gets re-encoded on every single request. Prefix caching would eliminate this cost entirely across repeated requests. At 18K total prefill tokens, a 2,500-token cached prefix saves ~14% of prefill time. Small now, meaningful at scale.


Experiment A: Model Swap (72B → 32B)

Hypothesis: Qwen2.5-VL-32B-4bit scores comparably to 72B on our specific extraction task while being meaningfully faster.

Background research finding (before running our own eval): On public benchmarks, 32B actually outperforms 72B on MMMU (70.0 vs 64.2) and is close on DocVQA (93.7 vs 96.4). We were paying for 72B quality we weren’t getting.

Test setup: 5 representative photos, same prompt (v7), same scoring rubric (LLM judge with 5 dimensions), gateway timeout 240s.

| Model | Avg Score | Avg Latency | RAM Required | Winner (per photo) |
|---|---|---|---|---|
| Qwen2.5-VL-72B-4bit | 75.5/100 | 86s | ~46GB | 3/5 photos |
| Qwen2.5-VL-32B-4bit | 74.4/100 | 60s | ~24GB | 2/5 photos |
| Delta | -1.1 pts (-1.5%) | -30% faster | -22GB freed | |

Decision (D035): 32B is the default model for L2 extraction. Do not revert to 72B without new eval data showing >3 point regression.


Experiment B: 3B Fast Tier (L1.5)

The bigger insight from the token math: you don’t need 800-token structured JSON for the user’s first impression of a photo. You need something useful, fast.

Hypothesis: A 3B model with a lightweight prompt (~500 tokens vs ~2,500) can deliver entity + context + relationship extraction in ~5 seconds — enough to populate the review deck while 32B runs async in the background.

Fast prompt design:

  • Removed wardrobe extraction (complex, high-token output)
  • Removed inferences and mood fields
  • Removed multi-person wardrobe examples from the prompt
  • Output target: entities, context, relationships only
  • ~300 token expected output vs ~800 for full extraction

Results (Qwen2.5-VL-3B-4bit, :8086, warm model):

| Metric | Value |
|---|---|
| Avg entities per photo | 4.2 |
| Avg latency (warm) | 1.0s |
| JSON valid | 100% (5/5 test photos) |
| Extraction completeness vs 32B | ~55% (expected — limited field set) |

Time to first useful result: 1.0 second.

The 4.2 entities per photo is enough. “7 people at a dinner table, evening, indoor” appears in ~1 second. The full wardrobe + relationship + inference extraction follows async over 60 seconds. Users see progressive enrichment, not a loading spinner.
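A validity gate of roughly this shape is what sits behind the 100% JSON-valid figure. The field names here are assumptions taken from the fast-tier output target above, not the production schema:

```python
import json

# Hypothetical fast-tier validity check. REQUIRED_KEYS mirrors the L1.5
# output target (entities, context, relationships); the real schema may differ.
REQUIRED_KEYS = {"entities", "context", "relationships"}

def is_valid_fast_output(raw: str) -> bool:
    """True if the model emitted parseable JSON with all required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS <= data.keys()

ok = is_valid_fast_output(
    '{"entities": ["person x7"], "context": "dinner, indoor, evening", "relationships": []}'
)
bad = is_valid_fast_output('{"entities": [')   # truncated generation fails the gate
```

Anything that fails the gate can fall back to the full L2 extraction rather than showing broken tags in the review deck.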

The tiered architecture that resulted:

```
L0: Import (0s)
  → metadata only: timestamp, GPS, file size
  → immediately available in UI

L1: On-device Vision (0.5s)
  → Apple Vision Framework: faces, scenes, text, GPS refinement
  → review deck shows immediately

L1.5: 3B VLM — Qwen2.5-VL-3B-4bit on :8086 (1.0s)
  → entities, context, basic relationships
  → review deck enriches in real-time

L2: 32B VLM — Qwen2.5-VL-32B-4bit on :8080 (60s)
  → wardrobe, deep relationships, inferences
  → writes to DB async, UI updates silently
```
Before: 180 seconds to see anything. After: 1 second to see something useful, 60 seconds to see everything.
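The ordering above reduces to one scheduling rule: start the slow tier immediately, but never let it block the fast ones. A minimal asyncio sketch of that rule — tier functions and sleep durations are placeholders standing in for the real extractors:

```python
import asyncio

# Placeholder tiers; sleeps stand in for real latencies
# (0.5s / 1.0s / 60s in production, scaled down here).
async def l1_vision(photo):          # Apple Vision Framework
    await asyncio.sleep(0.05)
    return {"tier": "L1"}

async def l1_5_fast_vlm(photo):      # Qwen2.5-VL-3B on :8086
    await asyncio.sleep(0.10)
    return {"tier": "L1.5"}

async def l2_full_vlm(photo):        # Qwen2.5-VL-32B on :8080
    await asyncio.sleep(0.30)
    return {"tier": "L2"}

async def process(photo, on_update):
    # L2 starts immediately but never blocks the review deck:
    l2 = asyncio.create_task(l2_full_vlm(photo))
    on_update(await l1_vision(photo))      # deck becomes visible
    on_update(await l1_5_fast_vlm(photo))  # deck enriches
    on_update(await l2)                    # silent late update

updates: list[dict] = []
asyncio.run(process("IMG_0001.HEIC", updates.append))
```

Each `on_update` call maps to a UI refresh, so the user sees L1 tags, then L1.5 entities, then the full L2 extraction, in that order, without ever waiting on the 32B model.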


Prefix Caching: Three Experiments, Zero Winners

The 2,500-token system prompt is a constant. Every photo request re-encodes it from scratch. Prefix caching — where the KV cache for a shared prompt prefix is computed once and reused — should eliminate this cost.
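Mechanically, the idea is simple, which is what makes the broken implementations below frustrating. A toy sketch of the reuse pattern — `compute_kv` is a stand-in for a real prefill pass, not actual attention math:

```python
import hashlib

# Toy prefix cache keyed by a hash of the prefix tokens.
kv_cache: dict[str, list[int]] = {}

def compute_kv(tokens: list[int]) -> list[int]:
    """Stand-in for a real prefill pass over `tokens`."""
    return [t * 2 for t in tokens]          # placeholder "KV states"

def prefill(prefix: list[int], suffix: list[int]) -> list[int]:
    key = hashlib.sha256(repr(prefix).encode()).hexdigest()
    if key not in kv_cache:
        kv_cache[key] = compute_kv(prefix)     # first request pays full cost
    return kv_cache[key] + compute_kv(suffix)  # later requests prefill only the suffix

system_prompt = list(range(2_500))          # the constant 2,500-token prefix
first = prefill(system_prompt, [7, 8, 9])
second = prefill(system_prompt, [4, 5, 6])  # prefix KV reused, not recomputed
```

The VLM-specific catch: the reusable prefix ends where the first vision token appears, because every photo produces different vision tokens. With the system prompt ahead of the image in the chat template, that still leaves the full 2,500 tokens cacheable.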

We evaluated three servers that claim prefix caching support for VLMs on Apple Silicon.

vllm-mlx v0.2.6

Expected behavior: Drop-in replacement for mlx_vlm.server with prefix caching for repeated system prompts.

Actual behavior:

```
IndexError: index 1 is out of bounds for dimension 0 with size 1
  File "vllm_mlx/engine/scheduler.py", line 847, in _schedule_mllm_requests
```

Both 32B and 3B fail. The MLLM scheduler (multimodal LLM scheduler) has a known bug in v0.2.6 specific to Qwen2.5-VL’s vision token layout: the preprocessor’s vision tokens yield a size-1 tensor along dimension 0, and the scheduler indexes position 1 into it. Not a configuration issue — a framework bug.

**Annotation (claude, medium confidence):**

Filed as a known issue in the vllm-mlx tracker. The fix is in the MLLM scheduler’s dimension handling for multi-image inputs. Monitoring the repo for the Qwen2.5-VL-specific patch — no timeline commitment from maintainers yet.

Result: FAIL. Do not use.


oMLX v0.2.19

Expected behavior: Alternative MLX serving framework with prefix caching and vision model support.

Actual behavior: Works initially, then degrades catastrophically under cache pressure.

| Request | Latency | Notes |
|---|---|---|
| Photo 1 (portrait) — warm | 7s | Promising |
| Photo 2 (same prompt, different image) | 12s | Cache miss, recomputing |
| Photo 3 (landscape, different prompt structure) | 72s | Cache pressure regression |

The cache hit rate drops to near-zero when consecutive requests have different image dimensions or prompt structures — which is every real user workflow. Different photos have different vision token counts. The cache fills, evicts, and the overhead of cache management makes it slower than stock mlx_vlm.server.

Result: FAIL. Worse than baseline under real workload.


Bodega Sensors Pro

This one has a longer story.

Bodega is a GUI app store for macOS that also bundles inference server functionality. It appeared in our research as a potential prefix caching server. The agent installed it.

We ran a static binary analysis before enabling the inference engine:

```shell
strings /Applications/Bodega\ Sensors\ Pro.app/Contents/MacOS/Bodega\ Sensors\ Pro \
  | grep -iE "telemetry|analytics|tracking|sentry|amplitude|mixpanel|ipapi|segment"
```

Results:

| Match | Keyword | Context |
|---|---|---|
| 1 | ipapi.co | IP geolocation API — external network call |
| 2–7 | telemetry, analytics, tracking (×3), sentry, amplitude | Could be dependency variable names or actual telemetry |

The ipapi.co reference is a live URL for an IP geolocation service. An inference server calling an external IP geolocation API is a hard no for our threat model. We process user photos locally specifically to avoid this class of data exposure.
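The same scan the shell pipeline performs can be scripted, which makes it easier to run as a standard pre-launch gate. A hedged Python equivalent of `strings | grep` — a heuristic screen, not a substitute for the dynamic audit:

```python
import re

# Keyword screen matching the shell pipeline above.
SUSPECT = re.compile(
    rb"telemetry|analytics|tracking|sentry|amplitude|mixpanel|ipapi|segment",
    re.IGNORECASE,
)
# Runs of printable ASCII, the same heuristic strings(1) uses.
PRINTABLE = re.compile(rb"[\x20-\x7e]{6,}")

def scan_binary(blob: bytes) -> list[bytes]:
    """Return printable strings in `blob` that match the suspect keywords."""
    return [s for s in PRINTABLE.findall(blob) if SUSPECT.search(s)]

# Synthetic blob for illustration — NOT the Bodega binary:
blob = b"\x00\x01https://ipapi.co/json\x00normal_symbol\x00SentryClient.start\x00"
hits = scan_binary(blob)   # flags the ipapi URL and the Sentry symbol
```

In practice you would `open(path, "rb").read()` the binary and fail the gate on any hit, then escalate to the network-monitoring audit described below.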

Decision: Inference engine remains disabled. Bodega is classified as “pending security audit” in our tool evaluation framework. The static analysis findings go into RRL/SECURITY_FRAMEWORK.md alongside the 3-phase tool evaluation checklist we built the same afternoon this came up.

**Annotation (claude, high confidence):**

The correct audit procedure: run security-audit-inference.sh on Bodega during a controlled test inference — monitor all outbound network connections for 60 seconds while sending one synthetic request. If ipapi.co or any external host receives traffic, the tool is disqualified. If traffic is zero and the 7 keyword matches are confirmed as dependency artifacts (variable names, not live telemetry calls), reconsider.

This wasn’t the story we expected to write about prefix caching. But it’s the more important one.

Result: UNRESOLVED. Audit required before any production use.


What the Prefix Caching Ecosystem Looks Like Right Now

For anyone else running Qwen2.5-VL on Apple Silicon and wondering if prefix caching is worth investigating:

| Server | Qwen2.5-VL Support | Status | Notes |
|---|---|---|---|
| mlx_vlm.server | ✅ Full | Stable | No prefix caching. This is your baseline. |
| vllm-mlx v0.2.6 | ❌ Broken | MLLM scheduler bug | `index 1 is out of bounds` |
| oMLX v0.2.19 | ⚠️ Partial | Unstable | Works warm, degrades under cache pressure |
| Bodega Sensors Pro | ❓ Unknown | Security audit pending | Static analysis flagged external network references |

The path forward is monitoring vllm-mlx for the Qwen2.5-VL MLLM scheduler fix. Based on open PR activity as of March 2026, a patch looks plausible in the near term — but that’s reading GitHub tea leaves, not a commitment.

For production VLM inference on Apple Silicon today: stick with mlx_vlm.server. The alternatives are either broken or introduce stability regressions. Check back in April.


What Actually Shipped

Beyond the model swap and L1.5 tier, two additional latency improvements shipped in the same session:

1. enqueueForReview moved before L2 extraction

Previously, photos entered the review deck only after full 32B extraction completed (~60s). The operator noticed this meant users stared at a loading state for a minute before seeing anything.

Fix: enqueueForReview() now fires after L1 (Apple Vision) completes, before L2 starts. Review deck shows within ~1s of photo import. L2 tags appear silently as they arrive.

2. resize_shape parameter added to DeepExtractor (unvalidated)

Added resize_shape=(1024, 1024) to the mlx_vlm.generate call for L2. This would cap image encoding to ~2,600 vision tokens instead of ~15,500 for full-resolution iPhone photos — but our eval corpus is 800px photos, below the resize threshold, so we have no measured result. The code is in, the claim isn’t. See Open Questions.


Latency Before and After

| Scenario | Before | After | Method |
|---|---|---|---|
| Time to first result (any tags) | 180s | 0.5s | L1 on-device (unchanged) |
| Time to useful semantic tags | 180s | 1.0s | L1.5 tier (3B model) |
| Time to full extraction | 180s | 60s | 32B replacing 72B |
| RAM headroom after model loads | ~10GB | ~32GB | 22GB freed by model swap |
| Cold-start false DOWN detection | ~40% false alarm rate | 0% | Process-aware health checks |

Learnings

L1: Token math before model shopping. We almost spent time testing a dozen different models before profiling where the time was going. Generation speed scales inversely with model size — a 3B model at ~120 tok/s will always beat a 72B at ~10 tok/s on long-output tasks, regardless of what the benchmarks say about quality.

L2: The user needs “useful” in seconds, not “complete.” This reframing unlocked the tiered architecture. A 32B model producing 800 tokens of wardrobe and mood data in 60 seconds is exactly right — but only if the user sees something in 1 second. The tier structure isn’t a performance optimization. It’s a UX insight encoded as infrastructure.

L3: Prefix caching for VLMs on Apple Silicon isn’t ready. vllm-mlx has a framework bug specific to multimodal models. oMLX has a cache pressure regression under real workloads. Do not use either in production today.

L4: Static binary analysis before running third-party inference servers. An inference server that phones home — even for something as benign as IP geolocation — runs adjacent to user photos. Our threat model doesn’t allow that, even if the data being sent isn’t the photos themselves. The strings pipe takes 10 seconds. Run it before launchctl load.

L5: resize_shape is unmeasured debt. We added the parameter but can’t validate it without real phone-resolution photos in the eval corpus. The 20-photo corpus is all 800px — below the 1024px resize threshold. Every optimization claim about image resizing is a hypothesis, not a result, until the corpus has real phone photos.

**Annotation (claude, high confidence):**

The metadata enrichment pipeline is the next high-leverage latency improvement that doesn’t depend on model speed. If 47 photos show the same shirt, the VLM doesn’t need to identify “Nike” from each one — cluster by CLIP embedding, crop the logo region, anonymize (strip EXIF, blur faces), and do one web lookup that propagates to all 47. This changes the question from “can we make the VLM faster?” to “can we call the VLM less?”
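The dedup idea reduces to greedy clustering over embeddings. A pure-Python sketch with toy 2-D vectors — in the real pipeline these would be CLIP embeddings, and the crop/anonymize/lookup steps are not shown:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cluster(embeddings: list[list[float]], threshold: float = 0.92) -> list[list[int]]:
    """Greedy clustering: each item joins the first cluster whose
    representative is within `threshold` cosine similarity."""
    clusters: list[tuple[list[float], list[int]]] = []
    for i, emb in enumerate(embeddings):
        for rep, members in clusters:
            if cosine(emb, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((emb, [i]))
    return [members for _, members in clusters]

# Two near-identical "shirt" embeddings land in one cluster; one lookup serves both.
groups = cluster([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
```

With 47 near-identical shirt crops in one cluster, the expensive step (one anonymized web lookup) runs once and its result propagates to every member — the “call the VLM less” move in miniature.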


Open Questions

  1. Does resize_shape=(1024, 1024) reduce 32B prefill latency from ~100s to ~20s on real phone photos? We have the code but not the data.
  2. Can the 3B model be fine-tuned on 32B extraction output to close the quality gap? (LoRA + MLX’s native fine-tuning infrastructure — backend/training/ already exists.)
  3. When does vllm-mlx ship the Qwen2.5-VL MLLM scheduler fix?
  4. What is Bodega actually calling ipapi.co for, and when?