I've tried to use a local LLM on an M4 Pro machine and it's quite painful. Not s...

atwrk · 2026-03-03T14:40:17 1772548817

Local LLM inference is all about memory bandwidth, and an M4 Pro has only about the same bandwidth as a Strix Halo or DGX Spark. That's why the older Ultras are popular with the local LLM crowd.
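
To put a rough number on the bandwidth point: during decoding, each generated token has to stream the active weights from memory about once, so bandwidth divided by bytes read per token gives a ceiling on tokens per second. A minimal sketch, assuming 273 GB/s for the M4 Pro and 4-bit weights (0.5 bytes per parameter); both figures and the function name are illustrative assumptions, and real throughput will be lower once KV-cache reads and compute are counted.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             active_params_billions: float,
                             bytes_per_param: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model.

    Assumes every generated token reads all *active* weights once and
    ignores KV-cache traffic, compute time, and other overhead.
    """
    # Billions of params * bytes per param = GB streamed per token.
    gb_read_per_token = active_params_billions * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Assumed values, not measurements: 273 GB/s, 4-bit quantization.
dense_27b = decode_tokens_per_second(273, 27, 0.5)  # dense model: all 27B params active
moe_3b_active = decode_tokens_per_second(273, 3, 0.5)  # MoE: only ~3B params active per token

print(f"27B dense ceiling: {dense_27b:.1f} tok/s")
print(f"3B-active MoE ceiling: {moe_3b_active:.1f} tok/s")
```

Under these assumptions the dense 27B model tops out around 20 tok/s, while a mixture-of-experts model with about 3B active parameters has a far higher ceiling, which is one reason A3B-style models feel usable on this class of hardware.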

usagisushi · 2026-03-03T17:05:38 1772557538

Qwen 3.5 35B-A3B and 27B have changed the game for me. I expect we'll see something comparable to Sonnet 4.6 running locally sometime this year.

OtomotO · 2026-03-03T23:32:14 1772580734

This would be an absolute game changer for me. I am dictating this text right now on a local model, and I think this is the way to go. I want to have everything local. I'm not opposed to AI or LLMs in general, but I think that sending everything over the pond is a no-go. And even if the provider were European, I still wouldn't want to send everything to some data center. So I think this would be a good development, and I would even buy an Apple device for the first time since the iPod just for that.

prettyblocks · 2026-03-03T21:25:03 1772573103

Could be, but it likely won't be able to support the massive context window required for performance on par with Sonnet 4.6.

freeone3000 · 2026-03-03T14:53:48 1772549628

I’m super happy with it for embedding, image recognition, and semantic video segmentation tasks.

giancarlostoro · 2026-03-03T14:36:36 1772548596

What are the other specs, and how does your setup look? You need at least 24 GB of RAM to run models of 16 GB or smaller.
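
That rule of thumb works out to roughly 8 GB of headroom on top of the model file. A minimal sketch of the check, assuming the headroom covers the OS, other apps, and the KV cache; the 8 GB default is inferred from the 24 GB vs. 16 GB figures above and is not a measured requirement.

```python
def fits_in_ram(model_size_gb: float, ram_gb: float,
                reserved_gb: float = 8.0) -> bool:
    """Return True if the model leaves at least `reserved_gb` of RAM free.

    `reserved_gb` is an assumed allowance for the OS, other apps, and the
    KV cache; tune it for your own machine and context length.
    """
    return ram_gb - model_size_gb >= reserved_gb


print(fits_in_ram(16, 24))  # 16 GB model on a 24 GB machine
print(fits_in_ram(16, 16))  # same model on a 16 GB machine
```

Longer contexts grow the KV cache, so the reserved amount should go up with the context window you plan to use.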

SV_BubbleTime · 2026-03-03T15:00:11 1772550011

This is typically true.

And while it is stupid slow, you can run models off a hard drive or swap space. You wouldn’t do it normally, but it can be done to check an answer in one model versus another.

jazzyjackson · 2026-03-03T16:16:18 1772554578

Tokens per second is abysmal no matter how much RAM you have.

giancarlostoro · 2026-03-03T18:57:35 1772564255

Some models run worse than others, but I have gotten reasonable performance on my M4 Pro with 24 GB of RAM.

Hamuko · 2026-03-03T15:10:10 1772550610

48 GB MacBook Pro. All of the models I've tried have been slow and also offered terrible results.

giancarlostoro · 2026-03-03T18:59:12 1772564352

Try software called TG Pro, which lets you override fan settings. Apple likes to let your Mac burn in an inferno before the fans kick in. It gives me more consistent throughput. I have less RAM than you and I can run some smaller models just fine, with reasonable performance. GPT20b was one.

andoando · 2026-03-03T15:58:31 1772553511

Local LLMs are useful for stuff like tool calling

renewiltord · 2026-03-03T18:24:57 1772562297

What models are you using? I’ve found that SOTA Claude models outperform even GPT-5.2 so hard on this that it’s cheaper to just use Sonnet: the number of output tokens needed to solve a problem is so much lower that TCO is lower. I’m in SF, where home power is 54¢/kWh.
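
For anyone who wants to run the same comparison, here is a rough sketch of the per-task arithmetic. Only the 54¢/kWh rate comes from my bill; the power draw, run time, token count, and API price you plug in are your own assumptions, and the function names are made up for illustration. It also ignores hardware cost, which a real TCO figure would include.

```python
def local_energy_cost_usd(power_watts: float, seconds: float,
                          usd_per_kwh: float = 0.54) -> float:
    """Electricity cost of running a local model for `seconds` at `power_watts`."""
    kwh = power_watts * seconds / 3_600_000  # watt-seconds (joules) -> kWh
    return kwh * usd_per_kwh


def api_output_cost_usd(output_tokens: int,
                        usd_per_million_tokens: float) -> float:
    """Output-token cost for a hosted model at a given price per million tokens."""
    return output_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical task: every number below except the default power price is a placeholder.
local = local_energy_cost_usd(power_watts=150, seconds=600)
hosted = api_output_cost_usd(output_tokens=5_000, usd_per_million_tokens=15.0)
print(f"local electricity: ${local:.4f}, hosted output tokens: ${hosted:.4f}")
```

The point is that a model needing many more output tokens or retries multiplies the `seconds` term, so a cheaper-per-token option can still lose on cost per solved problem.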

Sonnet is so fast too. GPT-5.2 needs reasoning tuned up to get tool calling reliable and Qwen3 Coder Next wasn’t close. I haven’t tried Qwen3.5-A3B. Hearing rave reviews though.

If you’re successfully using some model, knowing that alone would be very helpful to me.