More

Yukonv · 2026-04-15T17:12:57 1776273177

It is possible but requires a very specific model design to utilize. As this reverse engineering effort has shown [0] "The ANE is not a GPU. It’s not a CPU. It’s a graph execution engine." To build one requires using a specific pipeline specifically for CoreML [1].

[0] https://maderix.substack.com/p/inside-the-m4-apple-neural-en... [1] https://developer.apple.com/documentation/coreml

jonathaneunice · 2026-04-16T15:37:37 1776353857

That's the best "what is ANE, really?" investigation / explanation I've seen. Directly lays out why LLMs aren't an ideal fit, its "convolution engine" architecture, the need for feeding ANE deep operation sequence plans / graphs (and the right data sizes) to get full performance, the fanciful nature of Apple's performance claims (~2x actually achievable, natch), and the (superior!) hard power gating... just _oodles_ of insight.

Yukonv · 2026-04-07T17:06:49 1775581609

Unsloth quantizations are available on release as well. [0] The IQ4_XS is a massive 361 GB with the 754B parameters. This is definitely a model your average local LLM enthusiast is not going to be able to run even with high end hardware.

[0] https://huggingface.co/unsloth/GLM-5.1-GGUF

zozbot234 · 2026-04-07T18:14:49 1775585689

SSD offload is always a possibility with good software support. Of course you might easily object that the model would not be "running" then, more like crawling. Still you'd be able to execute it locally and get it to respond after some time.

Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques where the possibility of SSD offload is planned for in advance when developing the architecture.

adrian_b · 2026-04-07T19:52:26 1775591546

For conversational purposes that may be too slow, but as a coding assistant this should work, especially if many tasks are batched, so that they may progress simultaneously through a single pass over the SSD data.

QuantumNomad_ · 2026-04-07T20:07:40 1775592460

Three hour coffee break while the LLM prepares scaffolding for the project.

pbhjpbhj · 2026-04-07T21:55:09 1775598909

Like computing used to be. When I first compiled a Linux kernel it ran overnight on a Pentium-S. I had little idea what I was doing, probably compiled all the modules by mistake.

stingraycharles · 2026-04-08T02:52:47 1775616767

I remember that time, where compiling Linux kernels was measured in hours. Then multi-core computing arrived, and after a few years it was down to 10 minutes.

With LLMs it feels more like the old punchcards, though.

drowsspa · 2026-04-07T22:47:30 1775602050

At least the compiler was free

adrian_b · 2026-04-08T07:27:16 1775633236

The point of doing local inference with huge models stored on an SSD is to do it free, even if slow.

fireant · 2026-04-08T23:28:32 1775690912

You are just trading opex for capex. Local GPUs aren't free.

adrian_b · 2026-04-09T10:04:15 1775729055

True, but this is not only a trade-off between opex and capex.

Local inference using open weight models provides guaranteed performance which will remain stable over time, and be available at any moment.

As many current HN threads show, depending on external AI inference providers is extremely risky, as their performance can be degraded unpredictably at any time or their prices can be raised at any time, equally unpredictably.

Being dependent on a subscription for your programming workflow is a huge bet, that you will gain more from a slightly higher quality of the proprietary models than you will lose if the service will be degraded in the future.

As the recent history has shown, many have already lost this bet.

I am not a gambler, so I have made my choice, which is local AI inference, using a variety of models depending on the task, i.e. both small models completely executable on relatively cheap GPUs (like the new Intel GPUs), medium models that need e.g. 128 GB on a CPU, and huge models that must be stored on fast SSDs (e.g. interleaved on multiple PCIe 5.0 SSDs).

Such a strategy is achievable with a modest capex, in the lower half of the 4-digit range.

fireant · 2026-04-09T19:58:02 1775764682

I agree in principle that more democratic compute = better and third parties introduce additional risk that is outside of your control. That said I just don't see it working economically - either you have an underpowered GPU (4-digit range) at which point you have weak model, or slow model, probably both weak and slow. Or you have expensive GPU cluster, but at that point you also need to consider utilization as you are probably not streaming tokens out 24/7 and at that point TCO is just drastically more expensive for self hosting.

Personally I hope we see a third way - strong open weight models hosted by variety of companies actually competing on price and 9s of availability. That way capex expensive GPUs are fully utilized and users can rent intelligence as a commodity.

There is a very apt analogy to virtual server hosting - hosting vps/shared web is a commodity, it does not make financial sense for most users to host their website on their own physical servers in their basements.

tempoponet · 2026-04-08T12:49:09 1775652549

Rather, Imagine you have 2-3 of these working 24/7 on top of what you're doing today. What does your backlog look like a month from now?

cyanydeez · 2026-04-07T20:44:53 1775594693

[flagged]

dcreater · 2026-04-07T21:42:08 1775598128

@dang

zozbot234 · 2026-04-07T20:16:52 1775593012

Batching many disparate tasks together is good for compute efficiency, but makes it harder to keep the full KV-cache for each in RAM. You could handle this in an emergency by dumping some of that KV-cache to storage (this is how prompt caching works too, AIUI) and offloading loads for that too, but that adds a lot more overhead compared to just offloading sparsely-used experts, since KV-cache is far more heavily accessed.

Yukonv · 2026-04-05T21:38:57 1775425137

With that you are taking a significant performance penalty and become severely I/O bottlenecked. I've been able to stream Qwen3.5-397B-A17B from my M5 Max (12 GB/s SSD Read) using the Flash MoE technique at the brisk pace of 10 tokens per second. As tokens are generated different experts need to be consulted resulting in a lot of I/O churn. So while feasible it's only great for batch jobs not interactive usage.

IceWreck · 2026-04-05T22:45:05 1775429105

> So while feasible it's only great for batch jobs not interactive usage.

I mean yeah true but depends on how big the model is. The example I gave (Qwen 3.5 35BA3B) was fitting a 35B Q4 K_M (say 20 GB in size) model in 12 GB VRAM. With a 4070Ti + high speed 32 GB DDR5 ram you can easily get 700 token/sec prompt processing and 55-60 token/sec generation which is quite fast.

On the other hand if I try to fit a 120B model in 96 GB of DDR5 + the same 12 GB VRAM I get 2-5 token/sec generation.

zozbot234 · 2026-04-05T23:03:13 1775430193

Your 120B model likely has way more active parameters, so it can probably only fit a few shared layers in the VRAM for your dGPU. You might be better off running that model on a unified memory platform, slower VRAM but a lot more of it.

IceWreck · 2026-04-06T21:10:34 1775509834

Yep, I understand I was giving an example to the person I was replying to.

zozbot234 · 2026-04-05T22:39:55 1775428795

10 tok/s is quite fine for chatting, though less so for interaction with agentic workloads. So the technique itself is still worthwhile for running a huge model locally.

Yukonv · 2026-04-03T15:02:35 1775228555

The latest release v0.3.2 has partial support, generation is supported but not all special tokens are handled. I've done some personal testing to add tool calling and <|channel> thinking support. https://github.com/Yukon/omlx

anonyfox · 2026-04-03T17:17:30 1775236650

awesome man, can’t wait! And just now checked it out and indeed 0.3.2 does already work for baseline chatting with mlx versions of Gemma 4 … downloading and comparing different variants right now!

Yukonv · 2026-04-02T23:46:49 1775173609

The model does have the format specified but there is no _one_ standard. For this model it’s defined in the [ tokenizer_config.json [0]. As for llama.cpp they seem to be using a more type safe approach to reading the arguments.

[0] https://huggingface.co/google/gemma-4-31B-it/blob/main/token...

stavros · 2026-04-03T00:21:51 1775175711

Hm, but surely there will be converters for such simple formats? I'm confused as to how there can be calling bugs when the model already includes the template.

Yukonv · 2026-03-31T07:03:48 1774940628

Good to see Ollama is catching up with the times for inference on Mac. MLX powered inference makes a big difference, especially on M5 as their graphs point out. What really has been a game changer for my workflow is using https://omlx.ai/ that has SSD KV cold caching. No longer have to worry about a session falling out of memory and needing to prefill again. Combine that with the M5 Max prefill speed means more time is spend on generation than waiting for 50k+ content window to process.

davesque · 2026-03-31T20:41:33 1774989693

Yeah omlx seems to me like the front runner right now for running MLX models locally in agent workflows (which depend heavily on caching).

Yukonv · 2026-03-23T19:12:39 1774293159

That’s exactly what I thought about. Getting my hands on an M5 Max this week and going to see hows Dan’s experiment performs with faster I/O. Also going to experiment with running active parameters at Q6 or Q8 since output is I/O bottlenecked there should room for higher accuracy compute.

anemll · 2026-03-23T19:19:23 1774293563

Check my repo, I had added some support for GUFF/untloth, Q3,Q5/Q8 https://github.com/Anemll/flash-moe/blob/iOS-App/docs/gguf-h...

3abiton · 2026-03-23T21:18:50 1774300730

To be fair, it's "possible" to run such setup with llama.cpp with ssd offload. It's just abysmal TG speeds. But it's possible.

Yukonv · on Sept 15, 2024

Another good option I’ve personally used if you want a smaller API surface with just Result and Maybe concepts is True Myth. https://true-myth.github.io/true-myth-csharp/

Yukonv · on Jan 15, 2024

Related, Intel was showing off Thunderbolt Share at CES[1]. Allows Thunderbolt 4/5 device-to-device transfer of files. Theoretical speeds in the 20Gbps and 40Gbps for Thunderbolt four and five respective.

One idea for why they were only able to reach 11Gbps is having only one Thunderbolt/USB4 controller[2], meaning the two USB4 ports split the 40Gbps PCIe lane. Throw in a full-duplex connection and you get 10Gbps in one direction.

[1] https://youtu.be/GqCwLjhb4YY?t=81 [2] Just a theory but seems like a sane assumption.

Yukonv · on Jan 11, 2024

Great find! Seems to be a common issue with games, found the same issue with trying to auto play PICO-8 cartrages on web.