Which only appears relevant if you disregard critical differences like this:
> The GCC garbage collector GGC is only invoked explicitly. In contrast with many other garbage collectors, it is not implicitly invoked by allocation routines when a lot of memory has been consumed. [1]
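To make that distinction concrete, here is a toy mark-and-sweep heap in Python. It is a hypothetical sketch and has nothing to do with GGC's actual C API. With `threshold=None`, collection only happens when `collect()` is called explicitly, in the spirit of the quote. With a numeric threshold, `alloc()` triggers collection on its own, the way many other collectors do.

```python
class Obj:
    """A toy heap object that can hold references to other heap objects."""
    def __init__(self, name):
        self.name = name
        self.refs = []


class Heap:
    """Hypothetical toy heap; not modeled on any real collector's API."""
    def __init__(self, threshold=None):
        self.objects = []
        self.roots = []
        # threshold=None -> only collect when collect() is called (explicit).
        # threshold=N    -> alloc() collects once N objects exist (implicit).
        self.threshold = threshold
        self.collections = 0

    def alloc(self, name):
        if self.threshold is not None and len(self.objects) >= self.threshold:
            self.collect()  # implicit collection, triggered by allocation
        obj = Obj(name)
        self.objects.append(obj)
        return obj

    def collect(self):
        # Mark: find everything reachable from the registered roots.
        marked = set()
        stack = list(self.roots)
        while stack:
            obj = stack.pop()
            if id(obj) in marked:
                continue
            marked.add(id(obj))
            stack.extend(obj.refs)
        # Sweep: keep only the marked objects.
        self.objects = [o for o in self.objects if id(o) in marked]
        self.collections += 1
```

In the explicit mode, an object that is only held by a local variable and is not yet reachable from a root stays safe until you choose to call `collect()`. In the implicit mode, any allocation might sweep it away.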
There are many C++ programmers and we are not the same!
My original foray into GCs was building real-time ones, and the Fil-C GC is based on that work. I haven’t fully made it real-time-friendly (the few locks it has aren’t RT-friendly), but if I had more time I could make it give you hard guarantees.
It’s already fully concurrent and on-the-fly, so it won’t pause you.
Paged Attention is more of a low-level building block, aimed initially at avoiding duplication of shared KV-cache prefixes in large-batch inference. But you're right that it's quite related. The llama.cpp folks are still thinking about it, per a recent discussion from that project: https://github.com/ggml-org/llama.cpp/discussions/21961
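Here is roughly what the prefix-sharing part looks like. The KV cache is carved into fixed-size blocks, and each sequence keeps a block table that maps its logical blocks to physical ones. Sequences that share a prompt prefix point at the same physical blocks and bump a reference count instead of storing copies. This is a minimal Python sketch of only that bookkeeping, with hypothetical names. It contains no attention math and no llama.cpp or vLLM API. It also shares only full blocks, so it sidesteps the copy-on-write a real implementation needs for a partially filled shared block.

```python
BLOCK_SIZE = 4  # tokens per KV-cache block (tiny, for illustration)


class BlockAllocator:
    """Hands out physical KV-cache blocks and tracks how many sequences use each."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """Hypothetical sequence whose block_table maps logical -> physical blocks."""
    def __init__(self, allocator, tokens, prefix_of=None):
        self.allocator = allocator
        self.tokens = list(tokens)
        self.block_table = []
        shared = 0
        if prefix_of is not None:
            limit = min(len(self.tokens), len(prefix_of.tokens))
            # Reuse each full leading block whose tokens match the parent's.
            while (shared + 1) * BLOCK_SIZE <= limit:
                lo, hi = shared * BLOCK_SIZE, (shared + 1) * BLOCK_SIZE
                if self.tokens[lo:hi] != prefix_of.tokens[lo:hi]:
                    break
                self.block_table.append(allocator.share(prefix_of.block_table[shared]))
                shared += 1
        # Allocate fresh blocks for the unshared remainder.
        num_blocks = -(-len(self.tokens) // BLOCK_SIZE)  # ceiling division
        for _ in range(shared, num_blocks):
            self.block_table.append(allocator.allocate())

    def release_all(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
```

For example, two 12-token requests that share their first 10 tokens use 4 physical blocks instead of 6. The first two blocks are shared, and each request has its own third block.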
Because he should know better? Because it’s obviously a shit show but he keeps on being very vocal about his shit show? Because it’s annoying to have to see yet another delusional vibe coded project being hyped up instead of this forum being used to discuss actually industry relevant information?
> Because it’s annoying to have to see yet another delusional vibe coded project being hyped up instead of this forum being used to discuss actually industry relevant information?
It's industry-relevant. This is what the industry is now. All in two short years.
He's doing it in the open. It's instructive for us all either way.
> he keeps on being very vocal about his shit show?
I'm not really sure what this complaint is. You want someone doing something to not... write a blog about it?
> Because it’s annoying to have to see yet another delusional vibe coded project being hyped up instead of this forum being used to discuss actually industry relevant information?
I think I've seen around two posts: the original gastown one and then the gascity one. Is two posts in like a year too much, or am I missing a midday rush where the front page is all Yegge?
It seems pretty transparent that they are heavily resource-constrained (the training run for Claude 5.x, higher usage and growth than anticipated). I don’t disagree that their long play is monopolistic pricing, but what we’re observing seems better explained by a very tight compute budget. They are trying to optimize it so they can put as much as possible into next-gen experiments and training and stay competitive over the next six months to a year.
The limitation is efficiency and efficacy. If you have to add an extra layer of inference to every request, you’re hurting your bottom line, so the companies, which are compute-bound, have a strong incentive to squeeze everything into a single forward pass. It’s also not clear that a separate model smaller than the main model will perform better than simply training the main model to detect prompt injection. Both are probabilistic models with no structural way of distinguishing user input from malicious instructions.
Very excited for the 122b version, since its throughput is significantly better than the dense 27b's on my M4.