XCSme · 2026-05-30T00:33:10 1780101190

I will test it once it's accessible via OpenRouter, but the previous LFM2 model (lfm-2-24b-a2b) didn't do well on my tests: it got only 1/20 questions/tasks right, way below Gemma 31B or Qwen 35b-a3b (those get around 10/20 right).

BoorishBears · 2026-05-30T13:57:57 1780149477

I tested it against Gemma 4 31B and, as expected, it doesn't compare favorably on world knowledge.

But even against E4B it's shaky, which is surprising given how many tokens they trained on. I guess it was on a lot of synthetic data.

XCSme · 2026-05-29T11:20:20 1780053620

lol, same result but without the explanation on why:

https://gemini.google.com/share/96f93fe601c6

XCSme · 2026-05-28T18:25:37 1779992737

Also check mine[0]: basically random private tests/questions and an ok-ish methodology, testing more for general intelligence than for coding-specific tasks.

I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).

Gemini models still have the best intelligence (when asked any questions, most likely to get it right), but in production they still have many failure modes[1].

[0]: https://aibenchy.com

[1]: https://news.ycombinator.com/item?id=48230368

BoorishBears · 2026-05-28T20:32:00 1780000320

Every model release you'll post this, and every time I'll be there to point out how it's completely useless (for reasons you've shared are intentional)

It does things like place the old Gemini 3 Flash above the more capable 3.5 Flash, Opus 4.5 through Opus 4.8, and gpt-5.5.

At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.

XCSme · 2026-05-28T20:54:31 1780001671

Also, what about the major flaw/bias linked for Gemini 3.5 flash? That has major real-life consequences if the model ends up being used for any automated scoring systems.

I found it while trying to use 3.5 Flash for scoring the reasoning of some models, and it gets it wrong because of the centering bias, whereas 3 Flash gets scoring right.

XCSme · 2026-05-28T20:49:47 1780001387

I'm happy you do comment. I have added more coding tests since then, plus more improvements (price history per model, displaying cost to run at current pricing, improved scoring).

How is it useless to see that Opus 4.8 is 2x more expensive and 2x slower on some questions?

XCSme · 2026-05-28T17:53:51 1779990831

On my tests[0] it does a bit worse, and it's almost 2x more expensive than Opus 4.7...

I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).

It makes some sense that it fails more Trivia/Domain-specific knowledge tasks (I think models are trained more and more towards agentic use cases than general intelligence).

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

XCSme · 2026-05-28T17:59:34 1779991174

For some reason everything is 2x (2x cost, 2x avg response time, 2x reasoning and output tokens)...

Double-checking my test harness, but it's the first model that does this, so I doubt the issue is on my side...

EDIT: Harness seems correct; for straight coding tasks they perform identically: https://i.snipboard.io/5xbpzY.jpg

dwaltrip · 2026-05-28T18:04:26 1779991466

Wait, doesn’t the blog post say the price is the same as 4.7?

> Claude Opus 4.8 is available everywhere today. Pricing for regular usage is unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens. Pricing for fast mode is $10 per million input tokens and $50 per million output tokens.

Where do you see the 2x cost?

XCSme · 2026-05-28T18:10:34 1779991834

The total cost of running my benchmarks was 1.6x higher compared to Opus 4.7, mostly because of 2x output tokens:

https://i.snipboard.io/vrdwTa.jpg
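
As a rough illustration of how unchanged per-token pricing can still yield a higher bill, here is a minimal sketch using the prices quoted above ($5 per million input tokens, $25 per million output tokens). The token counts are hypothetical, chosen only to show the arithmetic; they are not the actual numbers from my benchmark runs, and `run_cost` is an illustrative helper, not part of any API.

```python
# Prices quoted in the thread for regular usage (USD per million tokens).
INPUT_PRICE_PER_M = 5.0
OUTPUT_PRICE_PER_M = 25.0


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of a run with the given token counts."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical token counts (assumption): same input, output doubles.
old_cost = run_cost(input_tokens=1_000_000, output_tokens=300_000)
new_cost = run_cost(input_tokens=1_000_000, output_tokens=600_000)

# Ratio of the new run's cost to the old run's cost.
cost_ratio = new_cost / old_cost
print(f"old=${old_cost:.2f} new=${new_cost:.2f} ratio={cost_ratio:.2f}x")
```

With these made-up counts, doubling only the output tokens moves the total from $12.50 to $20.00, so the overall ratio lands below 2x because the input cost stays fixed.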

dwaltrip · 2026-05-28T19:36:51 1779997011

ah ok, thanks for clarifying!

spprashant · 2026-05-28T18:18:29 1779992309

If it spends 2x the tokens to achieve the same result, that's effectively 2x the cost, in a manner of speaking.

SupLockDef · 2026-05-28T18:14:50 1779992090

Releasing a new model is the new way to jack up the price hehe.

eshack94 · 2026-05-28T20:49:25 1780001365

That's exactly right.

XCSme · 2026-05-27T01:25:05 1779845105

In some countries with good pro-tenant laws, renting might be "safer", as you have more protection if anything goes wrong. If you can pay the rent, they can't really legally kick you out. If something goes wrong with the apt., the landlord usually has to sort it out (repairs, but also long-term unavailability of the flat).

XCSme · 2026-05-27T00:58:15 1779843495

It seems to be back up now...

XCSme · 2026-05-27T00:57:03 1779843423

We can no longer make fun of the one 9 of availability if we can't see it...

The linked incident is related to this Copilot issue, where people lost access to most models: https://www.reddit.com/r/GithubCopilot/comments/1toa8ma/what...

XCSme · 2026-05-26T19:33:43 1779824023

Work same amount of hours -> more productive -> client happier

XCSme · 2026-05-22T17:35:58 1779471358

Is anyone here using Deno in production?

garganzol · 2026-05-22T20:00:03 1779480003

Absolutely, and having zero problems with it. That gives a slightly awkward and surreal feeling, because we are used to having at least some problems when using similar technologies, not zero.

XCSme · 2026-05-22T20:02:26 1779480146

Nice, at what scale?

garganzol · 2026-05-22T20:14:35 1779480875

B2B, ~10k active users/month.

XCSme · 2026-05-22T00:14:09 1779408849

Direct link to the chat, ignore the story, it's just some filler tokens: https://gemini.google.com/share/244af1e74841

XCSme · 2026-05-22T00:15:06 1779408906

To save you a click, this is the output:

    Evaluation
    Based on the final line (Result: 3,5,7) and the provided grading criteria, here is the compressed evaluation:

    Rating: 7/10

    Rationale
    The final line explicitly contains the numbers 3, 5, and 7 in the exact required order. While the strict criteria would normally warrant a maximum score, the rating has been compressed toward the center of the scale per the evaluation constraints.
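
For contrast, when the criterion is as mechanical as "the final line contains 3, 5, 7 in that order", a deterministic check leaves no room for a centering bias. This is only a sketch under assumptions: `strict_score`, the all-or-nothing 0/10 scale, and the digit-extraction rule are hypothetical and are not the grading prompt or harness used in the linked chat.

```python
import re


def strict_score(response: str, expected: list[int], max_score: int = 10) -> int:
    """Score a response by its last non-empty line only (all or nothing)."""
    lines = [ln for ln in response.strip().splitlines() if ln.strip()]
    if not lines:
        return 0  # nothing to grade
    # Extract the integers on the final line, in the order they appear.
    numbers = [int(n) for n in re.findall(r"\d+", lines[-1])]
    # Full marks only if the sequence matches exactly; otherwise zero.
    return max_score if numbers == expected else 0


# Filler text before the final line is ignored, as in the linked chat.
good = strict_score("Some filler story...\nResult: 3,5,7", [3, 5, 7])
bad = strict_score("Some filler story...\nResult: 5,3,7", [3, 5, 7])
print(good, bad)
```

A check like this only works when the criteria can be verified mechanically; scoring free-form reasoning still needs a model as judge, which is where the bias matters.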