sam_1421's comments

sam_1421 · 2026-02-19T17:02:16 1771520536

Models are soon going to start benchmaxxing generating SVGs of pelicans on bikes

cbsks · 2026-02-19T17:34:59 1771522499

That’s Simon’s goal. “All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.”

https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

travisgriggs · 2026-02-19T18:34:22 1771526062

So once that's achieved, I wonder how well it deals with unexpected variations. E.g.

"Give me an illustration of a bicycle riding by a pelican"

"Give me an illustration of a bicycle riding over a pelican"

"Give me an illustration of a bicycle riding under a flying pelican"

So on and so forth. Or will it start to look like the Studio C sketch about Lobster Bisque: https://youtu.be/A2KCGQhVRTE

embedding-shape · 2026-02-19T17:03:57 1771520637

Soon? I'd be willing to bet it's been included in the training set for at least 6 months by now. Not so blatantly that it always generates perfect pelicans on bikes, but enough for the "minibench" to be less useful today than in the past.

Rudybega · 2026-02-19T20:53:28 1771534408

If only there were some way to test it, like swapping the two nouns in the sentence. Alas.

jsheard · 2026-02-19T17:05:46 1771520746

Simon's been doing this exact test for nearly 18 months now. If vendors want to benchmaxx it, then they've had more than enough time to do so already.

stri8ted · 2026-02-19T17:18:09 1771521489

Exactly. As far as I'm concerned, the benchmark is useless. It's way too easy and rewarding to train on it.

bonoboTP · 2026-02-19T19:33:31 1771529611

It's just an in-joke, he doesn't intend it as a serious benchmark anymore. I think it's funny.

Legend2440 · 2026-02-19T17:44:33 1771523073

Y'all are way too skeptical. No matter what cool thing AI does, you'll make up an excuse for how they must somehow be cheating.

toraway · 2026-02-19T19:14:56 1771528496

Jeff Dean literally featured it in a tweet announcing the model. Personally it feels absurd to believe they've put absolutely no thought into optimizing this type of SVG output given the disproportionate amount of attention devoted to a specific test for 1 yr+.

I wouldn't really even call it "cheating", since it has improved models' ability to generate artistic SVG imagery more broadly, but the days of this being an effective way to evaluate a model's "interdisciplinary" visual reasoning abilities have long since passed, IMO.

It's become yet another example in the ever-growing list of benchmaxxed targets whose original purpose was defeated by teaching to the test.

https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...

arcatech · 2026-02-19T18:49:56 1771526996

Or maybe you’re too trusting of companies who have already proven to not be trustworthy?

pixl97 · 2026-02-19T17:46:09 1771523169

I mean if you want to make your own benchmark, simply don't make it public and don't do it often. If your salamander on skis or whatever gets better with time it likely has nothing to do with being benchmaxxed.

ks2048 · 2026-02-19T19:25:36 1771529136

Forget the paperclip maximizer - AGI will turn the whole world into pelicans on bikes.