There's also this seven-week-old example [0] (linked in the essay) of ChatGPT *v...

simianwords · 2026-04-08T18:58:20 1775674700

This is a classic case of not using the proper version. Use the thinking version gpt5.4 (text) and tell me if it bullshits.

Surely you must be able to find at least one example, no?

simoncion · 2026-04-08T19:09:56 1775675396

To be clear, is your assertion that aphyr was also not using the proper version? If that is your assertion, do tell me how you've come by that information.

(You did notice that the author of the essay and the author of the video I linked to are not the same person, and that neither of them share a nym with me, yes?)

simianwords · 2026-04-08T19:16:59 1775675819

Hi, my position on the issue is that LLMs are powerful but may make mistakes in long-context problems like coding (which the harness solves with feedback). But they make close to no (undergrad-level) mistakes on questions that fit in 2-3 pages. For you personally: do you believe me on this specific part about 2-3 pages?

I don't know what aphyr did and tbh his whole screed on LLMs makes me feel he didn't use them properly, or is at least coming from a bad-faith angle.

That's why I'm asking you (and others). Please come up with a text prompt spanning < 4 pages and let's see if it bullshits.

Surely the implication of such a screed is that it should be super simple to find at least one example of it clearly bullshitting in my constraint, no? Or am I interpreting the post in a bad faith way?

simoncion · 2026-04-08T19:29:12 1775676552

Neat.

So, despite the fact that it looks like you have to pay for ChatGPT Voice mode with video, [0] it doesn't count as an

  example of it bullshitting on ChatGPT (paid version)

That is, father_phi's use of what seems to be a paid version of ChatGPT to have a bullshit-filled conversation that definitely spans less than four pages doesn't count?

[0] The page at [1] declares that the video feature is "Available in ChatGPT Plus, Pro, Business, Enterprise, and Edu on mobile"

[1] <https://chatgpt.com/features/voice-with-video/>

simianwords · 2026-04-08T19:35:42 1775676942

Let's stick to my challenge please - thinking version, find bullshit. If you can't, that's ok. Do you accept then that under the constraints the thinking version doesn't produce bullshit?

simoncion · 2026-04-08T19:49:55 1775677795

Given aphyr's vocation (and how very lucrative it is), and how years and years of his writing indicates that he's very devoted to getting a correct and complete answer when investigating a question, I find it hard to believe that he's not using a paid version of the LLMs. If I knew him, I'd ask and verify, but I don't, so I won't.

> Let's stick to my challenge please...

I did. Your challenge was literally:

  If it bullshits so much, you wouldn't have a problem giving me an example of it bullshitting on ChatGPT (paid version)? Lets take any example of a text prompt fitting a few pages - it may be a question in science or math or any domain. Can you get it to bullshit?

father_phi's two-sentence question about whether one can use a cup that's closed at the top and open at the bottom definitely counts. Given what I've mentioned about aphyr above, I expect he has already run your challenge on the fanciest-available version and reported on the results in the essay under discussion.

simianwords · 2026-04-08T19:53:47 1775678027

> Use the thinking version gpt5.4 (text) and tell me if it bullshits

This was what I said. Text! Despite me specifically asking for text, you've shown a voice example. Not sure why?

I believe you and I agree that GPT 5.4 thinking on text that fits < 4 pages never bullshits? Then we are good!

If we agree on this, I think the post doesn't capture this in spirit.

simoncion · 2026-04-08T19:59:24 1775678364

> This was what I said. Text!

No, that's what you said after I provided an example of paid ChatGPT emitting complete bullshit from a two-sentence prompt.

The challenge you issued is at [0].

[0] <https://news.ycombinator.com/item?id=47692592>

simianwords · 2026-04-08T20:02:25 1775678545

> If it bullshits so much, you wouldn't have a problem giving me an example of it bullshitting on ChatGPT (paid version)? Lets take any example of a text prompt fitting a few pages - it may be a question in science or math or any domain. Can you get it to bullshit?

I have clearly written text prompt here. And I repeated it a few times. It’s not my fault you didn’t read it. You are coming across as a bit of a bad-faith arguer.

In any case, you agree that under these constraints bullshitting doesn’t exist?

simoncion · 2026-04-08T20:07:40 1775678860

> I have clearly written text prompt here.

How do you think the "voice" interface works? It runs speech-to-text on the input and turns the input into text. The LLMs don't decode voice, they work on text.

You can see this process in action on many of father_phi's videos.

Regardless, I expect that aphyr's reported results are on the very latest publicly-available ChatGPT models.

simianwords · 2026-04-08T20:11:37 1775679097

Very bad faith arguments. I clearly said text and you disregarded it multiple times and you are still arguing.

You've still not given me a single example of 5.4 thinking bullshitting in text. It shows a lot that you have ignored this multiple times. Unfortunate!

simoncion · 2026-04-08T20:22:31 1775679751

I'm not sure why you're ignoring aphyr's reports. I'm also unsure why you're ignoring my original statement that having the text of the conversation that led ChatGPT to bullshit is entirely irrelevant, as being unable to repro the report is even worse for ChatGPT than being able to repro would be.

shrug

simianwords · 2026-04-08T20:25:44 1775679944

I specified text just to ignore the voice one because it uses 4o-mini underneath. And it's kinda stupid to keep ignoring that and saving face now - reconsider this approach.

I believe this is the 5th time I'm asking this: you are not able to produce a _single_ counter example for my challenge? After all this surely I can get a direct acknowledgement here.

simoncion · 2026-04-08T20:47:15 1775681235

> you are not able to produce a _single_ counter example for my challenge?

I have. For both your original challenge and your updated one.

Consider:

1) AFAICT, there's no way to tell what version of the model was used to produce the output in a ChatGPT share link.

2) You don't appear to believe my assertions that aphyr is almost certainly paying for and using the latest version of the LLMs available, and that he's faithfully reporting his interactions with the LLMs.

3) Because of #2, I expect that you won't believe me if I report that I've more-or-less reproduced father_phi's results about the cup that's sealed on the top and open on the bottom on the very latest only-available-for-pay ChatGPT model.

3a) You might attempt to check my report, but I'd be shocked if you'd consider a failure to reproduce my results to be a significant strike against ChatGPT. I'd think it's more likely that you'd either call me a liar, or tell me that I must have had some setting wrong somewhere.

3b) Even if you told me to share the ChatGPT chat that proved my assertion, #1 -combined with your demeanor throughout this conversation- tells me that you'd almost certainly claim that I was using an inferior version of the model and was lying to you.

simianwords · 2026-04-08T21:02:49 1775682169

Haha ok. So still no example?

The GPT shared link shows a "thought for" which indicates using the latest thinking model. You may try that.

What you can do is this: submit a prompt that clearly makes GPT hallucinate.

You may secretly use a worse model. You may use a system prompt that deliberately gives wrong answers. But I'm going to assume you won't go that far.

We can leave it to the public to decide whether this is a legitimate counter example or not and whether it can really be reproduced. Shall we try that? I'm guessing you won't but worth a shot!

simoncion · 2026-04-08T21:53:24 1775685204

You weren't paying much attention to the "Consider:" part of my previous comment.

You don't believe that a well-paid, very careful, high-integrity member of the computer safety community has -on multiple occasions- encountered actual, sustained bullshitting from the latest-available for-pay version of ChatGPT. You don't accept either this fellow's reports or my informed assessment of his computing situation as truthful and accurate. On top of that, your goalpost-shifting and general demeanor throughout this conversation simply don't give me the impression that you've much integrity. I'm not spending the equivalent of ten-to-twenty six-packs to reproduce aphyr's work and -given the evidence I have before me- have you reject that, as well.

200 USD is a lot of money to throw away to "win" an Internet argument with a stranger who refuses to accept evidence presented by someone known to be careful, scrupulous, and honest.

simianwords · 2026-04-08T22:05:17 1775685917

> On top of that, your goalpost-shifting and general demeanor throughout this conversation simply don't give me the impression that you've much integrity. I'm not spending the equivalent of ten-to-twenty six-packs to reproduce aphyr's work and -given the evidence I have before me- have you reject that, as well.

Lol what goal post did I move? I said text only and you rejected it. You can present the example here and let the public judge it - even if my integrity is compromised. I'm allowing you to do it.

> 200 USD is a lot of money to throw away to "win" an Internet argument with a stranger who refuses to accept evidence presented by someone known to be careful, scrupulous, and honest.

200 what? I'm using the $20 one. This is getting ridiculous!

You can't present a _single_ counter example!

simoncion · 2026-04-09T21:47:59 1775771279

> You can't present a _single_ counter example!

Correct. I've presented a _pair_ of examples.