
Strong downvote, if that were possible.

Someone's already quoted the heart rate one where it correctly pointed out that it's possible to die and be resuscitated.

The first one I tried to reproduce myself was the one immediately before that in the paper, verbatim: "Find a model in which P(x) implies Q(x), Q(a) does not hold, and P(a) holds." It got that one correct too: it tried to give a positive answer, but ended up correctly saying "It seems that the given conditions are contradictory, and no model can satisfy all three conditions simultaneously.". With a small chain-of-thought adjustment it easily produces a proof that the setup is contradictory (https://chat.openai.com/share/d2b4b63e-d585-413d-82c9-19595d...).
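
For the record, a quick sketch of why those three conditions can't all hold (it's one instantiation plus modus ponens):

  1. forall x, P(x) -> Q(x)   [given]
  2. P(a)                     [given]
  3. P(a) -> Q(a)             [instantiate 1 at a]
  4. Q(a)                     [modus ponens on 2 and 3]
  5. not Q(a)                 [given] -- contradicts 4, so no model satisfies all three.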

I'm not going to go through any of the other ones, but it's clear that the authors are simply wrong (or at least, if they are correct, their reasoning is not evidence of that fact).

----

OK, I am going to go through some of the other ones.

1. Multiplication of four-digit numbers: tick, with chain-of-thought (a worked sketch of the kind of decomposition involved follows this list). https://chat.openai.com/share/baa9c362-22fd-4569-b30f-8c9d83...

2. Counting negations: tick, with chain-of-thought. https://chat.openai.com/share/e5f6f928-0bf3-4e60-8a93-014e16...

3. Counting repeated greetings: tick, got this correct verbatim. https://chat.openai.com/share/a92d5d52-c555-45b9-b91f-0f0042...

4. Medical heart rate one: I believe ChatGPT was correct and the author of the paper was wrong here.

5. Elementary logic: this is what my first reproduction was, and it got it correct when verbatim and gave a proof with chain-of-thought. https://chat.openai.com/share/d2b4b63e-d585-413d-82c9-19595d...

6. Quantifiers. I agree that ChatGPT doesn't seem to understand quantifiers, and I know of no obvious way to rephrase the question to elicit that knowledge without begging the question (https://chat.openai.com/share/16a046fd-dd68-4c35-bdba-64b63c...). By the way, this mistake is pretty common in humans.

7. Quantifiers, part 2: in my reproduction it parsed the question wrongly so I assume it was doomed from the start (https://chat.openai.com/share/764bf14a-a02c-4871-9c22-0be840...). Again, I'm perfectly happy to believe it simply can't do this; many humans can't do this either.
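
(Re item 1: the kind of decomposition the chain-of-thought prompt elicits looks like the following. These digits are mine, not the paper's, purely to illustrate the shape:

  4382 x 7261 = 4382 x 7000 + 4382 x 200 + 4382 x 60 + 4382 x 1
              = 30,674,000  + 876,400    + 262,920   + 4,382
              = 31,817,702)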

---

I'll stop here, because we've hit a problem of reasoning about graph vertex colourings, where I myself would struggle to verify any answer given only as free text without drawing a diagram; that question seems to be grossly unfair.



I've commented below that I have tried playing some lateral-thinking games with ChatGPT and found it to be pretty poor at understanding what is going on when it has limited information/context to work with. I had played around with implementing ChatGPT as a bot player on my "dark stories" app: https://detective-stories.com/ but I found that it wasn't close to even an average human player when it came to the skills needed to play either role in the game.


Oh absolutely a fair criticism! Personally I've all but stopped using ChatGPT (I used it six times in the last seven days before today, and two of them were for the same question about GitHub Markdown) because it's just too unreliable. But I really resent the preprint being given as evidence for ChatGPT's disutility, because it's simply bad evidence.


Might there be another explanation which fits the facts other than the authors being wrong?


I'm all ears!


OpenAI read the paper and changed the model?


Quick work, if they did so since the preprint was posted six days ago, of which two were a weekend! My version of ChatGPT claims to be the 3rd August version, which gave them one day to respond unless they were somehow targeting some sneak peek pre-preprint.


Don't know how much time they need to tweak their model, but here is another possibility.

OpenAI sells GPT-4, but it's really only GPT-3.5 because of a lack of resources.

Or, more sinister: they knew what the author was about to test and gave him the inferior model so the paper could be easily debunked.


27th July was the first version of the paper.

https://www.preprints.org/manuscript/202308.0148/v2


A whole four working days to adjust the model between the preprint release and the version of ChatGPT I'm using, then! Do you think that's plausible? I certainly don't.


Or simply the model was improved between the author's test and the release of the paper.

BTW, the timestamp of the model would be easy to fake.

We are talking about a billion-dollar business opportunity, so expect foul play all along.


Yeah man they have teams on standby to adjust the model whenever a random unknown author posts something on obscure pre-print servers. Then they spend hundreds of thousands of compute $ to improve the model on that one metric the paper attacks.


Have you tried a similar question with different parameters?

Special-casing a fix is pretty easy if you assume people are only checking the exact same quote.



