I've commented below that I have tried playing some lateral-thinking games with ChatGPT, and I have found it to be pretty poor at understanding what is going on when it has limited information/context to work with.
I had played around with implementing ChatGPT as a bot player on my "dark stories" app (https://detective-stories.com/), but I found that it wasn't close to even an average human player when it came to the skills needed to play either role in the game.
Oh, absolutely a fair criticism! Personally I've all but stopped using ChatGPT (I used it six times in the last seven days before today, and two of those were for the same question about GitHub Markdown) because it's just too unreliable. But I really resent the preprint being given as evidence for ChatGPT's disutility, because it's simply bad evidence.
Quick work, if they did so in the six days since the preprint was posted, two of which fell on a weekend! My version of ChatGPT claims to be the 3rd August version, which gave them one day to respond, unless they were somehow targeting some sneak-peek pre-preprint.
A whole four working days to adjust the model between the preprint's release and the version of ChatGPT I'm using, then! Do you think that's plausible? I certainly don't.
Yeah, man, they have teams on standby to adjust the model whenever a random unknown author posts something on an obscure preprint server. Then they spend hundreds of thousands of dollars of compute to improve the model on the one metric the paper attacks.
Someone's already quoted the heart rate one where it correctly pointed out that it's possible to die and be resuscitated.
The first one I tried to reproduce myself was, verbatim, the one immediately before that in the paper: "Find a model in which P(x) implies Q(x), Q(a) does not hold, and P(a) holds." It got that one correct too: it started out trying to give a positive answer, but ended up correctly saying "It seems that the given conditions are contradictory, and no model can satisfy all three conditions simultaneously." With a small chain-of-thought adjustment it easily produces a proof that the setup is contradictory (https://chat.openai.com/share/d2b4b63e-d585-413d-82c9-19595d...).
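For concreteness, here is a minimal sketch of that contradiction in Lean 4. This is my own formalisation (P, Q and a are just the paper's names), not anything the model produced:

    -- The three conditions are jointly unsatisfiable:
    -- from (∀ x, P x → Q x), ¬ Q a and P a we derive False in one step.
    example {α : Type} (P Q : α → Prop) (a : α)
        (h₁ : ∀ x, P x → Q x) (h₂ : ¬ Q a) (h₃ : P a) : False :=
      h₂ (h₁ a h₃)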
I'm not going to go through any of the other ones, but it's clear that the authors are simply wrong (or at least, if they are correct, their reasoning is not evidence of that fact).
----
OK, I am going to go through some of the other ones.
1. Multiplication of four-digit numbers: tick, with chain-of-thought. https://chat.openai.com/share/baa9c362-22fd-4569-b30f-8c9d83...
2. Counting negations: tick, with chain-of-thought (see the parity sketch after this list). https://chat.openai.com/share/e5f6f928-0bf3-4e60-8a93-014e16...
3. Counting repeated greetings: tick, got this correct verbatim. https://chat.openai.com/share/a92d5d52-c555-45b9-b91f-0f0042...
4. Medical heart rate one: I believe ChatGPT was correct and the author of the paper was wrong here.
5. Elementary logic: this is what my first reproduction was, and it got it correct when verbatim and gave a proof with chain-of-thought. https://chat.openai.com/share/d2b4b63e-d585-413d-82c9-19595d...
6. Quantifiers: I agree that ChatGPT doesn't seem to understand quantifiers, and I know of no obvious way to rephrase the question to elicit that knowledge without begging the question (https://chat.openai.com/share/16a046fd-dd68-4c35-bdba-64b63c...). By the way, this mistake is pretty common in humans.
7. Quantifiers, part 2: in my reproduction it parsed the question wrongly, so I assume it was doomed from the start (https://chat.openai.com/share/764bf14a-a02c-4871-9c22-0be840...). Again, I'm perfectly happy to believe it simply can't do this; many humans can't do it either.
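On item 2, the underlying fact (assuming the task is the usual one of deciding whether a long stack of negations comes out true or false) is just that double negation cancels, so only the parity of the count matters. A one-line Lean 4 sketch of that fact, again my own and not the model's:

    -- Double negation cancels (classically), so an even number of negations
    -- gives back p and an odd number gives ¬ p.
    example (p : Prop) : ¬¬p ↔ p :=
      ⟨fun h => Classical.byContradiction h, fun hp hn => hn hp⟩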
----
I'll stop here, because we've hit a problem of reasoning about graph vertex colourings, where I myself would struggle to verify any answer given only as free text without drawing a diagram; that question seems to be grossly unfair.