
Every paper and comment I've seen that claims "GPT-4 can't reason" is an example of bad prompting, bad/vague English, or taking advantage of the limitations of tokenization (like asking GPT-4 how many letters there are in some word).

I would like to see one - just ONE example - of a well-defined logical puzzle that most English speakers can get right but GPT-4 consistently fails on, even if you prompt it to think through the problem step by step. Any takers?




> Every paper and comment I've seen that claims "GPT-4 can't reason" is an example of bad prompting, bad/vague English, or taking advantage of the limitations of tokenization (like asking GPT-4 how many letters there are in some word).

The problem isn't bad prompting. The problem is lack of repetition. You can ask GPT-4 the same question 10 times (with the same config) and you'll get wildly different, nondeterministic responses. Sometimes it accidentally happens to be correct (IME much less than half the time). Even if it were 50%, would you say a coin flip reasons? Does a Magic 8 Ball reason?
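
To make "ask it 10 times" concrete, here's roughly what a repeat-trial loop looks like with the (pre-1.0) openai Python client - the model name, temperature, and question are just example settings, not anything from the comment above:

    # Repeat-trial sketch using the pre-1.0 openai client. With
    # temperature > 0 sampling is stochastic, so answers vary run to
    # run; even temperature=0 isn't guaranteed bit-identical in practice.
    import openai

    question = ("A bat and a ball cost $1.10 in total. The bat costs "
                "$1.00 more than the ball. How much does the ball cost?")

    answers = []
    for _ in range(10):
        resp = openai.ChatCompletion.create(
            model="gpt-4",          # example model name
            messages=[{"role": "user", "content": question}],
            temperature=0.7,        # example setting
        )
        answers.append(resp.choices[0].message.content.strip())

    print(answers)  # typically not 10 identical strings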


There is a huge difference between being correct 50% of the time (like a coin flip) and being correct 51% of the time, because in the second case you can run the model 10,000 times and the majority answer will be the correct one at least 97% of the time. A coin can't do that. Any paper evaluating GPT-4's responses should be doing repeat trials and building confidence intervals, like any other research. Anything else is just bad science.
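
A quick sanity check of that arithmetic, assuming each run is an independent trial with a fixed per-run accuracy (a strong assumption for an LLM, where errors are often correlated across samples; scipy assumed available):

    # P(majority vote is correct) for n independent runs at accuracy p.
    from scipy.stats import binom

    n, p = 10_000, 0.51
    print(binom.sf(n // 2, n, p))    # P(X > 5000) ~ 0.977
    print(binom.sf(n // 2, n, 0.5))  # coin-flip baseline ~ 0.496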


Remind me, why do we need a computer program that gets it wrong 49% of the time and has to be run 10,000 times to yield a reliable answer, to questions that can be answered correctly 100% of the time with a different program? And that's before taking into account that the 49%-wrong program costs millions to train and requires gigantic amounts of data and 100+ person teams to create, which the alternative programs don't.

What, at the end of the day, is the use of a computer stripped of computer-like precision and recall (as in the ability to retrieve facts from memory)?

Why are people so excited about a piece of software that works only for some people, some of the time, like homeopathy or astrology?

And what does all that have to do with science?


Heck, ask ChatGPT if it can understand error reduction by iteration.

"How can I get my program that produces correct responses 66% of the time to always produce a correct response?"

I'm not paying for that answer. Note that answering it correctly requires inverting the mathematics to prove your result is always correct - and no finite number of repetitions gets you from 66% to "always".

After asking GPT that, ask it to explain how it arrived at this conclusion step by step.
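
For what it's worth, the iteration trick itself is a few lines - the trap is the word "always". A toy simulation with a stand-in 66%-correct oracle (not a real model call):

    import random
    from collections import Counter

    def noisy_oracle(truth, accuracy=0.66):
        # Right answer with probability `accuracy`, else a wrong one.
        return truth if random.random() < accuracy else "wrong"

    def majority_vote(truth, runs):
        votes = Counter(noisy_oracle(truth) for _ in range(runs))
        return votes.most_common(1)[0][0]

    for runs in (1, 5, 25, 125):
        trials = 10_000
        wins = sum(majority_vote("right", runs) == "right"
                   for _ in range(trials))
        print(runs, wins / trials)   # ~0.66, ~0.78, ~0.95, ~0.9999

    # Accuracy climbs toward 1 but never reaches it, so "always" is
    # unattainable by repetition alone - which is the trap in the question.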


Sudoku? But there are endless failings in its reasoning; they just don't come up in one-off questions, only in more complex discussions with it.


Sudoku is a decent example (which has apparently been solved, but only through very specific prompting [1]), though I would be more interested in puzzles that require a lot of arithmetic, since it's already clear that GPT-4 struggles with math and counting.

1. https://strange-prompts.ghost.io/i-taught-gpt-4-to-solve-sud...
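
For contrast with the prompting gymnastics in [1], the conventional-program route is short and right every time. A minimal backtracking solver (grids as 9x9 lists of ints, 0 for blanks):

    def valid(grid, r, c, v):
        # v must not already appear in row r, column c, or the 3x3 box.
        if any(grid[r][j] == v for j in range(9)): return False
        if any(grid[i][c] == v for i in range(9)): return False
        br, bc = 3 * (r // 3), 3 * (c // 3)
        return all(grid[br + i][bc + j] != v
                   for i in range(3) for j in range(3))

    def solve(grid):
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    for v in range(1, 10):
                        if valid(grid, r, c, v):
                            grid[r][c] = v
                            if solve(grid):
                                return True
                            grid[r][c] = 0
                    return False  # no digit fits here: backtrack
        return True               # no blanks left: solved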


I've found GPT-3.5 to be quite bad at small traveling salesman type problems.

I used the following prompt just now on https://chat.openai.com:

> Imagine that I take a road trip starting from Amsterdam, and visiting the cities Warsaw, Budapest, Brussels, Paris, and Berlin, then returning to Amsterdam. In what sequence should I visit those cities in order to minimise the total driving time?

> To minimize the total driving time for your road trip starting from Amsterdam and visiting the cities Warsaw, Budapest, Brussels, Paris, and Berlin, you should plan your route based on the most efficient sequence. One way to achieve this is by using a route optimization algorithm, like the Traveling Salesman Problem (TSP) solver. However, keep in mind that road conditions, traffic, and other factors can impact the actual travel time.

> Here's a sequence that aims to minimize the total driving time based on a general estimation of distances between the cities:

> Amsterdam > Brussels > Paris > Berlin > Warsaw > Budapest > Amsterdam (return)

> This sequence follows a route that minimizes the overall distance traveled. Remember to use a navigation tool or app to get real-time traffic updates and directions during your trip, as actual conditions might differ from estimates.

The result is not optimal. It takes 45 hours: https://www.google.com/maps/dir/Amsterdam,+Netherlands/Bruss...

The following sequence takes only 42 hours: https://www.google.com/maps/dir/Amsterdam,+Netherlands/Bruss...

I've not tested GPT-4 as I don't have any reason to pay for it, but I'd be interested to know if it has a similar problem. My hunch is that it will never be very good at solving graph-theoretic problems.
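
For scale: six cities is not really a hard graph problem - it brute-forces instantly. A sketch using great-circle distance as a crude stand-in for driving time (real road times differ, as the Maps links show):

    from itertools import permutations
    from math import radians, sin, cos, asin, sqrt

    coords = {  # approximate (lat, lon)
        "Amsterdam": (52.37, 4.90), "Warsaw": (52.23, 21.01),
        "Budapest": (47.50, 19.04), "Brussels": (50.85, 4.35),
        "Paris": (48.86, 2.35), "Berlin": (52.52, 13.40),
    }

    def dist(a, b):  # haversine distance in km
        (la1, lo1), (la2, lo2) = coords[a], coords[b]
        la1, lo1, la2, lo2 = map(radians, (la1, lo1, la2, lo2))
        h = (sin((la2 - la1) / 2) ** 2
             + cos(la1) * cos(la2) * sin((lo2 - lo1) / 2) ** 2)
        return 2 * 6371 * asin(sqrt(h))

    def tour_length(order):
        stops = ("Amsterdam", *order, "Amsterdam")
        return sum(dist(a, b) for a, b in zip(stops, stops[1:]))

    best = min(permutations(["Warsaw", "Budapest", "Brussels",
                             "Paris", "Berlin"]), key=tour_length)
    print(" > ".join(("Amsterdam", *best, "Amsterdam")))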



