
ChatGPT Plus user here. The following are all fresh sessions and first answers, no fishing.

GPT 4:

The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots. The type of material doesn’t affect the weight comparison; it’s purely a matter of which bag weighs more on the scale.

GPT 4o:

The 10.01-pound bag of fluffy cotton is heavier. Weight is independent of the material, so the bag of cotton’s 10.01 pounds outweighs the steel ingots’ 9.99 pounds.

GPT o1:

Since both weights are measured on the same scale (pounds), the 10.01-pound bag of cotton is heavier than the 9.99-pound bag of steel, despite steel being denser. The key is simply that 10.01 pounds exceeds 9.99 pounds—density doesn’t affect the total weight in this comparison.



So do what the commenter suggests and make irrelevant permutations to the input to find when it fails, i.e., engage in hypothesis testing rather than confirmation bias.

If a system has the capability to solve problems composed of {part_1, ..., part_n}, then it only has that capability if irrelevant permutations {part_1, part_2', ..., part_n} make no difference.

It's very obvious that such permutations can destroy such apparent capabilities.


I've just tested a number of permutations with Claude 3.5 Sonnet. It correctly answered all variants I tried on the first attempt, as follows:

Which is heavier, a 9.99 kilogram tungsten cube or a 10.01 kilogram block of aerogel?

Which is heavier, 10,000 steel balls weighing 0.999 grams each or 10,000 polystyrene balls weighing 1.001 grams each?

Which is heavier, a 10.01kg block of steel on Venus or a 9.99kg bag of feathers on Earth?

Which is heavier, a 10cm^3 block of steel or a 100cm^3 block of balsa wood?

Which is heavier, a golf ball made of steel or a baseball made of lithium?

In all cases, Claude clearly used CoT and reasoned out the problem in full. I would be interested in seeing if anyone can find any variant of this problem that stumps any of the leading LLMs. I'm bored of trying.


Hey ChatGPT, please write me a Python program which randomly samples from various materials and various weights, then poses the problem to the GPT-4o API -- the goal is to find cases where the LLM fails to obtain the correct answer...
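For what it's worth, below is a minimal sketch of such a fuzzer. It assumes the official OpenAI Python SDK; the material list, prompt template, and string-match grading are made-up illustrations, not anything the API provides.

    import random
    from openai import OpenAI  # assumes the official OpenAI Python SDK (openai >= 1.0)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Illustrative materials and weight gaps -- chosen arbitrarily for this sketch.
    MATERIALS = ["steel ingots", "fluffy cotton", "tungsten", "aerogel",
                 "feathers", "balsa wood", "lead shot", "packing foam"]

    def make_case():
        """Build one randomized 'which is heavier' question with a known answer."""
        heavier = round(random.uniform(5.0, 15.0), 2)
        lighter = round(heavier - random.choice([0.02, 0.1, 0.5]), 2)
        mat_a, mat_b = random.sample(MATERIALS, 2)
        # Randomly assign the larger weight to either material.
        if random.random() < 0.5:
            w_a, w_b = heavier, lighter
        else:
            w_a, w_b = lighter, heavier
        prompt = (f"Which is heavier, a {w_a}-pound bag of {mat_a} "
                  f"or a {w_b}-pound bag of {mat_b}? Answer with the material only.")
        expected = mat_a if w_a > w_b else mat_b
        return prompt, expected

    failures = []
    for _ in range(20):
        prompt, expected = make_case()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content.strip().lower()
        if expected.lower() not in answer:  # crude grading; review flagged cases by hand
            failures.append((prompt, answer))

    for prompt, answer in failures:
        print(prompt, "->", answer)

The string-match grader will flag verbose but correct answers as failures, so anything it surfaces still needs a human look before counting it as a miss.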


BTW - the model may be wrong depending on the example. More voluminous objects displace more air and due to buoyancy are lighter for the same mass.

The proper way to ask it would be to ask which object has more mass.
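As a rough sketch of the size of that effect (the densities and air properties below are assumed round numbers, not figures from the thread):

    G = 9.81           # m/s^2, standard gravity
    AIR_DENSITY = 1.2  # kg/m^3, roughly sea-level air (assumed)

    def apparent_weight(mass_kg, density_kg_m3):
        """Scale reading in air: true weight minus the buoyant force on the displaced air."""
        volume = mass_kg / density_kg_m3
        return mass_kg * G - AIR_DENSITY * volume * G

    print(apparent_weight(100, 8000))  # steel: ~980.9 N
    print(apparent_weight(100, 50))    # low-density foam: ~957.5 N

Equal mass, but the scale reads roughly 2% less for the low-density object because it displaces far more air.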


For what it’s worth, I asked it the following:

Q:

Which weighs more on earth: 100kg of steel or a 100kg polystyrene ball? Be as technically correct as possible

A: Technically, 100 kg of steel and 100 kg of polystyrene weigh the same on Earth when measured on a scale, as they both experience the same gravitational force due to their equal mass. However, there are some nuances to consider:

Buoyancy Effect

Earth’s atmosphere exerts an upward buoyant force on objects, proportional to the volume of the object and the density of the surrounding air. Since polystyrene has a much lower density than steel, a 100 kg polystyrene ball will have a significantly larger volume and experience a greater buoyant force compared to 100 kg of steel.

As a result:

• Effective weight (apparent weight): The steel will weigh slightly more than the polystyrene on a scale because the buoyant force acting on the polystyrene reduces its apparent weight more.

Actual Weight

Weight is the gravitational force on an object and is given by:

W = m × g

where m = 100 kg and g ≈ 9.81 m/s². Without considering buoyancy, both objects have the same weight of approximately 981 N (newtons).

Conclusion

• True weight (in a vacuum): Both are identical at 981 N.
• Apparent weight (in air): Steel is slightly heavier due to reduced buoyant force acting on it compared to the polystyrene ball.


Yup, the models are smart, but they are trained to follow standard human patterns for this type of question. And even on Hacker News, the vast majority will not think that they would need to correct for buoyancy when actually attempting the experiment in standard conditions.

They very often get popular "tricky" questions wrong because they saw it so many times that they switch from internal reasoning to memorization/retrieval.


I tried "Which is heavier, a 10.01kg block of steel or a 9.99kg bag of feathers?" three times, and ChatGPT kept converting kg to pounds and saying the 9.99kg bag is heavier.


Which model? On the paid Plus tier, GPT-4o, o1, and o1-mini all correctly answered that the 10.01kg block is heavier. I did not try any other models.

gpt-4o: https://chatgpt.com/share/67768221-6c60-8009-9988-671beadb5a...

o1-mini: https://chatgpt.com/share/67768231-6490-8009-89a6-f758f0116c...

o1: https://chatgpt.com/share/67768254-1280-8009-aac9-1a3b75ccb4...


Did you try it more than once per model? I could see temperature or other similar sampling parameters producing wrong answers.


Also couldn't get o1 to fail. I tried the following with o1:

"Which is heavier, a solid titanium box weighing 9.9 flubs, or an empty wooden box weighing 10.1 blobs, where 1 flub = 1 kg, and 1 blob is 1kg".

The answer: "Since 1 flub = 1 kg and 1 blob = 1 kg, the titanium box’s mass is 9.9 kg and the wooden box’s mass is 10.1 kg. Therefore, the wooden box (10.1 kg) is heavier."

Thought that was pretty impressive.


I found that the models consistently ignore buoyancy (objects with the same mass but different density can have different weight as measured by a scale).


If GP's hypothesis was "it fails for small variations of the input, like this one", then testing that hypothesis with that exact variation on a couple models seems fair and scientific.

Testing it with more variations until one fails feels a bit like p-hacking. You'd need to engage in actual statistics to get reliable results from that, beyond "If I really try, I can make it fail" -- which would be a completely different hypothesis from the one presented at the start.


I'm almost 100% certain that the smart guys at OpenAI have added many more variations of the problem to their training set since OP ran his failing test, so it doesn't surprise me at all that this exact one now passes.

In fact, in my use of o1 it's incredibly clear that it still has the same problems. It's incredibly common that the second I ask for something even slightly outside the training set, it's more likely to "round" to some wrong solution in the training set than to use any sort of human-like reasoning to figure out the right answer (often the right answer isn't hard to get, just not found in a Google search).


Can’t really do science with closed source software, right? Who knows what’s in there.


I asked "Which is louder, a 12.5 decibel scream, or a 12.8 decibel whisper?", which I wouldn't think has much of a presence in its training set, and it answered correctly.

May still get it wrong in more subtle ways, though. Personally, I think it'll continue to get physics wrong until someone builds it some robot arms so it can train on actually interactive physical spaces and behavior.


Except that if the model genuinely was reasoning about the problem, you could test it with every variation of materials and weights in the world and it would pass. Failing that problem at all in any way under any conditions is a failure of reasoning.


By that logic, humans can't genuinely reason, because they're often fooled by counter-intuitive problems like Monty Hall or the Birthday Problem, or sometimes just make mistakes on trivial problems.


I’m not really sure what you’re trying to say here - that LLMs don’t work like human brains? We don’t need to conduct any analyses to know that LLMs don’t “know” anything in the way humans “know” things because we know how LLMs work. That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.


>We don’t need to conduct any analyses to know that LLMs don’t “know” anything in the way humans “know” things because we know how LLMs work.

People, including around HN, constantly argue (or at least phrase their arguments) as if they believed that LLMs do, in fact, possess such "knowledge". This very comment chain exists because people are trying to defend against a trivial example refuting the point - as if there were a reason to try.

> That doesn’t mean that LLMs aren’t incredibly powerful; it may not even mean that they aren’t a route to AGI.

I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.

It boggles my mind that I have to explain to people that sophisticated use of language doesn't inherently evidence thought, in the current political environment where the Dead Internet Theory is taken seriously, elections are shown over and over again to be more about tribalism and personal identity than anything to do with policy, etc.


You don't have to listen to or engage with those people though, just ignore 'em. People say all kinds of things on the Internet. It's completely futile to try to argue with or "correct" them all.


> I don't accept your definition of "intelligence" if you think that makes sense. Systems must be able to know things in the way that humans (or at least living creatures) do, because intelligence is exactly the ability to acquire such knowledge.

According to whom? There is certainly no single definition of intelligence, but most people who have studied it (psychologists, in the main) view intelligence as a descriptor of the capabilities of a system - e.g., it can solve problems, it can answer questions correctly, etc. (This is why we call some computer systems "artificially" intelligent.) It seems pretty clear that you're confusing intelligence with the internal processes of a system (e.g. mind, consciousness - "knowing things in the way that humans do").


We are pretty certain that humans can reason, yet they are sometimes wrong. Even if you give them the same problem over and over again with slight variations.

LLMs get things wrong due to different factors than humans (humans lose focus, LLMs have randomness applied when sampling their responses to improve results). But clearly we have to choose a goal somewhat below 100% if we want a test that doesn't conclude that humans are incapable of reasoning.


The difference is we _know_ that LLMs are fancy stochastic models; we don't know that they're capable of reasoning, and the null hypothesis is that they're not (because we know what they _are_ - we built them) - any "reasoning" is an emergent property of the system, not something we built them to do. In that case, evidence that they're not reasoning - evidence that they're stochastic parrots doing a performance of reasoning - weighs heavier, because the performance of reasoning fits into what we know they can do, whereas genuine reasoning would be something new to the model.

There are deeper philosophical questions about what reasoning actually _is_, and LLMs have made those sharper, because they've shown it's clearly possible for a complex statistical model to generate words that look like reasoning. The question is whether there's a difference between what they're doing and what humans are doing, and evidence that they're _not_ reasoning - evidence that they're just generating words in specific orders - weighs heavily against them.


We haven't coded LLMs to be stochastic models; we coded them to predict text with whatever method gradient descent finds on a transformer architecture. That's not exactly the same.

But more importantly, if you want to show that LLMs can't reason you obviously have to use a test that when applied to humans would show that humans can reason. Otherwise your test isn't testing reasoning but something more strict.


>we don't know that they're capable of reasoning

Apple AI researchers released a paper on it. They say no.

https://arxiv.org/pdf/2410.05229


It's widely accepted that reasoning is not a binary skill.

You can make mistakes and still reason. Very often, people given the same premises will disagree in their reasoning, as we are doing right here.


It's not p-hacking; he's right. You're both right. First, test the same prompt on different versions; then the ones that got it right go to the next round, with variations on the prompt.


We aren't testing whether the model's results are stable or correct for a given class of problem. The goal is to establish whether the model can reason.

Nothing capable of reasoning would contradict itself so blatantly and in such a short span while failing to indicate any kind of uncertainty.


Reasoning is not a binary skill.

And failure modes of other types of reasoners do not need to be the same as the failure modes of humans.


gpt-3.5-turbo at 0 temperature:

> The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.

Interestingly, that was the best answer out of all models and also a correct one. O1-preview doesn't allow setting the temperature, so it's hard to know where the real probabilities fall.
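For anyone reproducing this, here is a minimal sketch of pinning the temperature through the API, assuming the official OpenAI Python SDK (note that even at temperature 0 the output is not guaranteed to be identical across runs):

    from openai import OpenAI  # assumes the official OpenAI Python SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # greedy-ish decoding; reduces but does not remove run-to-run variation
        messages=[{"role": "user", "content":
                   "Which is heavier, a 9.99-pound bag of steel ingots "
                   "or a 10.01-pound bag of fluffy cotton?"}],
    )
    print(resp.choices[0].message.content)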


> ...engage in hypothesis testing rather than confirmation bias

Please leave the premises, sir. We don't take kindly to luddites here.


Tough crowd


Lots of other websites are more appropriate for meme jokes.


Like I said.


Your two word comment was ambiguous. I interpreted it as something like "People are downvoting you because they have no sense of humor".

There are other websites where two and three word comments are better received.


Mea culpa.


I asked ChatGPT-4 mini the previous comment's question ("Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?") in new incognito tabs until I got an incorrect result. It only took four tries.

1: "The 10.01-pound bag of fluffy cotton would be heavier. Even though cotton is much less dense than steel, the weight of the cotton bag is 10.01 pounds, while the steel ingot bag weighs 9.99 pounds, making the cotton bag the heavier of the two by 0.02 pounds."

2: "The 10.01-pound bag of fluffy cotton is heavier. Even though the steel ingots are denser and more compact, the cotton weighs more at 10.01 pounds compared to the steel ingots' 9.99 pounds. The difference is only 0.02 pounds, but in terms of weight, the cotton bag is heavier."

3: "The 10.01-pound bag of fluffy cotton is heavier. Despite cotton being less dense than steel, the weight is determined by the total mass, not the material's density. So, since the cotton bag weighs 10.01 pounds and the steel ingot bag weighs 9.99 pounds, the cotton bag is heavier by 0.02 pounds."

4: "The 9.99-pound bag of steel ingots is heavier. It weighs 9.99 pounds, while the bag of fluffy cotton weighs 10.01 pounds. Despite the cotton being fluffier and taking up more space, the steel ingots are still lighter by 0.02 pounds."


The question isn’t whether it can get this question correct. It is: why is it incapable of getting the answer consistently right?


they've likely read this thread and adjusted their pre-filter to give the correct answer


o1-preview:

To determine which bag is heavier when measured in air—the 9.99-pound bag of steel ingots or the 10.01-pound bag of fluffy cotton—we need to consider not just their masses but also the effect of air buoyancy on their apparent weights.

*Step 1: Calculate the Masses*

First, convert the given weights into masses using the conversion \(1 \text{ pound} = 0.453592 \text{ kg}\):

- Mass of steel bag, \(m_s = 9.99 \, \text{lb} \times 0.453592 \, \text{kg/lb} \approx 4.5310 \, \text{kg}\)
- Mass of cotton bag, \(m_c = 10.01 \, \text{lb} \times 0.453592 \, \text{kg/lb} \approx 4.5401 \, \text{kg}\)

*Step 2: Calculate the Volumes*

Next, calculate their volumes using the densities:

- Density of steel, \(\rho_s \approx 8000 \, \text{kg/m}^3\)
- Density of cotton (fluffy), \(\rho_c \approx 50 \, \text{kg/m}^3\)

- Volume of steel bag, \(V_s = \frac{m_s}{\rho_s} \approx \frac{4.5310 \, \text{kg}}{8000 \, \text{kg/m}^3} \approx 5.664 \times 10^{-4} \, \text{m}^3\)
- Volume of cotton bag, \(V_c = \frac{m_c}{\rho_c} \approx \frac{4.5401 \, \text{kg}}{50 \, \text{kg/m}^3} \approx 0.090802 \, \text{m}^3\)

*Step 3: Calculate the Buoyant Forces*

Using the density of air \(\rho_{\text{air}} \approx 1.2 \, \text{kg/m}^3\) and acceleration due to gravity \(g = 9.81 \, \text{m/s}^2\):

- Buoyant force on steel bag, \(B_s = \rho_{\text{air}} \times V_s \times g \approx 1.2 \, \text{kg/m}^3 \times 5.664 \times 10^{-4} \, \text{m}^3 \times 9.81 \, \text{m/s}^2 \approx 0.00668 \, \text{N}\)
- Buoyant force on cotton bag, \(B_c = \rho_{\text{air}} \times V_c \times g \approx 1.2 \, \text{kg/m}^3 \times 0.090802 \, \text{m}^3 \times 9.81 \, \text{m/s}^2 \approx 1.068 \, \text{N}\)

*Step 4: Calculate the Apparent Weights in Air*

Subtract the buoyant force from the actual gravitational force (mass times gravity):

- Apparent weight of steel bag, \(W_s' = m_s \times g - B_s \approx 4.5310 \, \text{kg} \times 9.81 \, \text{m/s}^2 - 0.00668 \, \text{N} \approx 44.44 \, \text{N}\)
- Apparent weight of cotton bag, \(W_c' = m_c \times g - B_c \approx 4.5401 \, \text{kg} \times 9.81 \, \text{m/s}^2 - 1.068 \, \text{N} \approx 43.47 \, \text{N}\)

*Conclusion:*

Despite the cotton bag having slightly more mass and a greater weight in a vacuum, when measured in air, the steel bag is heavier due to the significantly smaller upward buoyant force acting on it compared to the cotton bag. This means that on a scale in air, the 9.99-pound bag of steel ingots will weigh more than the 10.01-pound bag of fluffy cotton.


This answer is probably better than 99.99% of human answers. Feel the AGI!



