I've just tested a number of permutations with Claude 3.5 Sonnet. It correctly answered all variants I tried on the first attempt, as follows:
Which is heavier, a 9.99 kilogram tungsten cube or a 10.01 kilogram block of aerogel?
Which is heavier, 10,000 steel balls weighing 0.999 grams each or 10,000 polystyrene balls weighing 1.001 grams each?
Which is heavier, a 10.01kg block of steel on Venus or a 9.99kg bag of feathers on Earth?
Which is heavier, a 10cm^3 block of steel or a 100cm^3 block of balsa wood?
Which is heavier, a golf ball made of steel or a baseball made of lithium?
In all cases, Claude clearly used CoT and reasoned out the problem in full. I would be interested in seeing if anyone can find any variant of this problem that stumps any of the leading LLMs. I'm bored of trying.
Hey, ChatGPT please write me a python program which randomly samples from various materials and various weights then poses a problem to the ChatGPT 4o API -- the goal is to find cases where the LLM fails to obtain the correct answer....
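For anyone who wants to try this for real, here is a minimal sketch of such a harness. It is not tied to any particular API: `ask_model` is a hypothetical callable (question string in, answer string out) that you would have to wire up to whatever model you are testing, and the material list, mass offsets, and answer-matching rule are all assumptions made for illustration.

```python
import random

# Assumed material list; only the names are used, not real densities.
MATERIALS = ["steel", "tungsten", "feathers", "polystyrene", "aerogel", "balsa wood"]


def make_problem(rng):
    """Return (question, expected_material) for two objects with nearly equal masses."""
    mat_a, mat_b = rng.sample(MATERIALS, 2)
    base = rng.choice([1, 10, 100])
    offset = rng.choice([0.01, 0.1])  # keeps the masses close but never equal
    mass_a = round(base - offset, 2)
    mass_b = round(base + offset, 2)
    if rng.random() < 0.5:
        # Randomize which object is mentioned first.
        mat_a, mat_b = mat_b, mat_a
        mass_a, mass_b = mass_b, mass_a
    question = (
        f"Which is heavier, a {mass_a} kg block of {mat_a} "
        f"or a {mass_b} kg block of {mat_b}? "
        "Answer with only the material name."
    )
    expected = mat_a if mass_a > mass_b else mat_b
    return question, expected


def find_failures(ask_model, n_trials=20, seed=0):
    """Pose n_trials random problems and collect the ones the model gets wrong.

    ask_model is a hypothetical callable: question string -> answer string.
    The match is deliberately crude: exact material name after normalization.
    """
    rng = random.Random(seed)
    failures = []
    for _ in range(n_trials):
        question, expected = make_problem(rng)
        answer = ask_model(question)
        if answer.strip().lower().rstrip(".") != expected:
            failures.append((question, expected, answer))
    return failures
```

Calling `find_failures(my_model_fn)` returns a list of `(question, expected, answer)` tuples for every miss. The ground truth here is mass only, so it would not flag a model that raises buoyancy unprompted; exact-match grading will also reject any answer longer than a bare material name.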
Which weighs more on earth: 100kg of steel or a 100kg polystyrene ball? Be as technically correct as possible
A:
Technically, 100 kg of steel and 100 kg of polystyrene weigh the same on Earth when measured on a scale, as they both experience the same gravitational force due to their equal mass. However, there are some nuances to consider:
Buoyancy Effect
Earth’s atmosphere exerts an upward buoyant force on objects, proportional to the volume of the object and the density of the surrounding air. Since polystyrene has a much lower density than steel, a 100 kg polystyrene ball will have a significantly larger volume and experience a greater buoyant force compared to 100 kg of steel.
As a result:
• Effective weight (apparent weight): The steel will weigh slightly more than the polystyrene on a scale because the buoyant force acting on the polystyrene reduces its apparent weight more.
Actual Weight
Weight is the gravitational force on an object and is given by:
$W = m g$
Where $m = 100\ \text{kg}$ and $g = 9.81\ \text{m/s}^2$. Without considering buoyancy, both objects have the same weight of approximately $981\ \text{N}$ (newtons).
Conclusion
• True weight (in a vacuum): Both are identical at 981 N.
• Apparent weight (in air): Steel is slightly heavier due to reduced buoyant force acting on it compared to the polystyrene ball.
Yup, the models are smart, but they are trained to follow standard human patterns for this type of question. Even on Hacker News, the vast majority would not think to correct for buoyancy when actually running the experiment under standard conditions.
They very often get popular "tricky" questions wrong because they have seen them so many times that they switch from internal reasoning to memorization/retrieval.
I tried "Which is heavier, a 10.01kg block of steel on or a 9.99kg bag of feathers?" three times, and ChatGPT kept converting kg to pounds and saying the 9.99kg bag is heavier.
Also couldn't get o1 to fail. I tried the following with o1:
"Which is heavier, a solid titanium box weighing 9.9 flubs, or an empty wooden box weighing 10.1 blobs, where 1 flub = 1 kg, and 1 blob is 1kg".
The answer:
"Since 1 flub = 1 kg and 1 blob = 1 kg, the titanium box’s mass is 9.9 kg and the wooden box’s mass is 10.1 kg. Therefore, the wooden box (10.1 kg) is heavier."
I found that the models consistently ignore buoyancy (objects with the same mass but different density can have different weight as measured by a scale).
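To put a rough number on the buoyancy effect, here is a back-of-the-envelope sketch for the 100 kg steel vs. polystyrene case from earlier in the thread. The densities are assumed round values (solid polystyrene rather than expanded foam, which would make the gap far larger), and air density is taken as about 1.2 kg/m^3 near sea level.

```python
G = 9.81            # m/s^2, standard gravity (matches the 981 N figure above)
AIR_DENSITY = 1.2   # kg/m^3, assumed value for air near sea level

# Assumed round-number densities in kg/m^3; real values vary by alloy and grade.
DENSITIES = {"steel": 7850.0, "polystyrene": 1050.0}


def true_weight(mass_kg):
    """Gravitational force in newtons, ignoring the atmosphere."""
    return mass_kg * G


def buoyant_force(mass_kg, density):
    """Upward force from displaced air: rho_air * volume * g."""
    volume = mass_kg / density
    return AIR_DENSITY * volume * G


def apparent_weight(mass_kg, density):
    """What a force-measuring scale in air would register, in newtons."""
    return true_weight(mass_kg) - buoyant_force(mass_kg, density)


if __name__ == "__main__":
    for name, density in DENSITIES.items():
        print(name, round(apparent_weight(100, density), 3), "N")
```

Under these assumptions the polystyrene loses roughly 1 N more to buoyancy than the steel, about 0.1% of the 981 N true weight. That is small, but easily more than the 0.02 kg margins used in the trick questions above would be at smaller scales.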