> Original dog (4 legs): All models get it right
Same dog with 5 legs: All models still say "4"
They're not counting - they're just recalling "dogs have 4 legs" from their training data.
100% failure because there is no training data about 5-legged dogs. I would bet the accuracy is higher for 3-legged dogs.
> Test on counterfactual images
Q1: "How many visible stripes?" → "3" (should be "4")
Q2: "Count the visible stripes" → "3" (should be "4")
Q3: "Is this the Adidas logo?" → "Yes" (should be "No")
Result: 17.05% average accuracy - catastrophic failure!
Simple explanation: the training data also includes fake adidas logos that have 4 stripes, like these
I tried it with GPT-4o, took the 5-legged zebra example from their github and it answered quite well.
"The animal in the image appears to have five visible legs, but this is an illusion caused by the overlapping of legs and motion blur. Zebras, like all equids, only have four legs."
Not perfect, but also doesn't always regress to the usual answer.
"The animal in the image appears to be an elephant, but it has been digitally altered. It visually shows six legs, although the positioning and blending of shadows and feet are unnatural and inconsistent with real anatomy. This is a visual illusion or manipulation." (actually should say five)
"This bird image has also been manipulated. It shows the bird with three legs, which is anatomically impossible for real birds. Normal birds have exactly two legs." (correct)
"Each shoe in the image has four white stripes visible on the side." (correct)
It sounds like you ask multiple questions in the same chat thread/conversation. Once it knows that it is facing weird data or wrong in previous answers, it can turn on that "I'm facing manipulated data" mode for next questions. :-)
If you have Memory setting ON, I observe that it sometimes also answers a question based on you prior questions/threads.
But models fail on many logos not just Adidas, e.g. Nike, Mercedes, Maserati logos, etc. as well. I don't think they can recall "fake Adidas logo" but it'd be interesting to test!
100% failure because there is no training data about 5-legged dogs. I would bet the accuracy is higher for 3-legged dogs.
> Test on counterfactual images Q1: "How many visible stripes?" → "3" (should be "4") Q2: "Count the visible stripes" → "3" (should be "4") Q3: "Is this the Adidas logo?" → "Yes" (should be "No") Result: 17.05% average accuracy - catastrophic failure!
Simple explanation: the training data also includes fake adidas logos that have 4 stripes, like these
https://www.pinterest.com/pin/577797827186369145/