If you use any of the conventional theory-of-mind tests (most famously the Sally-Anne test [1], but also the others), SOTA reasoning models will score near 100%. Even if you try to come up with similar questions that you expect not to be in the training set, they will still get them right.
In the absence of any evidence to the contrary, this is convincing evidence in my opinion.
That same source you link says that your view of 100% is not the accepted consensus:
"... GPT-4's ability to reason about the beliefs of other agents remains limited (59% accuracy on the ToMi benchmark),[15] and is not robust to "adversarial" changes to the Sally-Anne test that humans flexibly handle.[16][17] While some authors argue that the performance of GPT-4 on Sally-Anne-like tasks can be increased to 100% via improved prompting strategies,[18] this approach appears to improve accuracy to only 73% on the larger ToMi dataset."
In basically every case, a claim like that is obsolete by the time the paper is published, and ancient history by the time you use it to try to win an argument.
My point is merely that if you are going to make an argument using a source, the source should support your argument. If you say "the accuracy of an LLM on task 1 is 90% [1]", and when you go to [1] it says the accuracy of an LLM on task 1 is 50%, but that some sources claim better prompting can raise it to 90%, and that on a larger dataset for task 1 performance drops to 70%, then quoting only the highest number is misleading.
Maybe having a theory of mind isn't the big deal we thought it was. People are so conditioned to expect such things only from biological lifeforms, where theory of mind comes packaged with many other abilities that robots currently lack, that we reflexively dismiss the robot.
[1] https://en.wikipedia.org/wiki/Sally%E2%80%93Anne_test