Current models are still far from human-level physical reasoning (paper below). Upcoming models trained on world simulation will probably do much better.
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
This is more of a physics math aptitude test. You can already see that the best math model is about halfway to saturating it. A high score might not indicate usefulness in actual physical reasoning, or at the very least, that reading seems like a bit of a stretch.
https://phybench-official.github.io/phybench-demo/