The Physical World
The use of the word ‘reason’ is really a stretch. The dictionary definition, “think, understand, and form judgments by a process of logic,” fits on the ‘logic’ part but falls far short on ‘think’ and ‘understand’. There is no understanding, just the ability to use the relationships learned during training to come up with the most logical answer. This becomes very apparent when it comes to physics, since much of physics requires physical reasoning, and it seems that big models have considerable difficulty correctly solving physics problems. A group of researchers at the University of Michigan, the University of Toronto, and the University of Hong Kong decided to create a set of 3,000 physics questions to see if models were up to the task, even though the same models were able to solve Olympiad mathematics problems with human-level accuracy on standard benchmarking platforms.
The researchers used 6 physics domains: mechanics, electromagnetism, thermodynamics, waves/acoustics, optics, and modern physics, and before we go further, we were quickly humbled upon seeing even the simplest of the 3,000 questions. That said, physical problem-solving fundamentally differs from pure mathematical reasoning or science knowledge question answering: it requires models to decode implicit conditions in the questions (e.g., interpreting “smooth surface” as a coefficient of friction equal to zero) and to maintain physical consistency, since the laws of physics don’t change with different reasoning pathways. Physics also demands a kind of visual perception that does not appear in mathematics, and that presents a challenge for large models. The new benchmark the researchers developed is not only 50% open-ended questions; it also has 3,000 unique images that the model must decipher.
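To make the “implicit condition” point concrete, here is a hypothetical incline problem (our own illustration, not one of the benchmark questions) showing how the single word “smooth” silently removes the friction term from the equation of motion:

```latex
% Hypothetical illustration (not a benchmark question): a block of mass m
% slides down an incline of angle \theta. The phrase "smooth surface"
% implicitly sets the coefficient of friction \mu to zero.
\begin{align*}
  ma &= mg\sin\theta - \mu m g\cos\theta
      && \text{Newton's second law along the incline}\\
  a  &= g\left(\sin\theta - \mu\cos\theta\right)
      && \text{divide through by } m\\
  a  &= g\sin\theta
      && \text{``smooth surface''} \;\Rightarrow\; \mu = 0
\end{align*}
```

A model that misses that mapping carries a nonexistent friction force through the rest of its derivation, which is exactly the kind of implicit-condition failure the benchmark is probing.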
We cannot say exactly why the models did not fare well, but we can give some insight into the types of errors that were found:
- Visual Reasoning Errors (39.6%) – An inability of the model to correctly extract visual information.
- Text Reasoning Errors (13.5%) – Incorrect processing or interpretation of textual content.
- Lack of Knowledge (38.5%) – Incomplete understanding of specific domain knowledge.
- Calculation Errors (8.3%) – Mistakes in arithmetic operations or unit conversions.
All in all, typical benchmarks overlook physical reasoning, which requires integrating domain knowledge with real-world constraints, a difficult task for models that don’t live in the real world. Relying on memorized information, superficial visual patterns, and mathematical formulas does not generate real understanding. The researchers note that while schematics and textbook-style illustrations might be suitable for evaluating conceptual reasoning, they might not capture the complexity of perception in natural environments. You have to live it to understand it.